A Memory-efficient Implementation of Perfectly Matched Layer with Smoothly-varying Coefficients in Discontinuous Galerkin Time-Domain Method

Wrapping a computation domain with a perfectly matched layer (PML) is one of the most effective methods of imitating/approximating the radiation boundary condition in Maxwell and wave equation solvers. Many PML implementations use a smoothly-increasing attenuation coefficient to increase the absorption for a given layer thickness and, at the same time, to reduce the numerical reflection from the interface between the computation domain and the PML. In discontinuous Galerkin time-domain (DGTD) methods, using a PML coefficient that varies within a mesh element requires a different mass matrix to be stored for every element and therefore significantly increases the memory footprint. In this work, this bottleneck is addressed by applying a weight-adjusted approximation to these mass matrices. The resulting DGTD scheme has the same advantages as the scheme that stores individual mass matrices, namely higher accuracy (due to reduced numerical reflection) and increased meshing flexibility (since the PML does not have to be defined layer by layer), but it requires significantly less memory.

In DGTD, the PML conductivity profile can be implemented in two different ways. The first method assumes that the conductivity is constant in a given element. This constant might, for example, be set to the value of the conductivity profile at the center of the element. Implementation of this method is rather straightforward since the mass matrices of different elements only differ by a constant (for linear elements) [8]. However, the conductivity profile becomes discontinuous between neighboring elements [Fig. 1(a)], and the element surfaces that are not parallel to the PML interface lead to large reflections and destroy the high-order accuracy of the solution. One workaround is to build layered (tetrahedral) meshes [Fig. 1(b)] or use orthogonal (hybrid) meshes inside the PML [7], [14], and accordingly set a layered conductivity profile. But this makes the setup of the computation domain and the PML rather tedious, since one needs to control the mesh and the conductivity on all face, edge, and corner regions of the PML. Moreover, to reduce the numerical reflection by decreasing the conductivity discontinuity between neighboring mesh (or conductivity) layers, their number has to be increased.
The second method allows the conductivity to vary inside a given element (following the increasing conductivity profile inside the PML). It should be noted here that (higher-order) DGTD allows for sampling of the material properties at the sub-elemental level. In this case, the behavior of the PML is determined only by the conductivity samples within the elements and therefore the mesh interfaces can be aligned arbitrarily [Fig. 1(c)] without adversely affecting its performance. Indeed, this second approach can improve the PML performance [15]-[17].
However, the conductivity varying within the elements results in an element-dependent mass matrix. Therefore, a direct implementation requires a different mass matrix (or its inverse) to be stored for every element [7], [9], [14]-[16], [18]. This significantly increases DGTD's memory footprint. For example, for the stretched-coordinate (SC)-PML [9], the memory required to store the mass matrices scales with 15 K_PML × N_p^2, where N_p is the number of interpolating nodes in each element and K_PML is the number of elements in the PML. In contrast, for the first method, where the conductivity is assumed constant in a given element, this memory requirement scales with K_PML since only the constant conductivity of each element and a single reference mass matrix are stored.
In this work, a memory-efficient method to implement the SC-PML with smoothly-varying attenuation coefficients in DGTD is developed. The proposed method allows the conductivity to vary inside the elements and constructs the resulting local mass matrices using a weight-adjusted approximation (WAA) [19]. Compared to the direct implementation that is briefly described above, the proposed method reduces the memory requirement to 15 K_PML × N_q, where N_q ∼ N_p, while maintaining the PML performance. It should be noted here that the WAA has been proven to be energy-stable and to preserve the high-order convergence of DGTD [19]-[21]. Indeed, numerical examples presented here also show that the proposed method maintains the higher-order accuracy of the solution. Additionally, it is demonstrated that the PML with a smoothly-increasing conductivity profile as implemented with the proposed method performs better than the PML implemented using an element-wise constant conductivity profile.

A. WAA-DGTD for SC-PML
Maxwell equations in stretched coordinates for a source-free and lossless medium can be expressed as [2]

−jωµH = ∇_s × E (1)
jωεE = ∇_s × H (2)

where E and H are the electric and magnetic fields, ε and µ are the permittivity and the permeability, ω is the angular frequency, and ∇_s is the nabla operator with the spatial derivative along direction u scaled by 1/s_u. The coordinate-stretching variables s_u, u ∈ {x, y, z}, are defined as [2], [9], [13]

s_u = κ_u + σ_u / (jωε_0) (3)

where κ_u and σ_u are one-dimensional positive real functions along direction u. Here, σ_u is the attenuation coefficient that ensures absorption inside the SC-PML and κ_u changes the propagation speed (and also increases the absorption for evanescent waves). It is well known that using smoothly-increasing σ_u and κ_u reduces the numerical reflection from the interface between the PML and the computation domain while maintaining a high absorption rate, and therefore improves the PML performance [11]-[13].
The time-domain update equations in the SC-PML can be expressed as [9]

ä · µ∂_t H = −∇ × E − b̈ · µH − c̈ · µP_H (5)
ä · ε∂_t E = ∇ × H − b̈ · εE − c̈ · εP_E (6)
∂_t P_H = d̈ · H − κ̈^{-1} · P_H (7)
∂_t P_E = d̈ · E − κ̈^{-1} · P_E (8)

where P_E and P_H are auxiliary variables introduced to avoid computationally costly temporal convolutions while converting (1) and (2) into the time domain [9], and ä, b̈, c̈, d̈, and κ̈ are diagonal tensors with entries defined in (9). In (9) and the rest of the text, (u, v, w) follows the permutation (x, y, z) → (y, z, x) → (z, x, y).
Following the standard nodal discontinuous Galerkin method [8], the computation domain and the PML are discretized into K elements with volumetric support Ω_k and boundary surface ∂Ω_k, and in each element E, H, P_E, and P_H are expanded using the Lagrange polynomials ℓ_i(r) [8], i = 1, ..., N_p, where N_p = (p+1)(p+2)(p+3)/6 is the number of interpolating nodes and p is the order of the Lagrange polynomials. Finally, Galerkin testing yields the semi-discrete system of equations

µ_k M̄^a_k ∂_t H̄_k = −C̄_k(Ē_k, Ē_k′, H̄_k, H̄_k′) − µ_k M̄^b_k H̄_k − µ_k M̄^c_k P̄^H_k (10)
ε_k M̄^a_k ∂_t Ē_k = C̄_k(H̄_k, H̄_k′, Ē_k, Ē_k′) − ε_k M̄^b_k Ē_k − ε_k M̄^c_k P̄^E_k (11)
M̄_k ∂_t P̄^H_k = M̄^d_k H̄_k − M̄^{1/κ}_k P̄^H_k (12)
M̄_k ∂_t P̄^E_k = M̄^d_k Ē_k − M̄^{1/κ}_k P̄^E_k (13)

Here, H̄_k, Ē_k, P̄^H_k, and P̄^E_k are vectors storing the unknown time-dependent coefficients of the relevant basis functions, and M̄_k and M̄^α_k, α ∈ {a, b, c, d, 1/κ}, are the mass matrices with entries

M̄_k(i, j) = ∫_{Ω_k} ℓ_i(r) ℓ_j(r) dr (14)
M̄^{α,u}_k(i, j) = ∫_{Ω_k} α_uu(r) ℓ_i(r) ℓ_j(r) dr (15)

where, for brevity, the direction index u of M̄^{α,u}_k (acting on the u component of the field vectors) is dropped in (10)-(13). C̄_k(f̄_k, f̄_k′, ḡ_k, ḡ_k′) denotes the discrete curl operator, which is constructed from the stiffness matrices S̄_k and the face mass matrices F̄_k [8] and in general involves unknowns from the current element k and its neighboring elements k′ through the numerical flux [8], [22], [23]. ε_k and µ_k are the permittivity and the permeability (assumed constant) in each element. Note that the nodal DG framework [8] is used here, but the proposed method can be easily extended to vector DG methods [6], [7], [9], [24]-[26].
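As an illustration of how the direct implementation advances the auxiliary-variable update (12) for a single element, the sketch below uses hypothetical random stand-ins for the precomputed per-element products M̄_k^{-1}M̄^d_k and M̄_k^{-1}M̄^{1/κ}_k (this is not the paper's code, and a forward Euler step stands in for the actual time integrator):

```python
import numpy as np

Np = 20                                   # nodes per element, e.g., p = 2 tetrahedron
rng = np.random.default_rng(0)

# Direct implementation: per-element matrix products are precomputed and stored.
# These random matrices are hypothetical stand-ins, not actual DG operators.
MinvMd = rng.standard_normal((Np, Np))    # stands in for M_k^{-1} M_k^d
MinvMkap = rng.standard_normal((Np, Np))  # stands in for M_k^{-1} M_k^{1/kappa}

H = rng.standard_normal(Np)               # nodal coefficients of one field component
P_H = np.zeros(Np)                        # auxiliary variable of (12), initially zero
dt = 1e-3

# One forward-Euler step of dP_H/dt = M^{-1} (M^d H - M^{1/kappa} P_H);
# in practice a higher-order (e.g., Runge-Kutta) integrator is used.
P_H = P_H + dt * (MinvMd @ H - MinvMkap @ P_H)
```

The point of contention in the following sections is exactly these stored per-element matrices: with a spatially varying coefficient they are different for every element.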
For linear elements, the mass matrices M̄_k in (14) are simply scaled versions of the mass matrix M̄ defined on the reference element, M̄_k = J_k M̄, where J_k is the Jacobian of the coordinate transformation between element k and the reference element. Hence, only M̄ and the (scalar constant) J_k are stored. Similarly, in (15), if α_uu(r) is assumed constant inside the elements, i.e., α_uu(r) = α^k_uu, then M̄^{α,u}_k = α^k_uu M̄_k = α^k_uu J_k M̄ and one only needs to store α^k_uu in addition to M̄ and J_k. In this case, (10)-(13) can be implemented as efficiently as the case without the PML [8], [27], [28].
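As a concrete illustration of this reference-element reuse, the sketch below (a hypothetical 1D numpy setup, not the paper's code) stores a single reference mass matrix for linear Lagrange basis functions and reconstructs each element's mass matrix from its scalar Jacobian alone:

```python
import numpy as np

# Reference mass matrix for linear (p = 1) Lagrange basis on [-1, 1]:
# M_ref[i, j] = integral of l_i(x) l_j(x) dx = [[2/3, 1/3], [1/3, 2/3]]
M_ref = np.array([[2.0, 1.0], [1.0, 2.0]]) / 3.0

# Affine elements are fully described by their Jacobians (h_k / 2 in 1D),
# so one scalar per element is stored instead of a full matrix.
element_sizes = np.array([0.1, 0.25, 0.4])   # hypothetical element lengths h_k
jacobians = element_sizes / 2.0

def mass_matrix(k):
    """Mass matrix of element k, rebuilt on the fly: M_k = J_k * M_ref."""
    return jacobians[k] * M_ref

# Sanity check against direct integration on element 1 (length 0.25):
# the first diagonal entry is h / 3 = 0.25 / 3.
assert np.isclose(mass_matrix(1)[0, 0], 0.25 / 3.0)
```

This is exactly the storage pattern that breaks down once the coefficient α_uu(r) varies inside the element, since the weighted mass matrices then stop being scalar multiples of M̄.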
However, if α_uu(r) is allowed to vary inside the elements, M̄^{α,u}_k are different in different elements and in general there is no simple relationship between these mass matrices; every one of them (or its inverse) has to be stored. As an alternative, the mass matrices can be recomputed at each time step, but this would significantly increase the cost of time marching [8]. The memory required to store the mass matrices M̄^{α,u}_k in (10)-(13) scales with 3 × 5 × N_p^2 per element, where 3 is the number of the (x, y, z) components of the vector fields and 5 is the number of coefficients a(r), b(r), c(r), d(r), and 1/κ(r). Note that this memory requirement is significantly higher than that of storing the unknown coefficients of the basis functions, which scales with 12 × N_p in the PML.
To reduce the memory requirement of implementing (15) with α_uu(r) allowed to vary inside the elements, the WAA [19] is used. It has been shown that with this approximation DGTD retains provable energy-stability and high-order accuracy [19]-[21]. Note that in the above SC-PML formulation, directly multiplying (5) and (6) with ä^{-1} on both sides reduces the number of element-dependent mass matrices to 4. But this would result in a nonconservative form, whose solution is neither provably energy-stable nor provably high-order accurate [8], [19].
First, a weight-adjusted inner product is introduced to approximate the parameter-weighted inner product in the expression of the mass matrix [19]. The mass matrix M̄^α_k, which is associated with element k and a locally varying coefficient α(r), is approximated as

M̄^α_k ≈ M̄_k (M̄^{1/α}_k)^{-1} M̄_k

Since (M̄^α_k)^{-1} is used in (10)-(13) (for α = a), one needs to calculate

(M̄^α_k)^{-1} ≈ M̄_k^{-1} M̄^{1/α}_k M̄_k^{-1}

Under the nodal DG framework [8], the weighted mass matrix is evaluated using a quadrature rule,

M̄^{1/α}_k ≈ J_k V̄_q^T W̄ Ā^{-1} V̄_q

where r_q, q = 1, ..., N_q, are the Gaussian quadrature nodes corresponding to the quadrature rules of degree 2p + 1, w_q are the corresponding weights collected in W̄ = diag(w_1, ..., w_{N_q}), and Ā = diag[α(r_1), ..., α(r_{N_q})] is a diagonal matrix containing the coefficient evaluated at the quadrature nodes. Hence,

(M̄^α_k)^{-1} ≈ J_k^{-1} M̄^{-1} V̄_q^T W̄ Ā^{-1} V̄_q M̄^{-1} = J_k^{-1} P̄_q Ā^{-1} V̄_q M̄^{-1}

where V̄_q is an interpolation matrix defined on the reference element, V̄_q = V̄_I V̄^{-1}, V̄_I and V̄ are generalized Vandermonde matrices evaluated at the quadrature and interpolation nodes, respectively, and P̄_q = M̄^{-1} V̄_q^T W̄ is the corresponding projection matrix. Consequently, only V̄_q and P̄_q (defined once on the reference element) and the coefficient samples α(r_q) of each element need to be stored.
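To make the approximation concrete, the following sketch (a hypothetical 1D numpy experiment on the reference interval, not the paper's implementation) builds V_q and P_q and checks that M^{-1} M^{1/α} M^{-1} reproduces (M^α)^{-1} exactly for a constant coefficient and closely for a smoothly varying one:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

p = 4                                            # basis order
nodes = -np.cos(np.pi * np.arange(p + 1) / p)    # interpolation nodes on [-1, 1]

def vander(x):
    return np.vander(x, p + 1, increasing=True)  # monomial Vandermonde matrix

xq, wq = leggauss(p + 3)                         # quadrature exact beyond degree 2p + 1
Vq = vander(xq) @ np.linalg.inv(vander(nodes))   # nodal values -> values at quadrature nodes
M = Vq.T @ (wq[:, None] * Vq)                    # reference mass matrix
Pq = np.linalg.solve(M, Vq.T * wq)               # projection matrix P_q = M^{-1} V_q^T W

def wadg_inverse(alpha_q):
    """Weight-adjusted approximation of (M^alpha)^{-1} from samples alpha(r_q)."""
    return Pq @ ((1.0 / alpha_q)[:, None] * Vq) @ np.linalg.inv(M)

def exact_inverse(alpha_q):
    """Direct inverse of the alpha-weighted mass matrix, for comparison."""
    return np.linalg.inv(Vq.T @ ((wq * alpha_q)[:, None] * Vq))

# Constant coefficient: the weight-adjusted approximation is exact.
a_const = np.full(len(xq), 2.5)
assert np.allclose(wadg_inverse(a_const), exact_inverse(a_const))

# Smoothly varying coefficient (like a PML conductivity profile): small error.
a_smooth = 1.0 + 0.3 * xq
err = np.linalg.norm(wadg_inverse(a_smooth) - exact_inverse(a_smooth)) / \
      np.linalg.norm(exact_inverse(a_smooth))
assert err < 0.05
```

Note that only `Vq`, `Pq`, and the per-element samples `alpha_q` would need to be stored, which is the source of the memory savings quantified below.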

B. Computational complexity
In DGTD with explicit time marching, all operations are localized within the elements. The memory required to store the mass matrices in the direct implementation of (10)-(13) scales with K_PML × 15N_p^2, where 15 comes from the number of unknown components times the number of different mass matrices associated with different coefficients and K_PML is the number of elements in the PML. In the WAA formulation (26)-(29), the memory requirement reduces to (K_PML × 15N_q) + 2N_pN_q, where 15N_q comes from the number of unknown components times the number of coefficient samples at the quadrature points and 2N_pN_q comes from V̄_q and P̄_q defined on the reference element. For simplicial quadrature rules that are exact for polynomials of degree up to 2p + 1, N_q is comparable to N_p [32].
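This scaling can be tabulated with a short script (the element count K_PML is hypothetical; the N_q values are those quoted later in the text for p = 1, ..., 5). The printed ratio counts only PML mass-matrix storage, so it is larger than the whole-solver memory ratios reported in the numerical results:

```python
# Compare the PML mass-matrix storage (in matrix/vector entries) of the
# direct and WAA schemes, following the scalings given in the text.
K_PML = 10_000                                    # hypothetical number of PML elements

def Np(p):
    """Number of interpolation nodes of an order-p tetrahedral element."""
    return (p + 1) * (p + 2) * (p + 3) // 6

Nq = {1: 4, 2: 11, 3: 23, 4: 44, 5: 74}           # quadrature node counts quoted in the text

for p in range(1, 6):
    direct = K_PML * 15 * Np(p) ** 2              # one mass matrix per coefficient/component/element
    waa = K_PML * 15 * Nq[p] + 2 * Np(p) * Nq[p]  # coefficient samples + shared V_q and P_q
    print(f"p={p}: direct={direct:>13,} entries, WAA={waa:>11,} entries, "
          f"ratio={direct / waa:.1f}x")
```

The gap widens rapidly with p because the direct scheme stores N_p × N_p matrices while the WAA scheme stores only N_q scalars per coefficient and component.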
To compare the number of arithmetic operations required by the two implementations, one should first note that the curl operator C̄ is the same in both formulations. Computation of C̄ requires the evaluation of the spatial derivatives and the numerical flux [33]-[35]. Here, the memory access time is much more significant than the time required to carry out these computations because data from neighboring elements, which is discontinuous in memory, is required. Therefore, only the times required to complete the arithmetic operations of the remaining terms are compared.
For the same reason, in practice, the time required to compute C̄ dominates the overall time required by the time marching, and the difference in the numbers of arithmetic operations estimated below for the remaining terms is less significant (see the example in Section III).
In (10), the three matrix-vector multiplications and two vector-vector additions require 3N_p^2 multiplication operations and 2N_p addition operations, respectively. In (26), the multiplication of V̄_q with a vector of length N_p and the multiplication of P̄_q with a vector of length N_q each require N_qN_p multiplication operations, and the multiplication of a diagonal matrix (containing the sampled coefficients) with a vector requires N_q multiplication operations. As a result, excluding the computation of C̄, (26) requires 5N_qN_p + 3N_q multiplications and N_q + N_p additions. For the auxiliary variables, the cost of (12) is 3N_p^2 multiplications and N_p subtractions, while (28) requires 3N_qN_p + 2N_q multiplications and N_q subtractions. The number of operations in the WAA implementation is therefore slightly higher than that in the direct implementation. But, as mentioned above, the time required by these operations is smaller than the time required to compute C̄, and therefore the overall times required by the two implementations are not that different.

III. NUMERICAL EXAMPLES

In the numerical examples, four configurations/implementations are compared: (i) element-wise constant conductivity on a paved (unstructured) mesh (EC-paved), (ii) layer-wise constant conductivity on a layered mesh (EC-layered), (iii) smoothly-varying conductivity implemented directly (SV-paved), and (iv) smoothly-varying conductivity implemented with the WAA (SV-WAA-paved). For configuration (i), the constant values in a given element are obtained by sampling σ_u and κ_u, u ∈ {x, y, z}, at that element's node that is farthest away from the PML interface (along the ±u-direction). For configuration (ii), to ensure that the element surfaces are strictly parallel to the axes, the PML mesh is built layer by layer and the constant values in a given layer are obtained by sampling σ_u and κ_u, u ∈ {x, y, z}, at the outermost surface of that layer (along the ±u-direction). For the WAA in implementation (iv), the order of the Gaussian quadrature rule is 2p, resulting in N_q ∈ {4, 11, 23, 44, 74} [32].
In all examples, the background medium is free space and the excitation is a plane wave. The time signature of the excitation is chosen to ensure that the reflected field is well-separated from the incident one, and therefore the reflection from the PML is simply measured by the peak value of the reflected field's amplitude. The conductivity profile is described by

σ_z(z) = σ_max [(|z| − |z_0|)/L_z]^{p_σ}

where z_0 is the z-coordinate of the interface between the PML and the computation domain, L_z is the thickness of the PML, and p_σ is the order of the profile. Note that σ_z(z) is nonzero only when |z| > |z_0|. The values of these parameters are z_0 = ±30 cm, L_z = 1.6 cm, and p_σ = 1, and also κ_z(z) = 1 both inside the PML and the computation domain.
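A short sketch of this profile (numpy; the peak conductivity σ_max is a hypothetical value, since in the experiments it is found by scanning) shows how σ_z would be sampled, e.g., at element interpolation or quadrature nodes:

```python
import numpy as np

z0, Lz, p_sigma = 0.30, 0.016, 1      # interface position (m), PML thickness (m), profile order
sigma_max = 50.0                      # hypothetical peak conductivity (scanned in the experiments)

def sigma_z(z):
    """Polynomial conductivity profile: zero in the computation domain, rising inside the PML."""
    depth = np.maximum(np.abs(z) - z0, 0.0) / Lz   # normalized penetration depth into the PML
    return sigma_max * depth ** p_sigma

z = np.array([0.0, 0.29, 0.30, 0.308, 0.316, -0.316])
print(sigma_z(z))   # zero inside the domain, rising to sigma_max at the outer PML boundary
```

Sampling this function per element (at a single node, per layer, or at all quadrature nodes) is precisely what distinguishes the EC, layered, and SV configurations compared below.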
In this example, four configurations/implementations are considered: EC-paved, EC-layered, SV-paved, and SV-WAA-paved. Their performances are compared for p ∈ {2, 3, 4, 5}. For all four groups of simulations, σ max is scanned to find the minimum reflection that can be obtained for each case. Fig. 2 shows that with increasing σ max , the reflection first decreases exponentially and then increases gradually. This is observed for all configurations/implementations and all values of p. When σ max is small, the overall reflection is dominated by the reflection from the PEC boundary simply because the absorption inside the PML is not high enough. Therefore, in this regime, increasing σ max elevates the absorption and reduces the amplitude of the wave reflected back into the computation domain exponentially. The numerical reflection (which is smaller than the reflection from the PEC boundary for small σ max ) increases with increasing σ max [11], and starts dominating the overall reflection as demonstrated in the figure by the gradual increase after the minimum point.
For the EC-paved configuration, the reflection stays at a high level and does not decrease with increasing p. This is because of the large reflections from the internal element surfaces that are not aligned with the PML interface. For the EC-layered, SV-paved, and SV-WAA-paved configurations, high-order convergence is observed, i.e., the reflection keeps decreasing exponentially with increasing p. Moreover, the reflection for the SV-paved and SV-WAA-paved configurations is about 15 dB smaller than that for the EC-layered configuration. Note that this higher accuracy comes with the ease of meshing since a layered mesh (and conductivity profile) is not needed. Finally, Fig. 2 also shows that the SV-WAA-paved implementation performs exactly the same as the SV-paved direct implementation, which verifies the accuracy of the proposed method.
Next, scattering from a PEC sphere of radius 1 cm is considered. Because the distance between the sphere surface and the PML is short, possibly-evanescent scattered waves enter the PML at high grazing angles. A varying κ_u, u ∈ {x, y, z}, profile is employed to help with the absorption of these evanescent waves [12], [13]: κ_u(u) = 1 + (κ_max − 1)[(u − u_0)/L_u]^{p_σ} with κ_max = 2. Note that inside the computation domain, κ_u(u) = 1. In this example, using the PEC or the first-order absorbing boundary condition [15] on the outer boundary of the PML gives similar results; the results presented here are obtained with the PEC boundary condition. σ_max is scanned to find the minimum reflection that can be reached for each case. Note that in this example "reflection" is defined as the peak value of the absolute difference between the fields computed at a probe point and the corresponding reference fields computed at the same point. The reference fields are obtained using enlarged computation domains, where the mesh and the material properties in the overlapping region are kept the same as those in the actual computation domains, and the solutions are obtained using the same value of p. Fig. 4 also shows that the SV-WAA-paved implementation performs exactly the same as the SV-paved direct implementation, which means the error caused by the WAA of the mass matrices is below the level of the discretization error. Table I compares the computational cost of the SV-paved and SV-WAA-paved implementations. With increasing p, the memory requirement increases dramatically for the SV-paved direct implementation but only modestly for the SV-WAA-paved implementation. For p = 5, the memory requirement of the SV-paved implementation is 12.4 times that of the SV-WAA-paved implementation. The computation time required per time step by the SV-WAA-paved implementation is slightly larger than that required by the SV-paved implementation due to the increased number of arithmetic operations (see Section II-B).
It should also be noted here that, in practice, a DGTD algorithm is usually parallelized. The difference in the times required for updating different elements is relatively small and can easily be compensated for by allocating fewer elements to the MPI processes that contain PML elements. In the numerical results presented here, assigning a weight of 2 to PML elements in ParMetis [36], [37] yields good load balance.

IV. CONCLUSION
A PML implementation that allows the attenuation coefficient to vary inside the discretization elements yields a smaller numerical reflection from the interface between the PML and the computation domain and significantly simplifies the meshing process. However, these advantages come at the cost of an increased memory footprint since a different mass matrix has to be stored for every discretization element. In this work, this memory requirement is reduced by applying the WAA to the mass matrices without abandoning the advantages listed above. Indeed, numerical results demonstrate that the PML with a smoothly-increasing conductivity profile as implemented with the proposed method performs better than the PML implemented using an element-wise constant conductivity profile, and that the higher-order accuracy of the solution is maintained.
The proposed method is especially useful for simulations running on shared-memory systems, where the high memory requirement of smoothly-varying PMLs could be a bottleneck. For simulations running on distributed-memory systems, the memory requirement of a single computing node is also reduced, and better load balance can be reached with a slightly adjusted weight in the domain partitioning.