Using Open Source to Accelerate Development of a Finite Element-Boundary Integral Code

Open-source software has been highly influential in software development in many fields, and it also has a history within computational electromagnetics. With large amounts of open-source code available, both from within computational electromagnetics and from other fields, new codes can be created by combining already existing packages. This can be especially beneficial to developers who do not otherwise have access to a substantial codebase. In this article we describe how a finite element-boundary integral code using the adaptive cross approximation was developed by combining different existing open-source software packages with new code in Python. We provide a brief overview of the numerical methods used, but our focus is on the implementation and on insights that might be useful to others who could benefit from using open-source software in their work. Three numerical examples are also presented to demonstrate accuracy, performance and the use of complex materials. Our code is provided at github.com/nwingren/fe2ms both to demonstrate how the open-source packages were combined in practice and for those who wish to test the code themselves.


I. INTRODUCTION
Open source is a term used in software development for making the source code of a program freely available for inspection, modification and redistribution. In computational electromagnetics (CEM), this has been done at least since early versions of the numerical electromagnetics code (NEC) developed in the 1970s, which was particularly important for development related to the method of moments (MoM) [1], [2]. Today there are many other open-source codes suitable for electromagnetics, such as the finite-difference time-domain (FDTD) codes Meep [3] and gprMax [4] or the finite element method (FEM) codes Elmer [5] and FEniCSx [6], [7], [8], [9], [10]. Much more code development happens outside of CEM, though, and utilizing these resources can be a great help when working with open-source software. Although some parts of a code are specific to CEM, there are more generic parts which can benefit from combining code from other fields.
Combining code from different sources can prove challenging at times, especially if the codes are written in different programming languages. Traditionally in computational applications, fast compiled languages like Fortran or C++ have been used; however, they suffer from low flexibility and time-consuming development compared to languages like Python or Julia [11], [12]. While Julia is in many regards superior to Python [12], it still lacks the sheer number of native packages that Python has available. This can be exemplified by comparing the Python Package Index, with about 500 000 packages [13], to the much smaller number of about 10 000 registered packages for Julia [14] (as of December 2023). Many highly developed computational packages also have native Python interfaces, like the aforementioned Meep, gprMax and FEniCSx. While the nominal performance of Python is quite poor compared to many other languages, it can be improved by using packages like NumPy [15] and SciPy [16] with fast underlying code. More recently, performance gains have also become possible through just-in-time (JIT) compilation using tools like Numba [17].
One use case which can benefit from a combination of different computational codes is the development of a hybrid code, as there may already exist good implementations of the different constituent parts that have not yet been combined. The finite element-boundary integral (FE-BI) method is a hybrid method combining FEM and MoM. Its origins are from outside electromagnetics [18], [19], [20], but there is a long history within the field as well [21], [22], [23]. From an FEM point of view, it can be seen as an exact boundary condition for open boundaries, playing a role similar to absorbing boundary conditions like perfectly matched layers (PML) [24]. In certain situations, the hybrid method can outperform other open boundary methods in FEM. An example could be multiple reflecting geometries or widely separated scatterers, where, for example, enclosing the computational region with a PML would require discretization of large regions of free space.
Pure MoM is widely used, and its surface formulations perform particularly well for problems involving perfect electric conductors (PEC) or homogeneous dielectric structures [25], [26], [27]. However, more expensive volumetric formulations are typically required for problems containing inhomogeneous dielectrics [28], or more complex media with anisotropy [29], [30] or magnetoelectric interaction [31], [32]. With the FE-BI method, all these cases can be handled with ease [33]. One more notable detail about MoM is the need for methods that reduce the required memory and accelerate the solution process as the problem size grows. Many different methods exist, such as the multilevel fast multipole algorithm (MLFMA) [34], [35], the adaptive cross approximation (ACA) [36], [37] and the adaptive integral method (AIM) [38]. The same need for acceleration methods is present for the FE-BI hybrid method, as the BI part dominates with regard to memory and computational complexity, and many different schemes have been applied [39], [40], [41]. This is a natural consequence of the dense matrices arising from integral equations in MoM, as opposed to the sparse matrices arising from differential equations in FEM. Recent work related to the FE-BI method has focused on, among other topics, domain decomposition [42], [43], [44], H-matrix methods for fast direct solution [41] and preconditioning [45], [46].
The purpose of this article is to describe our implementation of the FE-BI method, done by combining many different open-source packages, in a way that can help readers who might be interested in using the same approach in their own work. We describe the method briefly, but the focus is primarily on the development process and what we have learned from it. Some results showing the capability of the code are also included, to demonstrate that reasonable performance is achieved by using a multilevel ACA and that the code can be used for general problems involving linear bianisotropic media. The code is written primarily in Python and is publicly available at github.com/nwingren/fe2ms under the GNU General Public License, version 3. The code is mainly based on the FEniCSx package for FEM computations, but the article also describes how many other packages, as well as new code, are combined into the presented FE-BI code. The article is structured as follows. First, in Section II the FE-BI method is introduced briefly, as well as the multilevel ACA. In Section III the development of the code is described, with emphasis on how it was done and on insights others might benefit from. In Section IV the different components of the code are described. In Section V three different scattering problems are presented, and the results demonstrate the capabilities and accuracy of the code. Finally, the article is concluded in Section VI.
II. FORMULATION

A. THE FE-BI METHOD

To introduce the mathematical formulation of the FE-BI method, a general scattering problem geometry is shown in Fig. 1. A volume Ω with boundary ∂Ω is considered. Within Ω, the problem is described by the finite element method. On ∂Ω, the boundary integral method is used as a boundary condition which describes the physics of the open boundary without approximation.
The FE formulation depends on the constitutive relations of the problem media. In this work, we distinguish between general linear magnetoelectric media,

D = ε0 (εr · E + η0 ξ · H),   B = μ0 (μr · H + η0⁻¹ ζ · E),

and more simple linear media,

D = ε0 εr · E,   B = μ0 μr · H,

where ε0, μ0 and η0 are the free-space permittivity, permeability and wave impedance, respectively, whereas εr, μr, ξ and ζ are unitless tensor quantities describing the medium. The distinction carries over into the governing equations used in the FE formulation. These are

∇ × [μr⁻¹ · (∇ × E + jk0 ζ · E)] − jk0 ξ · μr⁻¹ · (∇ × E + jk0 ζ · E) − k0² εr · E = 0

for general linear magnetoelectric media [33] and

∇ × (μr⁻¹ · ∇ × E) − k0² εr · E = 0

for more simple linear media [24]. By enforcing a boundary condition on ∂Ω, the FE system matrix entries can be derived. For more simple linear media they are [24]

K^XY_mn = ∫_Ω [(∇ × N^X_m) · μr⁻¹ · (∇ × N^Y_n) − k0² N^X_m · εr · N^Y_n] dV,

with corresponding entries for general linear magnetoelectric media [33]; in both cases there is also a matrix B coupling the boundary magnetic field into the FE equations [24]. N are curl-conforming basis functions and X, Y ∈ {I, S}, where I and S indicate that the quantity is connected to mesh edges in the interior of Ω and on the boundary ∂Ω, respectively. The full FE-BI system is written as [24]

[ K^II  K^IS  0 ] [ E^I ]   [ 0   ]
[ K^SI  K^SS  B ] [ E^S ] = [ 0   ]
[ 0     P     Q ] [ H^S ]   [ b^E ]

where P, Q and b^E are determined by connecting E and H on ∂Ω using an integral equation. This is a partly sparse, partly dense system due to the inherent characteristics of the FE and BI parts, respectively, as illustrated in Fig. 2.
With this system, it is assumed that the problem is a pure scattering problem, as the right-hand side would otherwise have additional non-zero entries. Interior sources can be added to model, for instance, a transmitting antenna. Different integral formulations can be used to obtain P, Q and b^E, and they have different benefits and drawbacks. An easily implementable formulation is the electric field integral equation (EFIE) using the TE formulation from [39], giving P and Q as in (12) and (13) and the right-hand side b^E, where f are divergence-conforming basis functions, K and L are integral operators connecting magnetic currents to electric fields and electric currents to electric fields, respectively, and E^inc is the incident electric field. The tilde on the K operator indicates that the integral operator is computed in a principal value sense due to a singularity extraction. One drawback of EFIE-based systems is that they are susceptible to interior resonances, which can result in inaccurate solutions at certain frequencies as well as poor performance for iterative solvers in a band around those frequencies [39]. This can be remedied using other formulations based on the combined field integral equation (CFIE), though these typically require different test functions than shown in (12)-(13) to be fully effective. There is a possibility to reduce some effects of interior resonances through the TETH (meaning tangential testing of E and H) CFIE formulation from [39], which only uses linear combinations of (12) and (13). This does not eliminate interior resonances, but it typically improves the performance of iterative solvers at frequencies near a resonance. Another way to eliminate the interior resonance problem is to use one of the stationary FE-BI formulations described in [47], which also results in a symmetric system matrix.

B. THE MULTILEVEL ADAPTIVE CROSS APPROXIMATION
To improve the performance of the method, the dense matrices P and Q from the BI part are assembled in an H-matrix form using a multilevel adaptive cross approximation (ACA). The direct effect of this is that representations of the matrices can be stored using much less memory than storing them in full. Furthermore, if the compression achieved using this representation is significant, matrix-vector multiplications can be computed with fewer operations than would be needed using a full matrix. The result of using the multilevel ACA is thus reduced memory usage as well as accelerated iterative methods, which are based on matrix-vector multiplications. The ACA is only one such acceleration method among many, but one advantage it has is that the algorithm itself is purely algebraic. This means that the same implementation of the ACA can be used for many different kernels without modification, and experimentation with different formulations in the BI part of the method is therefore easy. With that said, it is good to keep in mind that the performance of the method depends on the problem it is applied to and the type of clustering that is used. The ACA has been thoroughly described in the literature [36], [37], [48] and will only be presented briefly here. The main point of the algorithm is compression of rank-deficient matrices. In general, an m × n matrix A of rank k can be written on the form A = UVᵀ, where U is m × k and V is n × k. The ACA generates an approximation of matrices on this form by performing assembly of only some rows and columns of the matrix A.
In addition, there is a stopping criterion based only on a desired error tolerance, requiring no prior knowledge of k. For efficient compression using the ACA, the matrix A needs to have a numerical rank k ≪ m, n (for some error tolerance). This is typically not the case for an entire BI system matrix, but rank-deficient and compressible subblocks can be found. This subdivision into blocks is typically done by geometric clustering of unknowns using an octree, which creates groups at different levels as illustrated in Fig. 3. Assembly is then performed fully for interactions within groups and between directly adjacent groups at the leaf level of the octree. For non-adjacent groups, assembly is performed using the ACA, and this can be done in a multilevel way. The structure of a BI matrix A after this procedure is

A ≈ A_near + Σ_i U_i V_iᵀ,

where A_near is a sparse matrix representing the near interactions (including singular terms), and the sum is over all group interactions assembled using the ACA. The blocks assembled using the ACA can be further compressed using a recompression technique based on the QR and singular value decompositions (SVD) [49].
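As an illustration, the core of a partially pivoted ACA with this tolerance-based stopping criterion can be sketched in a few dozen lines of Python. This is a minimal sketch (the function names and the details of the running norm estimate are ours), not the implementation used in fe2ms:

```python
import numpy as np

def aca(get_row, get_col, m, n, tol=1e-6):
    """Partially pivoted ACA building A ~ U @ V.T from sampled rows/columns.

    get_row(i)/get_col(j) return single rows/columns of the (never fully
    assembled) m-by-n matrix, so the same code works for any kernel.
    """
    U, V = [], []
    frob2 = 0.0               # running estimate of ||U V^T||_F^2
    i, used = 0, set()
    for _ in range(min(m, n)):
        used.add(i)
        r = get_row(i).astype(complex)
        for uk, vk in zip(U, V):        # residual of row i
            r -= uk[i] * vk
        j = int(np.argmax(np.abs(r)))
        if np.abs(r[j]) < 1e-14:        # row already well approximated
            break
        v = r / r[j]
        u = get_col(j).astype(complex)
        for uk, vk in zip(U, V):        # residual of column j
            u -= vk[j] * uk
        # update the Frobenius norm estimate (cross terms + new term)
        frob2 += sum(2.0 * np.real(np.vdot(uk, u) * np.vdot(vk, v))
                     for uk, vk in zip(U, V))
        frob2 += np.real(np.vdot(u, u)) * np.real(np.vdot(v, v))
        U.append(u)
        V.append(v)
        # stopping criterion: needs only a tolerance, not the rank k
        if np.linalg.norm(u) * np.linalg.norm(v) <= tol * np.sqrt(frob2):
            break
        rows_left = [k for k in range(m) if k not in used]
        if not rows_left:
            break
        i = rows_left[int(np.argmax(np.abs(u[rows_left])))]
    return np.array(U).T, np.array(V).T   # shapes (m, k) and (n, k)
```

Here `get_row`/`get_col` would wrap the BI kernel evaluation for one group pair; since the algorithm only samples rows and columns, swapping the kernel requires no changes to this function.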

III. IMPLEMENTATION USING OPEN-SOURCE SOFTWARE

A. BACKGROUND
The code described in this paper has its origin in a project named Fast and Efficient ElectroMagnetic Solvers (hence the abbreviation FE2MS for the code presented here) where an FE-BI hybrid code was needed, and at an early stage it was decided that the code would use existing open-source software. There were many considerations leading to this decision, but a crucial factor was that we did not have access to an existing in-house code to build upon. The options at the outset were to either implement a complete code from scratch, or to investigate whether there was existing open-source software which could be used. The latter offered the possibility of much faster development, which was appealing, especially since the active code development was performed by the first author of the paper as a single developer. As will be discussed in the following sections, this open-source approach can range from using fairly mature open-source packages with only small additions, to more substantial development with open-source code used for specific functions.

B. INITIAL IMPLEMENTATION
A first implementation of an FE-BI hybrid code was based on the two open-source packages FEniCSx and Bempp-cl. FEniCSx is a highly versatile FEM package consisting of the components DOLFINx [6], FFCx [7], Basix [8], [9] and UFL [10]. Bempp-cl is a package for boundary element methods with support for a number of operators, including some based on Maxwell's equations [50]. Particularly interesting was that Bempp-cl was designed to be compatible with FEniCSx, with demos available to show how to connect them. Using these packages would mean that there could be more focus on testing various formulations and solution methods, and less on developing low-level functionality.
Both FEniCSx and Bempp-cl have interfaces in Python, using the rich framework for scientific computing available through NumPy and SciPy. While FEniCSx has its core written in C++ for performance, Bempp-cl is written almost purely in Python, with performance obtained using Numba and OpenCL. For an end user, these distinctions are not necessary to keep in mind, though, as both packages have simple public interfaces. Python was selected as the main programming language primarily due to these two packages, and this facilitated further connections between different packages. One main reason for this is the wide use of NumPy ndarray data structures in Python packages for scientific computing. Since so many packages use this data structure, as well as other functionality from NumPy, it becomes easy to combine software packages that were not designed to be compatible.

The public interface to FEniCSx is simple to use, but still very powerful and versatile. As an example, the essential parts related to defining a weak form and assembling its matrix are considered. Creating the K matrix in (9) using FEniCSx is done in Python as shown in the top row of Table 1. In this, inv_mur and epsr correspond to μr⁻¹ and εr, respectively, k0 corresponds to k0, and u and v are trial and test functions corresponding to N^X_n and N^X_m, respectively. This code snippet illustrates how implementing a weak form in FEniCSx is very similar to writing the mathematics explicitly. It is written using the FEniCSx component UFL (Unified Form Language) and is very versatile in what equations can be used. The form is then automatically compiled by the component FFCx (FEniCS Form Compiler) into a format that can be assembled by the component DOLFINx. To the user, most of this is hidden by default, making it very easy to define weak forms freely. This can be exemplified by the fact that the much more complicated (8) would only require a trivial amount of additional work to implement in FEniCSx compared to (9) (essentially, two additional lines corresponding to additional integrals are added to form).
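For orientation, a sketch of what such a weak-form definition and assembly can look like in FEniCSx is given below. The module layout and function names follow roughly DOLFINx v0.7 and may differ between versions; the mesh, element order and material values are placeholders:

```python
# Sketch of assembling a curl-curl weak form of the type in (9) with FEniCSx.
from mpi4py import MPI
import numpy as np
import ufl
from dolfinx import fem, mesh

msh = mesh.create_unit_cube(MPI.COMM_WORLD, 8, 8, 8)
V = fem.functionspace(msh, ("N1curl", 1))   # curl-conforming (Nedelec) space
u = ufl.TrialFunction(V)                    # corresponds to N_n
v = ufl.TestFunction(V)                     # corresponds to N_m
k0 = 2 * np.pi                              # wavenumber placeholder
inv_mur, epsr = 1.0, 2.0                    # scalar media here; tensors work too

# The UFL expression mirrors the mathematics of the weak form directly
a = (ufl.inner(inv_mur * ufl.curl(u), ufl.curl(v))
     - k0**2 * ufl.inner(epsr * u, v)) * ufl.dx
K = fem.assemble_matrix(fem.form(a))        # sparse FE matrix block
```

Extending this to a bianisotropic form amounts to adding the extra integral terms to `a`, which is what makes the UFL approach so convenient.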
Bempp-cl also has a simple public interface, but it is not as versatile as that of FEniCSx. Instead of freely defining the equations to solve, it is based on a number of pre-defined operators. Creating the matrix Q in (13) can be done in Python as shown in the bottom row of Table 1. This contains three variables for function spaces, which are defined with appropriate basis functions using other functions from Bempp-cl.
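A corresponding sketch using one of the pre-defined Bempp-cl operators might look as follows. The grid, function spaces and wavenumber are placeholders; consult the Bempp-cl documentation for the exact interface:

```python
import bempp.api
from bempp.api.operators.boundary import maxwell

grid = bempp.api.shapes.sphere(h=0.2)                  # simple test grid
div_space = bempp.api.function_space(grid, "RWG", 0)   # divergence conforming
curl_space = bempp.api.function_space(grid, "SNC", 0)  # dual (testing) space
k0 = 2.0

# Pre-defined electric field (EFIE) operator; unlike UFL, the equation
# itself is fixed and only the spaces and wavenumber can be chosen
efie = maxwell.electric_field(div_space, div_space, curl_space, k0)
Q = bempp.api.as_matrix(efie.weak_form())              # discretized matrix
```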
Both FEniCSx and Bempp-cl can generate simple simulation meshes, but they are limited to simple shapes and not suitable for much more than testing purposes. For general geometries, both packages are compatible with the Gmsh mesh format. Gmsh is an open-source meshing software with many other functions, for example related to creating geometries [51]. While Gmsh can be run as a standalone program using a graphical user interface or script files, it is also possible to use it through its application programming interface (API). This API is available for Python, which gives the convenience of writing geometry/meshing code and FEniCSx/Bempp-cl code in the same simulation script, making it easy to parametrize the geometry and mesh. For more complicated geometries, this generation method can be difficult to use, and in those cases STEP files can be imported to Gmsh. Geometries can then be created in more sophisticated computer-aided design (CAD) software and exported as STEP files, where different files can correspond to different media. The different media, as well as boundaries of interest, can be tagged in Gmsh so that the information is transferred to the msh file after meshing. This process is illustrated in Fig. 4.
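A sketch of this workflow through the Gmsh Python API is shown below for a parametrized coated sphere. The calls follow the Gmsh API (the name keyword of addPhysicalGroup assumes a reasonably recent Gmsh version); the tag names are placeholders, and in practice the inner PEC surface and the outer boundary would be tagged as separate groups:

```python
import gmsh

gmsh.initialize()
gmsh.model.add("coated_sphere")
a, t = 0.5, 0.1                                   # core radius, coating thickness
core = gmsh.model.occ.addSphere(0, 0, 0, a)
outer = gmsh.model.occ.addSphere(0, 0, 0, a + t)
# Coating volume = outer sphere minus the PEC core (the core is not meshed)
shell, _ = gmsh.model.occ.cut([(3, outer)], [(3, core)])
gmsh.model.occ.synchronize()

vol_tag = shell[0][1]
surfs = [s[1] for s in gmsh.model.getBoundary(shell, oriented=False)]
# Physical tags carry media/boundary information into the msh file
gmsh.model.addPhysicalGroup(3, [vol_tag], name="coating")
gmsh.model.addPhysicalGroup(2, surfs, name="boundaries")

gmsh.option.setNumber("Mesh.MeshSizeMax", t / 2)  # parametrized mesh size
gmsh.model.mesh.generate(3)
gmsh.write("coated_sphere.msh")
gmsh.finalize()
```

Because the geometry is built in the same Python script as the solver calls, sweeping over `a` or `t` is a simple loop.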

C. IMPLEMENTATION USING CUSTOM BI CODE
The implementation based on FEniCSx and Bempp-cl worked well as a starting point, but it was difficult to modify for more in-depth numerical experiments, as the boundary integral formulations that could be used were determined by the operators pre-defined in Bempp-cl. Working with, for example, the ACA would require significant modification to the internal code of Bempp-cl, which was estimated to be more difficult than implementing the BI part ourselves. One major reason behind this decision was that much of the infrastructure related to mesh and basis functions could be taken from FEniCSx without much modification. In addition to this, a code for singularity handling in the BI part was available as the open-source package DEMCEM [52], [53], [54], [55]. While DEMCEM is a C++ package, it could be used in Python through the package pybind11 [63]. This package makes it possible to use C++ code in Python if additional pieces of code (bindings) are written for this connection. In the case of DEMCEM, this did not require much effort as the communication needed between Python and C++ was quite limited.
Initial versions of the custom BI code were written in a very basic way which did not lead to high performance, especially due to the typically poor performance of native Python. This was particularly clear from the assembly time for non-singular matrix entries, which dominated the total solution time at every problem size. During further development, two main approaches were used to improve performance: NumPy vectorization and Numba JIT compilation.
The NumPy vectorization approach is based on reformulating code as operations on NumPy arrays. These data structures can be manipulated using highly optimized lower-level code in NumPy. By reformulating operations in Python (like loops) into operations on NumPy arrays, it is possible to achieve significant performance improvements. However, this approach can be difficult to generalize to very large procedures, such as the assembly of a full matrix block, and the resulting code is often difficult to read and debug.
The Numba JIT compilation approach is based on rewriting the code in a format which can be compiled using the package Numba. This is a package based on the LLVM compiler infrastructure, which can compile a subset of the Python language and NumPy into fast machine code, and it has support for parallelization [17]. If a function is declared to be JIT compiled using Numba, it will be compiled when it is first called, and the resulting machine code is stored in a cache for subsequent calls. To achieve optimal performance in Numba, the code is typically structured in a different way than typical Python code. For example, iterating over entries in multiple lists is typically done in Python without explicitly defining iteration variables (using the zip function). In Numba, on the other hand, it is preferred to iterate over an index variable, making the code seem less "Pythonic" and more similar to what it might look like in lower-level languages. The language is still Python, though, and interoperability between Numba functions and regular Python code is mostly seamless.
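The stylistic difference can be illustrated with a small sketch (the function names are ours). The indexed version is written in the Numba-preferred way and falls back to plain Python if Numba is not installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:                     # graceful fallback to plain Python
    def njit(*args, **kwargs):
        if args and callable(args[0]):  # used as @njit
            return args[0]
        return lambda f: f              # used as @njit(...)
    prange = range

# "Pythonic" iteration over pairs with zip -- convenient in plain Python
def row_dots_zip(xs, ys):
    return np.array([np.dot(x, y) for x, y in zip(xs, ys)])

# Numba-friendly style: explicit index variables and a parallel outer loop
@njit(parallel=True)
def row_dots_indexed(xs, ys):
    out = np.empty(xs.shape[0])
    for i in prange(xs.shape[0]):       # parallel loop over rows
        acc = 0.0
        for j in range(xs.shape[1]):
            acc += xs[i, j] * ys[i, j]
        out[i] = acc
    return out
```

Both functions compute the same row-wise dot products; only the second is compilable (and parallelizable) by Numba.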
As a comparison between the different approaches, consider the assembly of an N × N matrix with entries corresponding to the free-space Green's function,

A_mn = e^(−jk0 |r_m − r_n|) / (4π |r_m − r_n|),   m ≠ n,

where r_i is point i on a mesh. The entries for m = n are ignored so that only non-singular entries are considered in this example. This computation has similarities to the non-singular assembly of BI operators, but is much simpler in order to make the code easier to follow. The computation can be implemented using standard Python loops, using NumPy vectorization and using Numba JIT compilation as shown in Table 2. For the Numba computation, the outer loop was run in parallel on 6 cores using the prange command.
Computations were performed for k0 = 1 m⁻¹ with points equally spaced on a square surface with side length 1 m. The run times for an increasing number of points are shown in Fig. 5 for the three approaches. It is seen that NumPy vectorization is significantly faster than the standard Python approach, and Numba is faster still. In more complicated computations, like those found in our actual BI assembly code, the difference between vectorization and Numba can grow even larger. With more variables to access and use in computations, it becomes increasingly difficult to write good vectorized code. Loops will almost inevitably have to be introduced at some point for the code to be readable and maintainable. With Numba, on the other hand, loops are fast due to the JIT compilation, and it is easy to write readable, fast code for more complicated computations. In our work, this made the performance boost when switching to Numba very dramatic, with an improvement in assembly time of around two orders of magnitude compared to an earlier partly vectorized implementation.
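For reference, the first two approaches from Table 2 can be sketched as follows (the Numba variant has the same structure as the loop variant, decorated with njit and with prange for the outer loop). The function names are ours:

```python
import numpy as np

def greens_matrix_loops(points, k0):
    """Straightforward nested loops (slow in plain Python)."""
    n = points.shape[0]
    A = np.zeros((n, n), dtype=complex)
    for m in range(n):
        for mm in range(n):
            if m == mm:
                continue  # singular diagonal entries are ignored
            R = np.linalg.norm(points[m] - points[mm])
            A[m, mm] = np.exp(-1j * k0 * R) / (4 * np.pi * R)
    return A

def greens_matrix_vectorized(points, k0):
    """Same matrix via NumPy broadcasting over all point pairs at once."""
    diff = points[:, None, :] - points[None, :, :]
    R = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(R, 1.0)          # avoid division by zero on the diagonal
    A = np.exp(-1j * k0 * R) / (4 * np.pi * R)
    np.fill_diagonal(A, 0.0)          # diagonal entries are ignored
    return A
```

The vectorized version trades readability for speed: all pairwise distances are formed in one broadcasted operation, which is fast but harder to extend once basis functions and quadrature enter the computation.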

D. WORK WITH THE ACA
One of the main reasons for implementing our own custom BI code was that it would make it easier to experiment with compression methods like the ACA. The core ACA algorithm as described in [36], [37] is not very difficult to implement, and for this reason the work related to fast methods started with our own implementation. In the first steps, a single-level algorithm was implemented, as this was also a step on the way to a multilevel algorithm. The implementation of a multilevel algorithm did not require much more effort, mainly due to the use of the AdaptOctree package [56] for subdivision of the geometry. Similar to the already existing BI code, the AdaptOctree package uses Numba to improve performance.
As demonstrated in Section V, our ACA implementation provides significant compression of system matrices. There are, however, multiple packages available which provide not only implementations of the ACA or similar compression methods, but also a more complete H-matrix algebra [57], [58]. These could be highly interesting in extensions to the code, as such packages could provide functionality for better preconditioning or fast direct methods. The coupling to the current code would be nontrivial, though, as these packages are typically not written in Python, and H-matrix assembly requires stronger coupling between different languages than, for example, was required for DEMCEM. Nevertheless, such extensions constitute a natural way forward, and the work required to couple the code with external packages is likely less than what is required to implement the equivalent functionality from scratch.

E. EFFORT SAVED BY USING OPEN SOURCE
To give an idea of how much effort might have been saved by using open-source software as described in this section, data generated using David A. Wheeler's 'SLOCCount' is presented. This includes source lines of code (SLOC), as well as rough effort estimates using the Constructive Cost Model (COCOMO) [59]. While these metrics should be interpreted with caution, they still give an idea of the benefit of using existing open-source software in this way.
A comparison between our code (FE2MS) and some of the used packages is shown in Table 3.The counted lines are for the source code, excluding demos and tests, of the versions used in our code as of writing this paper.
The 0.63 person-years estimated development effort for FE2MS can be compared to a calendar-based estimate of 0.7 years, which is reasonably close.This second estimate is based on timestamps from git commits, with some known times of other major activities like teaching, courses and conferences subtracted.This is, of course, also a crude estimate.On one hand, the estimate does not include time for code that was reused from the earlier version based on Bempp-cl, but on the other hand, it includes activities that are unrelated to the direct code development.Regardless, both estimates give an order of magnitude idea of the development times.It is difficult to gauge how close the estimates are for other packages though, and looking at SLOC for the two languages might be more fair, though still not a perfect metric.
A comparison to DEMCEM is interesting as most functionality of this package is used.It is also an example of code in another programming language that is connected to Python through our own pybind11 code.In fact, that connection code exactly represents the 277 C++ SLOC for FE2MS in Table 3, which is fairly simple code as it only interfaces directly to C++ functions in DEMCEM.This can be contrasted to the much larger SLOC number in DEMCEM which also represents significantly more complex code.
While DEMCEM is fairly limited in its scope, FEniCSx is very general and versatile, as indicated by its large SLOC numbers in Table 3. Here, all components DOLFINx, FFCx, Basix and UFL are included. Much of this is not used directly in the FE2MS code, and a bare-bones version could have been written with significantly fewer SLOC. However, the size and scope of FEniCSx should by no means be seen as unnecessary, as it is precisely this that made it very easy to extend the code from simple dielectrics to general bianisotropic media.
As a final example, AdaptOctree is included in Table 3. This is a small package used only to generate octree data structures from points in the mesh, and this is reflected in the metrics. It would not have been an unreasonable effort to implement this functionality ourselves, but the use of open-source software for that purpose still meant that other things could be prioritized instead.
While all metrics discussed above are uncertain, it is very clear that a substantial amount of development effort has been saved by the use of existing open-source software. Had we opted to develop a similar code completely from scratch, it would probably have taken much longer, with a less versatile and less capable end result. We believe that these insights hold in a wider context than just our work with FE-BI codes, as the combination of different functionality is a much more general principle, and there are likely many in the CEM community who could benefit from using more open-source software.

IV. OVERVIEW OF CODE COMPONENTS
In the following, an overview of the components in the code is given, with more attention to how the open-source packages discussed in Section III were combined. For those with a particular interest in the actual code and details on the use of different packages, we refer to the GitHub repository at github.com/nwingren/fe2ms. This repository also includes installation instructions and demos showing how the code can be used, as well as a link to documentation of the code. In the demo script for the coated PEC sphere on GitHub, the steps described in this overview can be seen and are commented upon, although not to the same level of detail as here. The most important open-source packages used by the code were introduced in Section III, and an overview of how they work together is shown in Fig. 6. More detailed descriptions of the different steps follow.

A. GEOMETRY AND MESHING
The code uses Gmsh to generate geometry and mesh using its Python API.An example of how this is done can be found in the demos on our GitHub repository, but tutorials are also available at the Gmsh website gmsh.info.Mesh generation using STEP files, as described in Section III, can also be used (with tutorials available on the Gmsh website).

B. COMPUTATIONAL VOLUME AND SYSTEM
The parameters for Ω (mesh, boundary conditions and material parameters) are combined into a "computational volume" data structure in the code. This is created by loading a mesh from a msh file into DOLFINx, listing the material properties for different parts of the mesh and defining boundaries. The tags defined in Gmsh are used here to appropriately link properties to mesh elements. From this linking, DOLFINx data structures are generated for the different material properties εr, μr, ξ and ζ, as well as for the boundaries corresponding to ∂Ω and PEC. When a computational volume has been generated, a system for a specific frequency is defined, either with fully assembled matrix blocks or with BI blocks assembled using the ACA. The FE formulation (conventional or magnetoelectric) is automatically inferred from the material parameters in the computational volume.
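As an illustration of this workflow, a hypothetical user script might look as follows. All names here (ComputationVolume, FEBISystemACA, and the argument structure) are illustrative assumptions only, not the documented fe2ms API; the demos in the GitHub repository show the actual calls:

```python
# Hypothetical sketch only -- see the fe2ms demos for the real interface.
import fe2ms  # assumed package import

# Combine the mesh, material tags and boundary tags (as defined in Gmsh)
# into a computational volume data structure
cv = fe2ms.ComputationVolume(
    "coated_sphere.msh",
    materials={"coating": {"epsr": 2.0, "mur": 1.0}},  # per-tag media
    boundaries=["exterior"],                           # tags for the BI boundary
)

# Define a system at a specific frequency; the FE formulation
# (conventional or magnetoelectric) is inferred from the materials,
# and the BI blocks are here compressed with the multilevel ACA
system = fe2ms.FEBISystemACA(1e9, cv)
```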

C. ASSEMBLY
The assembly is performed differently for the FE and BI parts. For the FE part, FEniCSx is used to efficiently assemble the matrix blocks K^XY as sparse matrices. The UFL package in FEniCSx is used to easily generate any necessary FE form, of which there are two different types in this work, depending on whether bianisotropy is included or not, as described earlier. The DOLFINx data structures for material and boundary data from the computational volume are used here. Assembled FE blocks are converted to SciPy sparse arrays for use together with BI blocks.
The BI part uses a mesh on ∂Ω generated from DOLFINx using the previously defined data structures for this boundary. If assembly is to be performed using the ACA, an octree is generated using AdaptOctree for the edges in this mesh. The octree is generated with a setting for the maximum number of edges per leaf group, and this can be used to iteratively find a tree with a desired mean number of edges per non-empty leaf group.
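This iterative leaf-size search can be sketched with a uniform grid standing in for the octree (AdaptOctree itself builds adaptive trees; the names and the uniform subdivision here are simplifications of ours):

```python
import numpy as np

def mean_per_leaf(points, depth):
    """Mean number of points per non-empty leaf box at a given octree depth."""
    lo = points.min(axis=0)
    span = np.ptp(points, axis=0).max() + 1e-12   # cube side enclosing all points
    nbox = 2 ** depth                             # boxes per axis at this depth
    idx = np.minimum((nbox * (points - lo) / span).astype(int), nbox - 1)
    keys = (idx[:, 0] * nbox + idx[:, 1]) * nbox + idx[:, 2]
    _, counts = np.unique(keys, return_counts=True)
    return counts.mean()

def choose_depth(points, target_mean, max_depth=10):
    """Increase the depth until the mean leaf population reaches the target."""
    for depth in range(1, max_depth + 1):
        if mean_per_leaf(points, depth) <= target_mean:
            return depth
    return max_depth
```

In the actual code the points would be boundary edge midpoints, and the quantity tuned is the maximum number of edges per leaf rather than the depth, but the iterate-and-measure idea is the same.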
For both full assembly and ACA assembly, all singular and near-singular entries of the BI matrices need to be fully computed. This is done using DEMCEM, where entries are computed for basis and testing functions on the same facet, on facets sharing a common edge and on facets sharing a common vertex. For non-singular assembly, basis functions and quadrature rules are defined on the mesh using Basix. If full assembly is used, assembly is performed for all edge pairs not included in the previous step. If ACA assembly is used, non-singular assembly is performed in two steps. The first is full assembly of the terms corresponding to near interactions, i.e., leaf groups in the octree which are neighbors. Together with the previously computed singular and near-singular terms, these results are stored as SciPy sparse matrices. Next, ACA assembly is performed for the far interactions on each level of the octree. Regardless of the assembly type, JIT compilation with Numba is used in the BI part to improve the Python performance.
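The core of a partially pivoted ACA is compact enough to sketch on its own. The following is a generic NumPy version with the standard relative Frobenius-norm stopping criterion, not the Numba-compiled implementation in our code; the interaction kernel is a made-up smooth function standing in for well-separated BI interactions.

```python
import numpy as np

def aca(get_row, get_col, m, n, tol=1e-6):
    """Partially pivoted ACA: returns U (m x k), V (k x n) with A ≈ U @ V.

    get_row(i) / get_col(j) return one full row / column of the block,
    so the full matrix is never formed.
    """
    U, V = [], []
    norm2 = 0.0          # running estimate of ||U V||_F^2
    used_rows = {0}
    i = 0                # current row pivot
    for _ in range(min(m, n)):
        # Row of the residual A - U V at the pivot row.
        r = get_row(i) - sum(u[i] * v for u, v in zip(U, V))
        j = int(np.argmax(np.abs(r)))
        if abs(r[j]) < 1e-14:
            break        # residual row is numerically zero
        v_new = r / r[j]
        # Column of the residual at the pivot column.
        u_new = get_col(j) - sum(u * v[j] for u, v in zip(U, V))
        # Update the Frobenius-norm estimate of the approximation.
        norm2 += 2.0 * sum((u @ u_new) * (v @ v_new) for u, v in zip(U, V))
        norm2 += (u_new @ u_new) * (v_new @ v_new)
        U.append(u_new)
        V.append(v_new)
        if np.linalg.norm(u_new) * np.linalg.norm(v_new) <= tol * np.sqrt(norm2):
            break        # reached the requested relative tolerance
        # Next row pivot: largest entry of u_new among unused rows.
        scores = np.abs(u_new)
        scores[list(used_rows)] = -1.0
        i = int(np.argmax(scores))
        used_rows.add(i)
    return np.column_stack(U), np.vstack(V)

# Well-separated "source" and "observation" clusters give a low-rank block.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(3.0, 4.0, 200)
U, V = aca(lambda i: 1.0 / (1.0 + (x[i] - y) ** 2),
           lambda j: 1.0 / (1.0 + (x - y[j]) ** 2),
           len(x), len(y))
```

Only k rows and k columns of the block are ever evaluated, which is where the compression in both memory and assembly time comes from.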

D. SOLUTION
If the system is assembled fully, it is possible to use classical direct methods from SciPy such as Gaussian elimination or LU decomposition to solve the problem. In addition to the classical direct methods designed for dense matrices, there are also similar methods designed for sparse matrices. However, the system matrix for this problem has both sparse and dense blocks, meaning that neither approach is optimal. Methods designed for dense matrices would require all zeros in the system matrix to be explicitly stored, greatly increasing memory use, while methods designed for sparse matrices would suffer from fill-in due to the large dense regions, increasing both memory use and computation times. Iterative solvers based on matrix-vector multiplication are typically more memory efficient, and many such solvers are available in SciPy. The ACA, as implemented at the time of writing, also only supports iterative solvers and is not well suited to classical direct methods. It is, however, very well suited to fast direct methods based on H-matrix algebra, but these have not been implemented at the time of writing, as discussed in Section III.
For iterative solvers, a good preconditioner is typically crucial for convergence. This is especially important for the FE-BI method since it is based on an ill-conditioned system [39]. In this code we use a preconditioner similar to that in [64], based on creating a sparse, symmetric system (18) from the blocks in (11), where P and Q are made sparse by only considering interactions within single mesh elements. By careful inspection of (10) and (12), it can be seen that Bᵀ = −2jk₀P, since the operator K vanishes for interactions within a single planar mesh element. The blocks P and Q are generated using the singular and near-singular entries already computed earlier.
A SciPy sparse array corresponding to (18) is created from these blocks together with the FE blocks. For a vector y, the preconditioned vector x is given by solving the system Mx = y. This can be solved directly by computing a sparse LU decomposition using UMFPACK through a SciPy interface. This has the advantage of only being required once for all right-hand sides, but on the other hand it can be memory demanding. For lower memory usage, Mx = y can instead be solved iteratively each time a preconditioned vector is required. This in turn requires an inner preconditioning step for most problem sizes. PETSc [60], [61], [62] is used for this inner iteration through the Python interface petsc4py [65]. In either case, right preconditioning is used in the iterative solvers. Right-hand sides for the problem can be generated for one or a combination of incident plane waves with given amplitude, polarization and direction. The system can then be solved for this right-hand side using the desired solver and preconditioner. After the solution has been computed, the far field, radar cross section (RCS) and near field can be computed. If multiple right-hand sides are required, for example when computing monostatic RCS, the procedure has to be repeated multiple times, but without recomputing the preconditioner.
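The mechanics of combining an iterative solver with a reusable sparse LU preconditioner can be sketched with SciPy alone. The system below is a stand-in (a sparse part plus a small dense perturbation, mimicking the mixed sparse/dense FE-BI structure), not the actual blocks from (18), and note that SciPy applies `M` as an approximate inverse rather than the exact right-preconditioned scheme described above.

```python
# Hedged sketch: LGMRES with a sparse LU preconditioner in SciPy.
# The system is a synthetic stand-in, not the FE-BI system.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

rng = np.random.default_rng(0)
n = 500

# Sparse, well-conditioned part plus a small dense perturbation.
A_sparse = sp.diags([np.full(n, 4.0), np.full(n - 1, -1.0), np.full(n - 1, -1.0)],
                    [0, -1, 1], format="csc")
A_dense = 0.001 * rng.standard_normal((n, n))
A = spla.aslinearoperator(A_sparse) + spla.aslinearoperator(A_dense)

b = rng.standard_normal(n)

# Preconditioner: LU factorization of the sparse part only, computed
# once and then reused for every preconditioner application Mx = y.
lu = spla.splu(A_sparse)
M = spla.LinearOperator((n, n), matvec=lu.solve)

x, info = spla.lgmres(A, b, M=M, atol=1e-10)
```

Because `A` is only ever used through matrix-vector products, the same structure works unchanged when the dense part is replaced by an ACA-compressed operator.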

V. NUMERICAL EXAMPLES
To demonstrate the capabilities and accuracy of the code, three numerical examples are presented. The first example is a coated PEC sphere, which is used to illustrate the convergence of the method. The second example is a wind turbine rotor, chosen to illustrate a CAD geometry. The third example shows that bianisotropic material properties can be handled by the code. Unless stated otherwise, the ACA with recompression was used, with the mean number of edges per non-empty leaf group set to the square root of the number of BI unknowns and an ACA/SVD tolerance of 10⁻³. Furthermore, the iterative solver LGMRES [66] was used with a relative tolerance of 10⁻⁵. All simulations were run on a desktop computer with an Intel Core i5-9600K CPU @ 3.70 GHz and 96 GB RAM.

A. COATED PEC SPHERE
The first numerical example is scattering by a PEC sphere coated by two isotropic dielectric layers, as shown in the inset of Fig. 7. The PEC sphere radius was 0.15λ₀ and the coatings were 0.05λ₀ thick, with λ₀ = 1 m being the free-space wavelength at the simulation frequency 300 MHz.
The geometry was both created and meshed in Gmsh, with element sizes varying between simulations to study the h-convergence. Direct preconditioning was used, except for the case with the most unknowns, where iterative preconditioning with LGMRES and Gauss-Seidel inner preconditioning was used. The bistatic RCS was computed in the E- and H-planes, as well as the absorption cross section from the near fields at ∂Ω. These were compared to the corresponding values from an exact Mie solution [67].
The bistatic RCS is shown in Fig. 7 for 414 000 FE unknowns and 10 857 BI unknowns, computed both using full BI matrices and BI matrices compressed by the ACA. Exact reference values computed using a Mie solution are also shown. Good agreement between the exact Mie solution and the FE-BI solution is seen, both for the solution using full BI matrices and the solution using ACA compression.
The errors were computed for all simulations with varying mesh element sizes, and these are shown in Fig. 8, with the errors for the solution in Fig. 7 encircled. Convergence is seen for increasing mesh density, both in the near field (absorption) and the far field (RCS).

B. WIND TURBINE ROTOR
For a more geometrically complex and larger problem, scattering by a wind turbine rotor (as shown in Fig. 9) was considered. A simplified model was used where the rotor was constructed as a fiberglass shell with shear webs on the interior of the blades. The frequency was set to 30 MHz. FR-4 was used to model the fiberglass walls due to it being a well-characterized fiberglass material with some similarities to that used in wind turbine blades. Its properties were set to ε_r = 4.9 − 0.086j, μ_r = 1, corresponding to FR-4 at 1 MHz with a volume fraction of 55% resin [68].
The geometry was created in FreeCAD and transferred to Gmsh using STEP files. It was then meshed, giving a total of 224 484 FE unknowns and 72 000 BI unknowns. For comparison, the same mesh was imported into Feko (using both Gmsh and COMSOL Multiphysics to convert the mesh format), where a FE-BI simulation using the MLFMA was performed. The simulation used the EFIE, linear basis functions and double precision to more closely match the settings used in our code. The bistatic RCS was computed in the xy-plane for plane waves incident along x and polarized along y and z, using both our code and Feko. The results are shown in Fig. 10. The agreement is good where the RCS values are high, but discrepancies are seen at lower levels. These discrepancies can be attributed to the different fast methods used in our code (ACA) and Feko (MLFMA). Both methods are based on approximations, but not the same ones, and as such some differences should be expected. Better agreement could be obtained if second-order basis functions were used in Feko for the same mesh. However, since our method is based on first-order basis functions, the same was kept in Feko to keep the performance metrics comparable.
Due to the large separations between different geometrical regions and the large elongated structures, it could be expected that a compression method based on low-rank approximations like the ACA performs well. The theoretical memory required to store the BI matrices P and Q fully for this problem is 166 GB (based on storing two 72 000 × 72 000 matrices using 128-bit complex floats). This is significantly more than the available memory on the machine used. Table 4 shows memory use and times for some parts of the simulations. It is clear that the use of the ACA results in significant compression, although not as significant as the MLFMA in Feko. This should not come as a surprise, as the ACA is theoretically less efficient than the MLFMA for non-static problems [24]. Looking at preconditioning in Table 4, Feko required much more memory than our code. There could be a multitude of reasons for this, but it is not fully clear to the authors, as such details are not typically disclosed for commercial codes. It might be that the preconditioning used in Feko leads to faster convergence than in our code, but the memory use is still very high, and this sets a hard limit on what can be computed. In terms of peak memory use, both codes used more than the sum of the two listed components, with Feko still using more than our code.
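The 166 GB figure follows from a back-of-the-envelope computation:

```python
n = 72_000                    # BI unknowns
bytes_per_entry = 16          # one 128-bit complex float
full_gb = 2 * n**2 * bytes_per_entry / 1e9   # two dense blocks, P and Q
print(f"{full_gb:.0f} GB")    # → 166 GB
```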
If the times in Table 4 are compared, our code is slower than Feko for all listed metrics except preconditioning. In some cases this could be due to the algorithms themselves, such as the time for the solver. Although the iterative solvers were not identical, it could be observed that Feko required fewer iterations than our code. This has a direct impact on the solver time, and also strengthens the hypothesis that the preconditioning in Feko leads to faster convergence than ours. In cases with many more right-hand sides than this example, such differences can matter significantly. It should be noted that the code used in our solver, particularly for the ACA matrix-vector multiplication, could likely be optimized for a significant performance boost. Similarly, other slow run times in our code can be explained by differences in programming languages and, perhaps most importantly, the degree of code optimization. With Feko being a commercial code developed by a large company, it is to be expected that it is optimized for fast run times. As a single developer, it is difficult to match this. Still, while the total run time for our code is more than twice that of Feko, it is not so extreme as to make the code unusable.

C. BIANISOTROPIC MEDIUM
To demonstrate the ability to perform computations involving more complex media, an example from [32] with inhomogeneous complex media was considered. The geometry was a stacked structure of PEC and bianisotropic layers, with the material parameters for the bianisotropic layer taken from [32]. To be able to incorporate the PEC layers in the FE-BI geometry, the structure was coated by a 0.03λ₀ thick layer of air which was not present in [32]. The excitation was a plane wave incident along −z and polarized along x. A mesh was generated with 26 747 FE unknowns and 5 406 BI unknowns. Due to the small number of unknowns, full BI matrices without the ACA were used in the simulations. The bistatic RCS was computed in two cut planes corresponding to the xz (φ = 0°) and yz (φ = 90°) planes, and the resulting RCS normalized to λ₀² is shown in Fig. 11, together with results from [32] using a volume integral equation (VIE) based on the CFIE. Good agreement between the two methods is seen in both cut planes.

VI. DISCUSSION AND CONCLUSION
In all three numerical examples presented, there is clear agreement between the results from our FE-BI code and the reference results, which acts as a non-exhaustive verification of the code. The ACA compression has been shown to be effective, particularly in the wind turbine example. The three examples together demonstrate that a wide range of scattering problems can be solved by the code, and that it is not limited to small problems.
While the code can clearly perform computations for problems of significant size, it is not yet able to handle the very large problems where computer clusters are required. In its current form, the code supports parallel execution on multiple cores of a single computer using Numba's automatic parallelization and the parallel functionality included in other packages. More development would be needed for cluster computing with distributed memory. Much of FEniCSx is already prepared for this, though, which could make that work easier. Another improvement to the code performance would be to use fast direct methods, as discussed briefly in Section III. These types of algorithms would be very suitable for problems with many right-hand sides and would also reduce the influence of the ill-conditioned FE-BI system.
In this article we have described how different open-source packages have been combined, in the hope that this will be useful to others in the field. It can be difficult and time-consuming to experiment with different algorithms if they have to be implemented from scratch, and as we have described in this paper, much development effort was saved by opting for this approach. We hope that others active in the field might be inspired to work more with open-source software, as a wider use of this type of development could benefit the field greatly. With more software openly available for well-understood algorithms, possibilities open up for more contributors to work on the parts not yet understood.

FIGURE 1. General scattering problem geometry. The volume can contain inhomogeneous linear media and internal PEC regions defined by their boundaries. Excitation is due to an incident plane wave or combinations of plane waves from the exterior.

FIGURE 2. Nonzero entries of a FE-BI system matrix, with red lines showing the limits of the blocks in (11). The system corresponds to a small PEC sphere coated by a thin dielectric layer.

FIGURE 3. Two consecutive octree levels for a wind turbine rotor geometry. The left level has no non-adjacent groups while the right level has several. Empty groups are not shown.

FIGURE 4. Combination of STEP files into a common mesh with elements tagged accordingly. The STEP files correspond to different materials in a (simplified) wind turbine rotor: a fiberglass shell and the air in its interior.

FIGURE 5. Computation times for an increasing number of mesh points using the functions displayed in Table 2.

FIGURE 6. Flowchart showing how the different packages work together. Note that this is a simplification which does not reflect the exact structure of the code.

FIGURE 7. Bistatic RCS in E- and H-planes. Inset shows a schematic figure of the sphere with the material parameters of the layers.

FIGURE 8. Errors between the FE-BI solution using the ACA and the Mie solution. The RCS errors are quantified as root mean square errors normalized to the maximum RCS value. The absorption cross section error is quantified as the relative error. Encircled values are for the solution with RCS shown in Fig. 7.

FIGURE 9. CAD model of wind turbine rotor. Inset shows internal structure along cut plane.

FIGURE 10. Bistatic RCS of the wind turbine rotor in the xy-plane, computed using our code and Feko.

FIGURE 11. Bistatic RCS in two cut planes computed using the FE-BI code from this work and the VIE-CFIE code from [32]. Data from [32] extracted using WebPlotDigitizer [69]. Inset shows a schematic figure of the structure and the incident field.