Parallel FDTD modelling of nonlocality in plasmonics

As nanofabrication techniques become more precise, with ever smaller feature sizes, the ability to model nonlocal effects in plasmonics becomes increasingly important. While nonlocal models based on hydrodynamics have been implemented using various computational electromagnetics techniques, the finite-difference time-domain (FDTD) version has remained elusive. Here we present a comprehensive FDTD implementation of nonlocal hydrodynamics, including for parallel computing. As a sub-nanometer step size is required to resolve nonlocal effects, a parallel implementation makes the computational cost of nonlocal FDTD more affordable. We first validate our algorithms for small spherical metallic particles, and find that nonlocality smears out staircasing artifacts at metal surfaces, increasing the accuracy over local models. We find this also for a larger nanostructure with sharp extrusions. The large size of this simulation, where nonlocal effects are clearly present, highlights the importance and impact of a parallel implementation in FDTD.

Parallel FDTD modelling of nonlocality in plasmonics Joshua Baxter, Antonino Calà Lesina, and Lora Ramunno Abstract-As nanofabrication techniques become more precise, with ever smaller feature sizes, the ability to model nonlocal effects in plasmonics becomes increasingly important. While nonlocal models based on hydrodynamics have been implemented using various computational electromagnetics techniques, the finite-difference time-domain (FDTD) version has remained elusive. Here we present a comprehensive FDTD implementation of nonlocal hydrodynamics, including for parallel computing. As a sub-nanometer step size is required to resolve nonlocal effects, a parallel implementation makes the computational cost of nonlocal FDTD more affordable. We first validate our algorithms for small spherical metallic particles, and find that nonlocality smears out staircasing artifacts at metal surfaces, increasing the accuracy over local models. We find this also for a larger nanostructure with sharp extrusions. The large size of this simulation, where nonlocal effects are clearly present, highlights the importance and impact of a parallel implementation in FDTD.
Numerical modelling in plasmonics is typically based on optical models for bulk permittivity. These models -such as the Drude model for free electron response, and the Lorentz and critical points models for bound electron response [9], [10] -are based on the local response approximation (LRA), which assumes that the induced polarization or current at a given location depends only on the electromagnetic field at that same location. While appropriate for many applications, the Drude model is insufficient for modelling plasmonic structures smaller than 10 nm [11]- [14], as well as those containing sharp features or nanoscale gaps [15], [16]. The charge density near the surface of plasmonic objects is spread over a finite thickness on the order of the a few angstroms [16], [17], and thus for small features, cannot be treated as localized at the surface, as is implicit in LRA models. The hydrodynamic plasma model treats the conduction band electrons as a free electron gas [5], [12], [16]- [18], accounting for free electron fluctuations via a pressure term. Unlike the Drude model, it does not make an LRA and thus nonlocality is incorporated. It has been found to correctly predict the expected spread of charge density near plasmonic surfaces as well as predict the blueshift in plasmon resonance with decreasing particle size [13], [17], [19]. More recently, the effects of electron diffusion were incorporated within a generalized nonlocal optical response (GNOR) model, correctly predicting a broadened line-shape that was observed experimentally [20].
Nonlocal hydrodynamic models have been implemented using several computational electromagnetic methods including the finite element method [21], the discontinuous Galerkin time domain method [22], and the boundary element method [23]. A finite-difference time-domain (FDTD) implementation would be advantageous due to the wide-spread use of FDTD in the photonics community, its relative ease of implementation, its broadband capabilities, and its ability to produce timedomain movies. Only two attempts have been made [24], [25], where one is based on an erroneous approximation, neither consider high performance computing and parallel implementations, and neither include electron diffusion. A correct and comprehensive FDTD implementation has remained unreported.
In this paper, we present a parallel FDTD implementation of nonlocal hydrodynamics, including a GNOR implementation. A parallel implementation in FDTD is especially important for simulating nonlocality, as the grid cell size needs to be smaller than the Fermi wavelength λ F ∼ 0.5 nm to capture the spread of electron density [26]. This can require a large amount of memory, and can present a prohibitive computational load. High-performance computing represents a viable solution. The FDTD rectangular-meshing scheme, where a field update at one location requires fields only in adjacent Yee-cells [27], lends itself well to parallel computing; indeed, message passing interface (MPI)-based FDTD solvers are well reported for LRA models [28]. Furthermore, highperformance computing is becoming more accessible with cloud computing, computing consortiums, increased highperformance computing investments, and new, higher-level parallel programming languages such as Chapel [29].
The structure of this paper is as follows. We review nonlocal hydrodynamics in Section II, including the most common version without electron diffusion, as well as GNOR, which does include electron diffusion. In Section III, we derive from the nonlocal models FDTD update equations for the polarization field via the auxiliary differential equation (ADE) method. In Section IV, we discuss the implementation of the nonlocal FDTD update equations for parallel computing within a MPI framework. In Section V, we test our implementations by simulating the optical response of small metallic nanospheres and comparing our results to analytic solutions and experimental results. We find an unexpected benefit of arXiv:2005.06997v1 [physics.optics] 14 May 2020 nonlocal versus LRA modelling: a marked decrease in staircasing artifacts at the metal boundary. Due to the rectangular discretization inherent in most FDTD approaches, fields can build up in an unphysical manner at plasmonic surfaces, and this can be particularly problematic in applications that rely on plasmonic near-field enhancement [30]. The incorporation of nonlocality significantly decreases this unphysical field build up. In Section VI, we simulate the response from a spherical nanoparticle containing sharp extrusions as a demonstration of a large-scale simulation that requires both parallel computing and nonlocal modelling. We find that despite the nanoparticle being larger, the sharp extrusions exhibit plasmonic features not accessible to the LRA. Finally, in Section VII we give concluding remarks.

II. HYDRODYNAMIC MODELS FOR NONLOCALITY
The most general response of a linear optical material to incident radiation is described by where the dielectric function of the material, ε(r, r , ω), is nonlocal when the displacement field D at one location depends on the electric field E at other locations. In many applications, it is appropriate to use a local response approximation (LRA), wherein ε(r, r , ω) = δ(r − r )ε(ω). In plasmonic modelling, the interaction of the electric field and the free electron plasma is typically described by the Drude model, which employs the LRA, where the dielectric function reduces to [9] ε(ω) = 1 − ω 2 where ω p is the plasma frequency, γ is a collisional damping rate, and i is the imaginary unit. The hydrodynamic plasma model [17] goes beyond the LRA, more accurately describing spatial-temporal free electron dynamics via ∂v ∂t where v is the velocity field of the free electron plasma, E and B are the electric and magnetic fields, and n is the free electron density. The energy functional G[n] considers the internal kinetic energy of the electron gas and is usually taken to be the Thomas-Fermi functional, giving where h is Plancks constant. As we are only considering the linear nonlocal response in this paper, we neglect the nonlinear terms (magnetic-Lorentz and convection) in Eq. 3 giving where β 2 = 1/3v 2 F and v F is the Fermi velocity. The free electron density fluctuations are accounted for in the last term of Eq. 5. This is often referred to as the pressure term, and is responsible for the known blue shift in the surface plasmon resonance with decreasing particle size [13], [17]. Along with this formula for the velocity field, we require the continuity equation given by From here we consider two approaches. The first is the most widely used, and assumes a current density given by J = −env. As we use a polarization field formulation in this paper, we set J = ∂P N L ∂t , where we have defined P N L to be the (nonlocal) free electron polarization field. The velocity field is thus v = − 1 ne ∂P N L ∂t , and Eq. 6 becomes where n 0 is the equilibrium free electron density. From Eq. 5, we then obtain where ω p = e 2 n0 mε0 . We call Eq. 8 the "nonlocal Drude model" because it consists of the classical LRA Drude model plus one additional term that gives rise to nonlocality (i.e., the term proportional to β 2 ).
The model represented by Eq. 8 differs from those used in the two previous nonlocal FDTD works. In Ref. [24], the gradient-divergence term was simplified to a Laplacian, and this resulted in spurious, non-physical resonances [18]. In Ref. [23], a current density formulation is used. A benefit to using a polarization field formulation is that it gives ready access to the free electron density via Gauss's Law, as we demonstrate in Section V.
The second approach we consider includes electron diffusion, which was also found to play an important role in the optical response of small metallic particles [20]. This has been described by the generalized nonlocal optical response (GNOR) model, where diffusion is considered by modifying the expression for the velocity field via v = − 1 en ( ∂P G ∂t + D∇(∇ · P G )), where D is the diffusion coefficient and where we have denoted the nonlocal GNOR free electron polarization field by P G . GNOR correctly predicts both the blueshift in the plasmon resonance frequency as well as the broadening of the absorption peak with decreasing particle size. A thorough discussion of the nonlocal Drude and GNOR models is given in Ref. [17].
Using in Eqs. 5 and 6 the GNOR definition for the velocity field, we obtain the following time-domain nonlocal-diffusive hydrodynamics model for the polarization field, which we hereafter refer to as the "GNOR" model: where η = β 2 + Dγ. It is from Eqs. 8 and 9 that we derive in the next section our FDTD update equations for implementing the two different models of the free electron response of a plasmonic material, one accounting for nonlocality only (Eq. 8), and the other nonlocality with diffusion (Eq. 9).
To properly model the optical response of many plasmonic materials, one must also include the contribution of bound electrons. We employ the LRA-based N-critical points model that assumes a susceptibility of the form where ε ∞ is the infinite frequency permittivity, and A p , Ω p , Γ p , and φ p are fitting parameters. This model can be readily transformed to the time-domain for FDTD implementation, as described in detail in Ref. [31].

III. UPDATE EQUATIONS FOR NONLOCAL FDTD
We derive in this section the FDTD update equations for the time-domain nonlocal Drude and GNOR models (Eqs. 8 and 9, respectively), using the ADE method. We discretize our domain via the Yee cell [27] where the electric fields are collocated with the polarization fields in time and space; the Yee cell positions of the electric, magnetic, and polarization fields used in this paper are listed in Table I. We denote the free electron polarization field by P f (which signifies either P N L or P G ), and the bound electron polarization field by P CP .
The FDTD update algorithm at time = n∆t, where ∆t is the time step size, consists of updating (in order) the: , via update equations f 1 , f 2 , f 3 , and f 4 , whose form and required inputs depend on the model from which they are derived. The equation f 1 is the magnetic field update derived through discretization of the Maxwell-Faraday equation [32]; we do not derive this here because in plasmonic simulations it is typically unchanged from the vacuum equation. In what follows, we present the electric field update equation f 2 (derived from the Maxwell-Ampère law), the bound charge polarization update equation f 3 (derived from Eq. 10), and the free charge polarization update f 4 (derived from Eq. 8 for the nonlocal Drude model, and Eq. 9 for the GNOR model).
Though we present f 4 for two nonlocal models, one must chose which to use -they cannot be used simultaneously.
We start by deriving f 4 from the nonlocal Drude model. Using central differencing, we discretize Eq. 8 centered at time n∆t, to obtain Further, using weighted central averaging as discussed in Ref. [30], we set E n = (E n−1 + 2E n + E n+1 )/4 to obtain where As the x, y, and z components of ∇(∇ · P n N L ) must be collocated with the x, y, and z components of P N L , we have where, for brevity, the components of P N L are written as P a for a = x, y, z. For the GNOR model, the f 4 update equation includes Eqs. 12-18 (with β 2 replaced by η) along with a suitable discretization for the last term of Eq. 9, the only term that does not appear in the nonlocal Drude model. As this term cannot be achieved via central differencing, we use an alternate second order finite difference scheme given by The GNOR f 4 update then becomes where The f 3 update equation for the bound charge density is derived from the critical points model (Eq. 10) in Ref. [31]. We state it here: where p denotes the p th critical point.
Next we turn to the f 2 update equation for E n+1 . Ampre's law is discretized in time to give For the nonlocal Drude model we set P f = P N L , and plug Eq. 12 into Eq. 30 to obtain For the GNOR model we set P f = P G , and plug Eq. 20 into Eq. 30 to obtain For the GNOR update, to avoid recalculation of ∇ ∇ · P n−1 G , and ∇ ∇ · P n−2 G , we store them in arrays; therefore GNOR requires additional memory.
The update equations derived above must be implemented in all cells in which the plasmonic material exists. However, care needs to be taken if the nonlocal material extends to the boundary of the simulation domain. Terminating a nonlocal material with a perfectly matched layer (PML) may result in instability and convergence issues. Within a total field/scattered field (TF/SF) framework -which is applicable to many plasmonic nanostructure scattering problems -this can be overcome by using the nonlocal model only in the total field region. If TF/SF cannot be used, such as for geometries that include a plasmonic substrate, one may use the LRA-Drude model in the PML region, and the nonlocal models everywhere else.
Finally, we would like to highlight that since the polarization fields are nonlocal, an additional boundary condition is required at the interface between the plasmonic and external media. In our FDTD implementation, we impose the Pekar additional boundary condition [33] by setting P f = 0 outside of the plasmonic structure (that is, by not updating P f outside the structure).

IV. PARALLEL IMPLEMENTATION OF NONLOCAL FDTD
In this section we describe a parallel FDTD scheme using the message passing interface (MPI) framework. Nonlocal simulations require a lot of memory since the step-size ∆x needs to be smaller than the Fermi wavelength. In turn, significant computation time is required because the Courant Friedrichs Lewy condition restricts the time-step ∆t according to ∆x. Parallel computing thus becomes essential.
The simulation domain is decomposed into n x × n y × n z MPI processes where n d is the number of MPI processes in the d direction, with d = x, y, z. Each MPI process is identified by a vector (m x , m y , m z ) which gives its relative spatial position within the simulation domain, where 0 ≤ m d < n d .
Each process performs field updates within its own subdomain (i.e., its section of the simulation domain) defined and field updates field updates according to a local grid with (N x + 1) × (N y + 1) × (N z + 1) points. The vector (i, j, k) identifies an individual grid cell within the local grid, where i ranges from 0 to N x , j from 0 to N y , and k from 0 to N z , inclusively. The electric and magnetic field components at (i, j, k) correspond to different locations in physical space within the grid cell, according to the Yee cell shown in Table I. For example, E x (i, j, k) refers to E x (i∆x + ∆x/2, j∆y, k∆z), whereas E y (i, j, k) refers to E y (i∆x, j∆y + ∆y/2, k∆z). The use of domain decomposition requires that data from the boundaries of subdomains be transferred to other subdomains at each time step. For subsequent updates to be executed efficiently, an overlap of information is required between adjacent subdomains. For example, the cells (N x , j, k) in subdomain (m x , m y , m z ) represents the same physical locations as the cells (0, j, k) in subdomain (m x + 1, m y , m z ). The update scheme we describe below guarantees that the update equations for a given field component at a given physical location are only applied once.
The components of the magnetic and electric fields in a subdomain are updated via f 1 and f 2 , respectively, however each is updated for different ranges of indices (i, j, k) as summarized in Table II. The notation 0 → N d means we update from index 0 to index N d , inclusively. Bound and free charge polarization fields are updated via f 3 and f 4 , respectively, for the same ranges of indices as for the electric field, as listed in Table II. To understand this visually, the field update regions for the x and y components of all fields within a single subdomain are illustrated in Fig. 1.
While the components of the magnetic, electric, and bound charge polarization fields in a subdomain all require (N x + 1) × (N y + 1) × (N z + 1) values to be stored, the free charge polarization fields requires (N x +2)×(N y +2)×(N z +2). This  is because the free charge polarization field updates via Eq. 18 requires information from additional cells. For example, to update P x,N L (0, j, k) via Eq. 18 (a), we need P x,N L (−1, j, k); in general, we need an extra cell in each dimension to store the −1 index. However, we do not calculate updates at this index, as (−1, j, k) in subdomain (m x , m y , m z ) obtain their values from a transfer of data from cells (N x − 1, j, k) in subdomain (m x − 1, m y , m z ). We now discuss the flow of the FDTD algorithm for the both the nonlocal Drude and GNOR models, including what data needs to be communicated, and when. This is summarized in Fig. 2, and we go through each step in detail in the following.
The first update to execute after the setup of the simulation is that of the magnetic field via f 1 . Once this has been completed in each subdomain, a subset of the magnetic field values at the subdomain boundaries must be communicated to adjacent subdomains. The necessary communications for H x are given in the first row of Table III; those for the other components of H can be obtained from this table by an ordered permutation of x → y → z. The inter-subdomain communications in the x and y directions are illustrated in the top part of Fig. 3. Note that the data transfers of the magnetic fields are all made in the "backward" direction. The inset in Fig. 3 details all data transfers along x and their relative Yee cell positions. For example, H y (0, j, k) updated locally in subdomain (m x + 1, m y , m z ) is passed to H y (N x , j, k) in subdomain (m x , m y , m z ).
After the magnetic field data transfer, the electric field is updated in each subdomain via f 2 , after which the bound and free charge polarization fields are updated in each subdomain via f 3 and f 4 , respectively. Subsequently, a subset of their values at the subdomain boundaries need to be communicated to adjacent subdomains. The necessary communications for E x are given in the second row of Table III; again, those for the other components are obtained by an ordered permutation of x → y → z. The inter-subdomain communications electric field in the x and y directions are also illustrated in the top part of Fig. 3, with further detail given in the inset. Unlike the magnetic field data transfers, those for the electric field are made in the "forward" direction. For example, E y (N x , j, k) updated in subdomain (m x , m y , m z ), is passed to E y (0, j, k) in subdomain (m x + 1, m y , m z ). Note that only the E and H components tangential to the direction of the data transfer need to be exchanged. In general, FDTD updates for LRA polarization models only require the collocated electric and polarization field values so that the polarization field values need not be communicated to other subdomains. Thus no communication is necessary for the bound charge polarization field P CP . This is not true for the nonlocal models. The necessary communications for the x component of P f are given in the third row of Table III and visualized in the top part Fig. 4. The communication for the y and z field components are obtained from Table III again via an ordered permutation of x → y → z. The inset of Fig. 4 shows the data transfers along x for all field components. Note that there are now four data transfers for every P f component -three "forward" and one "backward" -due to the increased data required for updates via Eq. 18. Thus, the inclusion of nonlocality doubles the number of communications required at each time-step. This has an effect on performance and scalability and is discussed further in Section V.
It is worth noting that MPI is not the only solution for parallel computing as new higher-level languages are being introduced for this purpose. Chapel, a language produced by Cray [29], allows for algorithm implemenation on a distributed system without the challenges of MPI. For example, Ref.
[34] presents a finite difference implementation of Poissons equation in Chapel. One may also use a shared memory implementation where the memory is shared amongst the processes (or threads) and therefore no data need be communicated. This can be implemented via OpenMP [35] for multi-threading on CPUs or via CUDA [36] or OpenCL [37] on graphics processing units (GPUs). Indeed, GPU-FDTD implementations are well reported in literature [38]. Since GPUs can launch thousands of parallel threads that all have access to shared memory, a GPU-based implementation of nonlocal FDTD requires no special treatment beyond what was presented in Section III. While GPUs do suffer from memory constraints, they can be still be useful for smaller nonlocal plasmonic simulations.

Hx
Transfer y = 0 plane backward to adjacent y = Ny plane Transfer z = 0 plane backward to adjacent z = Nz plane Ex Transfer y = Ny plane forward to adjacent y = 0 plane Transfer z = Nz plane forward to adjacent z = 0 plane

V. NONLOCAL FDTD APPLIED TO SMALL SPHERES
In this section, we test and validate our FDTD implementations of the nonlocal models by using them to simulate the optical response of small silver nanospheres. We compare our results to (1) analytic solutions based on Mie theory, (2) LRA FDTD plasmonic simulations that employ the LRA Drude model for free-electron response, and (3) experimental results from the literature. Not only do our nonlocal FDTD simulations agree well with analytic and experimental results, we also find an unexpected benefit over the LRA approach: a pronounced reduction of staircasing artifacts.
In Fig. 5, we plot the absorption efficiencies for silver nanospheres in vacuum for three different diameters -4 nm (blue), 6 nm (green) and 10 nm (red) -and different free charge polarization models. To model silver, we use the fitting parameters reported in Ref. [39] for ω p and γ in Eqs. 8 and 9, and for all critical points model parameters in Eq. 10. We set β 2 = 1/3v 2 F [17] where v F = 1.39 · 10 6 m/s [40]. The FDTD domain is a 200 cell × 200 cell × 200 cell box truncated by a convolutional perfectly matched layer (CPML) [32] consisting of 20 additional cells at each boundary. We use a uniform step-size of ∆x = ∆y = ∆z = D/100, where D is the diameter of the sphere. For the particle sizes of interest here, this guarantees both that the step size is less than the Fermi wavelength, and that the spherical shape is sufficiently resolved. The total number of iterations vary with step-size and therefore particle size. For a 10 nm diameter particle, 3 · 10 5 iterations are used to reach convergence; this is scaled appropriately for the other particle sizes.
In Fig. 5 (a), we compare our nonlocal Drude FDTD calculations (filled circles) with classical (LRA) Mie theory solutions (dashed lines), and with nonlocal Mie theory solutions (solid lines). Nonlocal Mie theory [41]- [43] is a modified version of Mie theory [44] that allows for longitudinal modes, which permits a non-zero free charge density within the nanoparticle. It has been successfully validated with experimental results for small particles larger than several nanometers. For example, a quasistatic version predicted absorption peaks that agree quantitatively with experiment for particle diameters down to 10 nm with a further qualitative agreement down to 2 nm [19].
We find in Fig. 5 (a) excellent agreement between nonlocal Drude FDTD and nonlocal Mie theory, with less than 2% mean error for all three particle sizes. Doubling the FDTD step-size increases the mean error for the 10 nm particle case to 6.4%. Halving the step-size decreases it to 1.3 %. Thus reasonable convergence is reached with reasonable simulation domain sizes. As expected, we see an increasing blue-shift in the absorption peak with decreasing nanoparticle size with respect to the LRA Mie theory solution.
In Fig. 5 (b) we compare our GNOR-FDTD calculations (filled circles) with LRA Mie theory solutions (dashed lines), and nonlocal diffusive Mie theory solutions (solid lines), where diffusion is accounted for by substituting β 2 with β 2 + Dγ − iDω in nonlocal Mie theory [17]. The free electron diffusion coefficient in silver is taken as D = 3.61 · 10 −4 m 2 /s [45]. We again see excellent agreement, with less than 1.6% mean error in GNOR-FDTD relative to nonlocal diffusive Mie theory for all three particle sizes. As expected the resonance positions predicted by the nonlocal Drude FDTD in Fig. 5 (a) and GNOR FDTD in Fig. 5 (b) are the same, with an increased line width for GNOR FDTD.
We now turn to examining the near field and free electron density distributions produced by nonlocal Drude FDTD, comparing to those produced by LRA FDTD. In Fig. 6 we show the electric field amplitude distribution for the simulations of the 4 nm diameter silver nanoparticle produced by (a) LRA FDTD and (b) nonlocal Drude FDTD, for the wavelength corresponding to the peak of the (nonlocal) absorption spectrum (λ = 343 nm). Shown are cuts in the xz plane through the centre of the particle, where the incident plane wave is zpolarized and propagates along the y-axis. The field amplitudes are normalized, corresponding to an input field of 1 V/m. For higher quality images, we used a halved step size of ∆x = D/ 200. LRA FDTD produces a constant electric field within the sphere, as expected, while nonlocal Drude FDTD produces a field gradient due to the non-zero free charge distribution.
Unexpected in Fig. 6 is the significant difference in the appearance of the fields at the particle boundary. While the effects of staircasing in LRA FDTD are clearly visible at the edges of the sphere in Fig. 6 (a), they appear smoothed out for nonlocal Drude FDTD in Fig. 6 (b). This is even more evident in Fig. 7 which shows the normalized electric field amplitude distribution produced by a) LRA FDTD and b) nonlocal Drude FDTD at λ = 425 nm, a wavelength where the absorption efficiency and near fields for both approaches are almost identical, except for dramatic differences at the sphere boundary. The staircasing-induced boundary artifacts seen for LRA FDTD are clearly reduced with nonlocal Drude FDTD. Artifacts at the particle boundary are especially problematic for calculations that involve fields just outside the particle, such as, for example, determining the plasmonic near field enhancement of fluorescence, or engineering the spontaneous emission lifetime of fluorophosphores [46]. This could also be important for calculations of plasmonics-enhanced nonlinear optics, where enhanced near fields close to plasmonic boundaries can be harnessed to not only enhance nonlinear optical processes by orders of magnitude, but also shape nonlinear optical fields [47]- [49]. With nonlocal FDTD, the reduction of staircasing artifacts would result in more reliable calculations.
In Fig. 8 we plot the free electron density distribution corresponding to the 4 nm silver particle simulations of Fig.  6. Shown are cuts in the xz plane through the center of the sphere for λ = 343 nm, as produced by (a) LRA FDTD and (b) nonlocal Drude FDTD. As expected, nonlocal Drude FDTD allows for the spread of the charge distribution near the particle boundary, which we see is on the order of the Fermi wavelength for silver, λ F = 0.5 nm. In contrast, for LRA FDTD the charge is bound to the surface.
We present in the Supplementary  cos(ω max t))/2] 3 , where ω max = 1.26 · 10 16 rad/s is the maximum frequency of interest (corresponding to λ = 150 nm). Included in the SI are movies of the electric field amplitude dynamics for nonlocal Drude FDTD in the xz and yz planes, movie1, and movie2, respectively, where both planes cut through the center of the particle. The comparable movies for LRA FDTD are movie3 and movie4. The nonlocal FDTD movies show a radially propagating wave inside the sphere whereas the LRA FDTD movies do not; in the latter, the field is almost always constant across the sphere (as expected). Further, while staircasing artifacts in the LRA FDTD movies are quite pronounced, they are hardly visible in the nonlocal Drude FDTD movies. The corresponding free electron density movie for nonlocal Drude FDTD is movie5. Since the free electron density does not need to be tracked within the LRA FDTD implementation, and all charge is strictly localized to the surface, we do not include a time-domain movie for this case.
We now turn to further validate our FDTD implementations by comparing our simulated results to those of experiment. In Ref. [13], electron energy loss spectroscopy measurements are presented for silver nanosphere diameters ranging from 2 to 24 nm on a carbon film and compared to Mie theory calculations that use size-dependent permittivities derived quantum mechanically. Using GNOR FDTD, we calculate the absorption spectra of silver nanospheres in a n = 1.3 dielectric background (as used in the calculations of Ref. [13]) with diameters ranging from 2 to 24 nm, and plot these in Fig. 9. We find the same trend as presented in Ref. [13], showing that GNOR FDTD is consistent with experimental measurements and calculations using quantum-based permittivities, with quantitative agreement down to 10 nm diameter, and qualitative agreement to 2 nm. As discussed in Ref. [17], for diameters less than 10 nm, the nonlocal model predicts a resonance shift that is not as large as determined by experimental measurements, which is consistent with our results. This may be due to more complicated phenomena occurring in silver, such as inhomogeneous equilibrium electron density, or spillout effects, that are not included in the GNOR model we have implemented.
Finally, we turn to a discussion of computational resources and scalability. To give an example, for the 10 nm, 6 nm and 4 nm sphere simulations presented in Fig. 5, we used 64, 128, and 256 cores, respectively, on the Graham cluster operated by Compute Canada [50]. Recall that the workload is heavier for the smallest particle since our step size ∆x (and thus time step size) is proportional to sphere diameter, and thus more iterations in time are required to reach convergence for the smallest particles. In general, when running simulations on a large number of cores, as we do for this paper, it is imperative to investigate the implementation's scalability, that is, the performance enhancement obtained by increasing the number of CPUs. In the ideal case, the scaling is linear, meaning when one doubles the number of processes, the computation time is halved; this is often not achieved due to overhead, including inter-processor communications.
In Fig. 10, we plot the number of FDTD time iterations calculated per hour as a function of the number of processes used in the nonlocal Drude FDTD (blue line) and LRA FDTD (red line) simulations. The simulation set up is the same as for Fig. 5, where we vary the number of processes while keeping the total number of cells constant. For ≤ 128 processes, the scalability of nonlocal Drude FDTD is nearly linear and comparable to LRA FDTD. While the scalability 40 nm 5 nm Fig. 11. Cross section in the xz plane of the star-shaped silver nanoparticle, consisting of a sphere with 40 nm diameter and triangular nanoprism extrusions extending 5 nm from the sphere surface. The incident plane wave used to illuminate this structure is z-polarized and propagates along the y axis.
of LRA FDTD remains mostly linear for a larger number of processes, that of nonlocal Drude FDTD does not, likely due to the doubled inter-process communication required, as discussed in Section IV. As one increases the number of processes while maintaining the same domain size, the subdomain surface-to-volume ratio increases, and therefore the interprocess communication time eventually becomes comparable to the computation time within a time step. Thus, if we were to consider a larger domain size, we expect the near-linear scalability to extend to larger process numbers.
VI. NONLOCAL FDTD APPLIED TO LARGER, COMPLEX

NANOPARTICLES
In this section, we demonstrate the benefit of our parallel FDTD implementation even further, by considering a much larger plasmonic structure containing small, sharp features. Such a structure would still be expected to exhibit nonlocal effects, and thus simulating it via FDTD would still require a small grid cell size, small enough to resolve electron dynamics within the sharp features. We consider as an example a structure inspired by the star-shaped nanoparticles synthesized for Ref. [51], which are spherical particles around 50 nm in diameter with small 5 nm extrusions from their surfaces that come to a sharp tip.
The structure we simulate is a silver nanosphere with triangular nanoprisms extruding from the equator in the xz plane, perpendicular to the incident plane wave propagation axis, as illustrated in Fig. 11. The sphere is 40 nm in diameter and the nanoprisms are embedded into the sphere so that the effective prism length is 5 nm. The tips of the nanoprisms are rounded with a radius of 1 nm. The FDTD step-size is uniform with ∆x = 0.1 nm and the domain size is 600 × 600 × 600 Yee cells (not including CPMLs). The simulations are run for 4 · 10 5 iterations over 1000 processes (10 × 10 × 10).
The absorption spectrum of this particle is shown in Fig.  12, calculated using both LRA FDTD (red) and nonlocal Drude FDTD (blue). There are two resonances attributed to the star-shaped particle. The peaks near 355 nm corresponds to the plasmonic resonance of the sphere, and these completely overlap for the two models. This is expected, as nonlocal effects should be negligible for particles larger than 20 nm. The peaks near 575 nm, however, correspond to the resonance of the nanoprisms, specifically the ones aligned parallel to the incident field polarization. Here we do see that nonlocal effects become important, as the peak wavelength predicted by nonlocal Drude FDTD is 10 nm blue shifted from that predicted by LRA FDTD. Additional differences between the two models manifest in the field amplitude distributions for wavelengths below the interband transition wavelength λ IB ≈ 330 nm. The electric field amplitude distributions at λ = 320 nm are shown in Fig. 13 for (a) LRA FDTD and (b) nonlocal Drude FDTD. Standing waves inside the vertical triangles, with a wavevector in the z-direction, are visible in Fig. 13 (b) for nonlocal Drude FDTD. This is a longitudinal mode [18], and is expected since its wavelength lies below the epsilon-near-zero wavelength, which in silver is approximately equal to λ IB . In the LRA, these modes cannot be excited by an incident transverse wave, and we see in Fig. 13 (a) that they are not. However, with nonlocality, such excitation can occur [18] due to the Pekar additional material boundary conditions, discussed in Section III.
These standing waves become more pronounced for isolated nano-triangles as shown in Fig. 14 (b) where nonlocal Drude FDTD was used, whereas they are absent in Fig. 14 (a), where LRA FDTD was used. The standing waves found for nonlocal FDTD are damped by interband transitions for wavelengths below 300 nm.
We now turn to examining the effect of staircasing for the star-shaped nanoparticle. We show in Fig. 15 a zoomedin view of the electric field amplitude distribution at λ = 475 nm, a wavelength where there is more contrast at the particle boundary than for the wavelength considered in Fig.  13. Despite a very fine mesh of ∆x = 0.1 nm, one can clearly see staircasing artifacts in the electric field amplitude distribution for LRA FDTD in Fig. 13 (a). These are notably reduced in Fig. 13 (b), where nonlocal Drude FDTD was used.
This reduction in staircasing artifacts is further illustrated by the time-domain movies of the star-shaped nanoparticle simulations presented in the SI. These depict the time evolution of the electric field amplitude and free electron density in the xz and yz planes through the center of the particle, where the incident plane wave is polarized in z and propagates along y. The electric field evolution movies for nonlocal Drude FDTD are movie6 and movie7, while those for the corresponding free electron density evolution are movie8 and movie9. The electric field evolution movies for LRA FDTD are movie10 and movie11.

VII. CONCLUSION
We have introduced a parallel FDTD implementation for modelling nonlocality in plasmonics. We used an auxiliary differential equation approach to model nonlocality with and without electron diffusion, and described in detail how to use the message passing interface framework for parallel computation via domain decomposition. After validating our implementation via comparisons with analytical and experimental results for small nanospheres, we demonstrated the robustness of our parallel implementation for larger particles with sharp nanoscale features. We find that the inclusion of nonlocality within FDTD significantly reduces staircasing artifacts that can plague standard plasmonic FDTD modelling based on the local response approximation (LRA). This suggests that beyond its importance for modelling nonlocality, nonlocal FDTD might be advantageous for calculations that require precise values for the fields at plasmonic boundaries. This includes, for example, determining plasmonic fluorescence enhancement, plasmonics-mediated fluorescent lifetime engineering, and plasmonics enhanced nonlinear optics.