A Multi-Scale Attention Network for Uncertainty Analysis of Ground Penetrating Radar Modeling

A multi-scale attention-based model (MSAM) is proposed as a surrogate model for uncertainty analysis (UA) in ground penetrating radar (GPR) simulation. Instead of running thousands of full-wave simulations, the surrogate model maps the uncertain inputs to electric fields, and the output uncertainty is quantified efficiently. A global feature aggregation (GFA) module and a local affinity reconstruction (LAR) module are presented to improve the representation capability of the model by computing affinities under different receptive fields. In addition, a new loss function is proposed to accelerate the convergence of the model on training data with a wider range of input disturbances. The effectiveness and accuracy of the surrogate model are verified by comparing its UA results with those of the Monte Carlo method (MCM). Compared with existing deep learning methods, the proposed method produces higher-quality predictions more efficiently. Meanwhile, the Sobol indices evaluated by MSAM accord with those of MCM, with a mean square error between them of only 0.0005, whereas the MCM needs to run the full-wave simulation one thousand times to converge, which is far more time consuming than the proposed surrogate model.


I. INTRODUCTION
One of the most efficient ways to research and analyze electromagnetic wave propagation in ground penetrating radar (GPR) systems is through simulation [1], [2], [3]. However, the lack of prior knowledge of the input parameters leads to uncertainty in the output of the simulation. Hence, although the final simulation results can be computed by available methods, it is vital to quantify the confidence level of the simulation results [4]. To achieve this goal, uncertainty analysis (UA) [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14] has been conducted. Among traditional UA methods, the most frequently used is the Monte Carlo method (MCM) [12], [13], [14]. It typically requires running thousands of full-wave simulations to converge, which brings a huge amount of computation [5]. To remedy this shortcoming of MCM, some relatively efficient UA methods were proposed in [4], [5], and [6]. The main drawback of those methods is that their computational complexities relate to the number of uncertain inputs, resulting in the curse of dimensionality. As mentioned in [4], the computational complexity of the modified intrusive generalized polynomial chaos expansion (gPCE) [8], [9], [10], [11], [14] method increases significantly with the number of uncertain input parameters. As a result, the improved intrusive gPCE method is not readily applicable in practice when the number of input parameters is greater than three.

Recently, with the rapid development of deep learning and artificial neural networks [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], a surrogate model based on fully connected neural networks (FCNN) was proposed for UA, which is trained to mimic the behavior of GPR simulation and then predict the simulation results [24]. The model places no restriction on the number of uncertain inputs and dispenses with prior knowledge of them, but its prediction accuracy depends heavily on a large amount of training data. Moreover, when the fluctuation of the uncertain inputs surpasses 10%, introducing more disturbances into the training data, the FCNN-based model fails. This demonstrates that the capacity and robustness of the model need to be improved. Furthermore, to improve the efficiency of a deep learning-based surrogate model, the number of training samples should be reduced.
In this article, a well-designed multi-scale attention-based model (MSAM) built on the attention mechanism [25] is applied to UA in GPR modeling. It exploits a global feature aggregation (GFA) module and a local affinity reconstruction (LAR) module to introduce global and local attention and generate a more informative feature representation of the input parameters. A loss function tailored for UA is also devised to supervise the model in learning the original distribution of the simulation outputs. To validate the effectiveness of the surrogate model, the UA result of MCM for a two-dimensional (2-D) GPR system generated by the auxiliary differential equation finite-difference time-domain (ADE-FDTD) approach is employed. Compared with the FCNN model [24], the convolutional neural network (CNN) model [26], [27], and the Transformer model [28], [29], [30], the results of MSAM are more consistent with those of MCM, and the simulation is more efficient. Meanwhile, the Sobol indices of MSAM are in high agreement with those of MCM. The accuracy and robustness of the proposed model are thus validated.

II. PROPOSED SURROGATE MODEL FOR UA
Fig. 1 depicts the general architecture of MSAM. The surrogate model is trained using the uncertain inputs and simulation results of ADE-FDTD. To maintain the position information of the uncertain inputs $\mathbf{x} \in \mathbb{R}^{L}$ (L = 7 corresponds to the number of uncertain inputs), a convolutional layer with a kernel size of 1 × K (K is generally equal to L) is used. Alternating GFA and LAR modules are employed to inject attention information at different scales into the original features, which strengthens the robustness and representational capability of the proposed model. Skip connections are added around the attention modules to preserve stability during training. Rather than executing hundreds of full-wave electromagnetic simulations, the trained MSAM is adopted as a surrogate model to predict the output $\mathbf{E}_z$ for UA problems.
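As a rough, self-contained PyTorch sketch of the embedding and skip-connection pattern described above (the channel width C = 100 comes from Section III; the padding choice and the generic `attn` placeholder standing in for a GFA or LAR module are our assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Wraps an attention module (a GFA or LAR module, sketched in
    Sections II-B and II-C) with the skip connection used for stability."""
    def __init__(self, attn: nn.Module):
        super().__init__()
        self.attn = attn

    def forward(self, x):
        return x + self.attn(x)  # skip connection around the attention module

# Position-preserving embedding of the L = 7 uncertain inputs with a
# 1 x K convolution (K = L), producing a (B, C, L) feature map:
L, C = 7, 100
embed = nn.Conv1d(1, C, kernel_size=L, padding=L // 2)
x = torch.randn(64, 1, L)   # a batch of 64 uncertain-input vectors
h = embed(x)                # (64, 100, 7) embedded features
```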

A. ADE-FDTD SIMULATION OF GPR SYSTEM
A 2-D GPR system is presented in Fig. 2; it is simulated with ADE-FDTD [3], [4], and the uniaxial perfectly matched layer (UPML) [3] is used as the absorbing boundary condition. In the GPR system, the underground region is filled with soil, which is considered a nonmagnetic medium with strongly dispersive properties in the operating frequency range of GPR. The transmitter and the receiver are denoted by Tx and Rx, respectively, and a Blackman-Harris pulse is used as the excitation [4]. The center frequency is f = 200 MHz, and $T_s = 1.55/f$. A solid metallic block with a size of 1 m × 1 m and a piece of dry granite with a size of 0.5 m × 0.5 m are buried in the dispersive soil, which is modeled by a two-term Debye model. There are assumed to be seven uncertain inputs $\sigma_s$, $\epsilon_\infty$, $\epsilon_s$, $A_p$, and $\tau_p$ (p = 1, 2), given in [3], where $\sigma_s$ is the static conductivity, $\epsilon_\infty$ is the electric permittivity at infinite frequency, $\epsilon_s$ is the static electric permittivity, $A_p$ is the pole amplitude, and $\tau_p$ is the relaxation time. The complex relative permittivity $\epsilon_r$ of the two-term Debye model is
$$\epsilon_r(\omega) = \epsilon_\infty + \sum_{p=1}^{2} \frac{A_p}{1 + j\omega\tau_p} + \frac{\sigma_s}{j\omega\epsilon_0}, \tag{1}$$
where $\omega$ is the angular frequency, $\epsilon_0$ is the electric permittivity in free space, and $j^2 = -1$. For the analysis of propagating waves in ADE-FDTD, three auxiliary variables are introduced for the electric field $E_z$ in (2), (4), and (5); the quantities $Z_i$ are associated with the x-, y-, and z-normal planes, respectively, and the details of $s_i$ and $\sigma_i$ are presented in [5]. Substituting (2), (4), and (5) into the update equations of ADE-FDTD (6)-(9), we obtain the electric field $E_z$ and its corresponding magnetic fields $H_x$ and $H_y$ by equation (8) [4], [24]. During the iterations of ADE-FDTD, the three auxiliary variables are represented as $L_z^k(x, y, \theta)$, $D_z^k(x, y, \theta)$, and $R_{pz}^k(x, y, \theta)$, and they are calculated by equations (6), (7), and (9). The grid size along the x- and y-directions is $\Delta x = \Delta y = 5$ mm, the time step is $\Delta t = 8.33$ ps, and $\theta$ represents the uncertain inputs.
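For concreteness, the reconstructed two-term Debye permittivity of Eq. (1) can be evaluated as in the following sketch; the nominal parameter values are placeholders, not the values used in the paper:

```python
import numpy as np

EPS0 = 8.8541878128e-12  # vacuum permittivity (F/m)

def debye_eps_r(omega, eps_inf, A, tau, sigma_s):
    """Complex relative permittivity of a multi-term Debye medium, Eq. (1)."""
    eps = eps_inf + sum(a / (1 + 1j * omega * t) for a, t in zip(A, tau))
    return eps + sigma_s / (1j * omega * EPS0)

# Illustrative (placeholder) two-term Debye soil parameters:
omega = 2 * np.pi * 200e6  # angular frequency at the 200 MHz center frequency
print(debye_eps_r(omega, eps_inf=4.0, A=[2.5, 1.0],
                  tau=[8e-9, 1e-10], sigma_s=1e-3))
```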
However, there are uncertainties in the parameters of the Debye model due to measuring tools and other factors. These uncertain inputs result in uncertainties in the outputs, which can affect the electromagnetic pulses and consequently the survey of the targets buried in the ground. The greater the variation in the inputs, the larger the uncertainties in the outputs. This demonstrates that quantifying the uncertainty in the simulation results should be considered [4]. For constructing the deep learning based surrogate model, the seven uncertain inputs and the 5000 time steps of the electric field value $\mathbf{E}_z$ comprise the input-output pairs for model training.

B. GLOBAL FEATURE AGGREGATION
Although the convolutional layers in the surrogate model can learn some meaningful embedded features, they still lack focus on important information. To enable the network to better extract key information from the features, the GFA module compresses the features along different directions to perform global information aggregation. The internal architecture of GFA is shown in Fig. 3. First, the input feature map $\mathbf{H} \in \mathbb{R}^{B \times C \times L}$ is compressed by two point-wise convolutional layers in different directions to obtain the feature matrices $\mathbf{H}_C \in \mathbb{R}^{B \times C \times 1}$ and $\mathbf{H}_S \in \mathbb{R}^{B \times 1 \times L}$, where B is the input batch size, L is the number of input parameters, and C is the number of feature channels, which controls the representational capacity of the model. Instead of pooling, point-wise convolution is used for feature reduction to lower the possibility of losing information. Then, the Sigmoid activation function is applied to $\mathbf{H}_C$ and $\mathbf{H}_S$ to obtain the attention matrices $\mathbf{C} \in \mathbb{R}^{B \times C \times 1}$ and $\mathbf{S} \in \mathbb{R}^{B \times 1 \times L}$ in the channel and spatial dimensions. By an outer product, the two attention matrices are combined into a global attention matrix $\mathbf{G} \in \mathbb{R}^{B \times C \times L}$. This matrix incorporates the attention information in both the spatial and channel directions, so it has a global receptive field. Finally, by multiplying the input $\mathbf{H}$ and $\mathbf{G}$ with the element-wise product, the globally important information in the input is highlighted in the matrix $\mathbf{R} \in \mathbb{R}^{B \times C \times L}$, and a following point-wise convolutional layer aggregates the highlighted information to produce a more meaningful feature representation.
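A minimal PyTorch sketch of this module follows. How the two "directional" point-wise convolutions are realized is an assumption (here, a 1 × 1 convolution applied after transposing the axis to be compressed), since the text does not spell out the exact layer configuration:

```python
import torch
import torch.nn as nn

class GFA(nn.Module):
    """Sketch of global feature aggregation for inputs of shape (B, C, L)."""
    def __init__(self, channels: int, length: int):
        super().__init__()
        self.to_c = nn.Conv1d(length, 1, 1)    # compress spatial axis -> H_C
        self.to_s = nn.Conv1d(channels, 1, 1)  # compress channel axis -> H_S
        self.fuse = nn.Conv1d(channels, channels, 1)  # final point-wise conv

    def forward(self, h):                                    # h: (B, C, L)
        h_c = self.to_c(h.transpose(1, 2)).transpose(1, 2)   # (B, C, 1)
        h_s = self.to_s(h)                                   # (B, 1, L)
        c, s = torch.sigmoid(h_c), torch.sigmoid(h_s)        # attention vectors
        g = c * s            # outer product -> global attention matrix (B, C, L)
        r = h * g            # highlight globally important information
        return self.fuse(r)  # aggregate the highlighted features
```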

C. LOCAL AFFINITY RECONSTRUCTION
While the GFA module implements long-range global attention, short-range local attention is also a key ingredient of the attention mechanism. The LAR module combines a sliding window with multi-head self-attention (MHSA) [28] to obtain multiple sets of local affinities, as shown in Fig. 4. First, LAR uses a 1 × 1 convolutional layer to generate the embedded feature $\mathbf{V} \in \mathbb{R}^{B \times C \times L}$ of the input $\mathbf{I} \in \mathbb{R}^{B \times C \times L}$. Second, a sliding window is applied to reconstruct the features in each local area. Specifically, the central feature $\mathbf{Q} \in \mathbb{R}^{B \times C \times 1}$ of the sliding window is extracted. To create varied attention relations, $\mathbf{Q}$ and the window feature $\mathbf{W}$ are reshaped into N heads to obtain the attention logits $\mathbf{P} \in \mathbb{R}^{B \times N \times K \times 1}$, which are then scored by the Softmax activation function in each head to obtain the attention matrix $\mathbf{A} \in \mathbb{R}^{B \times N \times K \times 1}$. Third, the central feature of each window is reconstructed into a new matrix $\mathbf{F} \in \mathbb{R}^{B \times C \times 1}$ by a dot product. To accelerate LAR, the L matrices $\mathbf{F}$ are computed in parallel. Through the reconstruction of all sub-feature matrices of the original feature matrix $\mathbf{I}$, the multiple $\mathbf{F}$ matrices form the matrix $\mathbf{F}_T \in \mathbb{R}^{B \times C \times L}$. A 1 × 1 convolution kernel is then used to fuse the feature information and obtain the output $\mathbf{U} \in \mathbb{R}^{B \times C \times L}$ of LAR. Compared with MHSA, LAR has only O(3N) computational complexity instead of O(N²), and its reduced receptive field is complemented by the efficient GFA module.
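The following PyTorch sketch illustrates one plausible reading of LAR; the window size K = 3, the use of V both for the query (the window center) and for the keys/values, and the 1/√d scaling are assumptions not fixed by the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAR(nn.Module):
    """Sketch of local affinity reconstruction for inputs of shape (B, C, L)."""
    def __init__(self, channels: int, heads: int, window: int = 3):
        super().__init__()
        assert channels % heads == 0 and window % 2 == 1
        self.h, self.k = heads, window
        self.embed = nn.Conv1d(channels, channels, 1)  # V = Conv1x1(I)
        self.fuse = nn.Conv1d(channels, channels, 1)   # fuse reconstructed features

    def forward(self, x):                              # x: (B, C, L)
        B, C, L = x.shape
        d = C // self.h
        v = self.embed(x)
        pad = self.k // 2
        win = F.pad(v, (pad, pad)).unfold(2, self.k, 1)     # (B, C, L, K) windows
        win = win.reshape(B, self.h, d, L, self.k)          # split into N heads
        q = v.reshape(B, self.h, d, L).unsqueeze(-1)        # central features Q
        logits = (q * win).sum(2, keepdim=True) / d ** 0.5  # affinity logits P
        attn = logits.softmax(dim=-1)                       # attention matrix A
        out = (attn * win).sum(-1).reshape(B, C, L)         # reconstructed F_T
        return self.fuse(out)                               # output U
```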

D. LOSS FUNCTION
For the training samples $(\mathbf{X}, \mathbf{E}_z)$, where B is the input batch size and T is the number of ADE-FDTD time steps (equal to 5000), the original mean square error cannot achieve the desired result here and leads to poor performance of all models in the experiment. A new loss function is therefore proposed in this article for the UA problem. The mean and variance, which are the crucial quantities in uncertainty quantification, are considered in the new loss function. The total loss is
$$\mathcal{L} = \mathcal{L}_{mse} + \mathcal{L}_{m} + \mathcal{L}_{v}, \tag{16}$$
where $\mathcal{L}_{mse}$ is the original MSE loss, $\mathcal{L}_{m}$ is the mean loss, and $\mathcal{L}_{v}$ is the variance loss, each computed between the predicted values and the ADE-FDTD outputs. The $\mathcal{L}_{m}$ on a mini-batch of training data is the difference between the mean of the surrogate model predictions and the mean of the ADE-FDTD outputs, and $\mathcal{L}_{v}$ is calculated in a similar way from the variances. The factor $\boldsymbol{\lambda}_t$ in $\mathcal{L}_{v}$ is the per-time-step variance of the total training data, which is used to accelerate the fitting of the variance of each mini-batch. To ensure numerical stability, the minimum value of an element of $\boldsymbol{\lambda}_t$ is set to its mean. By constraining the variance and mean of the predictions on a batch of data, the loss function ensures that the surrogate model's predictions have statistical properties similar to those of the ground truth.
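A sketch of one way to implement this loss in PyTorch is given below; the unweighted sum of the three terms and the exact normalization by $\boldsymbol{\lambda}_t$ are assumptions, since the full form of Eq. (16) is not reproduced here:

```python
import torch

def ua_loss(pred, target, lam_t):
    """Sketch of the UA loss of Eq. (16). Assumptions: the three terms are
    summed without extra weights, and lam_t divides the variance term.
    pred, target: (B, T) mini-batches; lam_t: (T,) per-time-step variance of
    the whole training set, floored at its mean for numerical stability."""
    l_mse = ((pred - target) ** 2).mean()
    # mean loss: match the per-time-step mean over the mini-batch
    l_m = ((pred.mean(0) - target.mean(0)) ** 2).mean()
    # variance loss: match the per-time-step variance, scaled by lam_t
    l_v = (((pred.var(0) - target.var(0)) ** 2) / lam_t).mean()
    return l_mse + l_m + l_v

# lam_t from the full training set E of shape (N, T), floored at its mean:
# v = E.var(dim=0); lam_t = v.clamp(min=v.mean())
```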

III. NUMERICAL RESULTS

A. UNCERTAINTY QUANTIFICATION
The variation in the seven input parameters $\epsilon_\infty$, $\epsilon_s$, $A_1$, $A_2$, $\tau_1$, $\tau_2$, and $\sigma_s$ is increased to 15% here and is generated through Latin hypercube sampling (LHS). Thus, the robustness of the surrogate model can be further tested when the GPR simulation outputs are more uncertain. To assess the effectiveness of the surrogate model, 100 pairs of $(\mathbf{X}, \mathbf{E}_z)$ were used as the training set and 900 pairs were used as the testing set. The training and testing sets for the four neural networks are identical. All surrogate models are first trained and tested on data with a 15% fluctuation range and then generalized to data with a 10% fluctuation range without retraining. For a fair comparison, all surrogate models use the same training process and our new loss function. We adopt the AdamW [31] optimizer with 0.01 weight decay to train these models for 15000 epochs with a batch size of 64. The optimizer's initial learning rate is 0.01 and is scheduled by cosine decay with warm-up. Before training, the inputs of all models are normalized. The mean and standard deviation of $\mathbf{E}_z$ calculated by MCM, FCNN, CNN, Transformer (labeled MHSA in the legends), and MSAM are presented in Fig. 5 and Fig. 6, respectively. The results of MCM are produced by conducting 1000 full-wave simulations. FCNN is implemented as in [24]. MSAM applies 4 and 10 heads in the two LAR layers, respectively, and the output channel dimension of GFA and LAR is 100. As shown in Fig. 7, the GFA and LAR modules are replaced by a 1 × 3 convolutional layer in the CNN. Likewise, in the Transformer, GFA, LAR, and BN are replaced by MHSA modules with the same numbers of heads as the original LAR layers; its structure is shown in Fig. 8. Details of the four surrogate models are given in Table 1: ''N_Trainable Parameters'' is the number of trainable parameters of a surrogate model; ''Total Mult-Adds'' is the total number of multiply-add operations required when a model predicts an output, which measures its computational complexity; and ''Params Size'' is the computer storage consumed by the parameters of the network during training. Table 2 shows the CPU time consumption of the four surrogate models using the same training and testing sets.
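For reference, LHS samples with a ±15% range can be drawn and the stated optimizer settings configured as sketched below; the nominal parameter values, the warm-up length, and the stand-in model are illustrative assumptions:

```python
import numpy as np
import torch
from scipy.stats import qmc

# Seven nominal soil/Debye parameters (placeholder values, not the paper's):
nominal = np.array([4.0, 8.0, 2.5, 1.0, 8e-9, 1e-10, 1e-3])
sampler = qmc.LatinHypercube(d=7, seed=0)
u = sampler.random(n=1000)                        # (1000, 7) samples in [0, 1)
X = qmc.scale(u, nominal * 0.85, nominal * 1.15)  # +/-15% fluctuation range

# AdamW with 0.01 weight decay and cosine decay with warm-up, as in the text
# (a 500-epoch warm-up is assumed; a dummy model stands in for the surrogate):
model = torch.nn.Linear(7, 5000)
opt = torch.optim.AdamW(model.parameters(), lr=0.01, weight_decay=0.01)
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    [torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01, total_iters=500),
     torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=14500)],
    milestones=[500])
```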
Fig. 5 and Fig. 6 indicate that, with the same 100 training samples, MSAM produces better and smoother predictions than the other deep learning methods and agrees excellently with MCM. The mean values obtained by all methods are in high agreement with MCM, as shown in Fig. 5, which indicates that the mean of $\mathbf{E}_z$ can be easily learned by deep learning models through training, and also proves the importance of the factor $\boldsymbol{\lambda}_t$. According to Table 2, MSAM is more efficient than the Transformer and FCNN, with a slight decrease in training and inference speed compared with the CNN. The time consumption in Table 2 does not include the time required for ADE-FDTD to create the 100 training samples, which is typically a lengthy procedure; therefore, the less training data a surrogate model requires, the more efficient the strategy is. Compared with [1], the proposed method uses far less training data, so it is more appropriate for practical applications. Meanwhile, the number of model parameters and the computation of MSAM shown in Table 1 are relatively low. As a result, the proposed deep learning approach can address the UA problem more accurately and effectively. We also observe that when the input parameters vary within a range of 15%, all models have trouble fitting the variance of the source data if trained with the traditional MSE loss function, a problem solved by our new loss function.
We reduce the random variation range of the seven input parameters to 10% and create 1000 data samples to examine the resilience of MSAM to the fluctuation range of the input parameters in the GPR system. Fig. 9 and Fig. 10 show the mean and standard deviation of $\mathbf{E}_z$ predicted by the four surrogate models without retraining. The results show that the time-domain waveform changes when the fluctuation range of the input parameters is changed, but MSAM still maintains a high agreement with the MCM method in terms of the statistical properties of the predictions, which proves that the proposed method has good adaptability.

B. SENSITIVITY ANALYSIS
Sensitivity analysis (SA) of a model is intended to determine the relative importance of each input parameter. To further validate the surrogate model's consistency with MCM, we perform a global sensitivity analysis using the Sobol method [32], [33], [34]. The importance of the input parameters can be represented by the first-order Sobol sensitivity indices, formulated as
$$SI_i = \frac{V_i}{V_{total}}, \quad V_i = \operatorname{Var}_{X_i}\!\left(\mathbb{E}\left[\,f(\mathbf{X}) \mid X_i\,\right]\right), \quad i = 1, 2, \ldots, L,$$
where $SI_i$ represents the main effect of input parameter $X_i$'s contribution to the output variance, $V_{total}$ is the total variance of the model output Y, $V_i$ is the output variance from varying $X_i$ alone, and f is the model for predicting electric fields. For ADE-FDTD, we created one thousand samples with 15% fluctuations, assumed to follow a Gaussian distribution, by LHS. MCM computes the Sobol sensitivity indices using the 1000 simulation results of ADE-FDTD. Simultaneously, the trained surrogate model predicts the output of ADE-FDTD using the same LHS samples and estimates the Sobol sensitivity indices without repeating the full-wave simulation 1000 times. Fig. 11 shows the contribution of each input variable for the two methods. The contribution order of the inputs in MSAM matches that of the MCM method well. Such consistency proves the validity of the proposed method and the representational capability of the attention modules. In addition, the CNN also maintains some consistency with MCM, while the other surrogate models appear not to learn useful information. To further compare the results obtained by MSAM and CNN, the difference between the contribution of each parameter in the two methods and in MCM is measured using the MSE: the difference between MCM and MSAM is 0.0005, while the difference between MCM and CNN is 0.0019. The results of MSAM are thus more accurate than those of CNN.
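As a sketch of how such indices can be estimated with a trained surrogate, the SALib package can be used as below; the parameter names, bounds, and the `surrogate` callable (here a dummy scalar-output function) are placeholders, not the paper's setup:

```python
import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol

# Placeholder problem definition: 7 inputs with +/-15% bounds around
# illustrative nominal values (not the paper's actual values).
nominal = np.array([4.0, 8.0, 2.5, 1.0, 8e-9, 1e-10, 1e-3])
problem = {
    "num_vars": 7,
    "names": ["eps_inf", "eps_s", "A1", "A2", "tau1", "tau2", "sigma_s"],
    "bounds": [[0.85 * v, 1.15 * v] for v in nominal],
}

X = saltelli.sample(problem, 1024)   # Saltelli sampling scheme
# `surrogate` stands for the trained MSAM; a dummy scalar output (e.g., the
# field energy at one time step) is used here for illustration.
surrogate = lambda x: (x ** 2).sum(axis=1)
Y = surrogate(X)

Si = sobol.analyze(problem, Y)
print(Si["S1"])                      # first-order Sobol indices
```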

IV. CONCLUSION
This article presents a new surrogate model for UA in GPR simulations based on the attention mechanism. Attention modules at two different scales are utilized to enhance the model's ability to extract key information under different receptive fields. Furthermore, a new loss function for UA is developed, resulting in improved predictions with fewer training samples than in prior art. The proposed surrogate model can quantify uncertainty in GPR simulation results with large variations in each input parameter and generalizes to different parameter variation ranges. The prediction and sensitivity analysis results agree well with those of the traditional UA method MCM, which needs one thousand full-wave simulations. In comparison with other deep learning based UA methods, the proposed method outperforms them in accuracy while having lower computational complexity and time consumption. Future work will focus on extending the proposed approach to more complex GPR systems.

FIGURE 1. General architecture of MSAM.

FIGURE 2. 2-D GPR system with the dispersive and lossy soil.

FIGURE 3. GFA module architecture, where Conv is the convolutional layer.

FIGURE 4. LAR module architecture, where Conv is the convolutional layer.

FIGURE 5. Mean of normalized $\mathbf{E}_z$, where the variation in each uncertain input is 15% and the number of training samples for each surrogate model is 100.

FIGURE 6. Standard deviation of normalized $\mathbf{E}_z$, where the variation in each uncertain input is 15% and the number of training samples for each surrogate model is 100.

FIGURE 7. CNN model architecture, where Conv is the convolutional layer.

FIGURE 8. The architecture of the Transformer, where Conv is the convolutional layer.

FIGURE 9. Mean of normalized $\mathbf{E}_z$ without retraining the surrogate model, where the variation in each uncertain input is 10%.

FIGURE 10. Standard deviation of normalized $\mathbf{E}_z$ without retraining the surrogate model, where the variation in each uncertain input is 10%.

FIGURE 11. Sobol sensitivity indices for MCM and surrogate models.

TABLE 1. Details of the four surrogate models.

TABLE 2. CPU time consumption of the four surrogate models.