A Gaussian Process Based Approach for Validation of Multi-Variable Measurement Systems: Application to SAR Measurement Systems

Resource-efficient and robust validation of systems designed to measure a multi-dimensional parameter space is an unsolved problem, as comprehensive validation coverage would require millions of test permutations. In this paper, an efficient and comprehensive validation approach based on a Gaussian Process (GP) model of the test system is developed that operates system-agnostically, avoids calibration to a fixed set of known validation benchmarks, and supports large configuration spaces. The approach consists of three steps that can be performed independently by different parties: 1) GP model creation, 2) model confirmation, and 3) targeted search for critical cases. It has been applied to two systems that measure specific absorption rate (SAR) for compliance testing of wireless devices and apply different SAR measurement methods: a probe-scanning system (per IEC/IEEE 62209-1528) and a static sensor-array system (per IEC 62209-3). The results demonstrate that the approach is practical, feasible, suitable for proving effective equivalence, and applicable to any measurement method and implementation. The presented method is sufficiently general to be of value not only for SAR system validation, but also in a wide variety of applications that require critical, independent, and efficient validation.


I. INTRODUCTION
The operation of wireless devices close to the body results in millions of distinct induced electromagnetic (EM) field distributions. Accurate assessment of these fields is a complex task for which different measurement approaches have been proposed and defined in two different compliance testing standards. In this study, we have derived the key requirements to ensure reliable, independently-verifiable, implementation-agnostic, and comprehensive validation. For that purpose, a general, Gaussian Process (GP) model-based approach has been developed and is under discussion for adoption in upcoming revisions of the specific absorption rate (SAR) measurement standards [1], [2]. The data and the source code of an implementation of the procedure are available at [3], and a corresponding application is accessible through a graphical user interface (GUI) at [4]. The approach, which is elaborated and demonstrated in this paper, is believed to be widely applicable, i.e., well beyond SAR system validation, in situations where similar validation requirements exist.

A. SAR MEASUREMENT SYSTEMS
SAR measurement systems measure the induced electric field distribution in phantoms that represent the user of a commercial wireless device. That wireless device is operated in close proximity to the phantom, which contains a tissue-simulating medium. At any location in the phantom, the SAR is related to the root-mean-squared value of the induced E-field, E, in the medium through the relation

SAR = σ|E|² / ρ,

where σ is the frequency-dependent electrical conductivity and ρ = 1000 kg/m³ is the mass density of the medium. The induced E-field is temporally and spatially assessed in the phantom. Then, spatial and time averaging of the SAR is applied, followed by the identification of the peak spatial- and time-averaged SAR (psSAR). This value is compared against the psSAR limits set by safety guidelines such as those of the International Commission on Non-Ionizing Radiation Protection (ICNIRP), which have been adopted by many regulators. ICNIRP limits are also endorsed by the European Council [5] and adopted in the harmonized standards of the European Committee for Electrotechnical Standardization (CENELEC) [6], [7]. Before a wireless device can be sold on the market, the manufacturer is required to demonstrate that the psSAR is below the limit in all tested conditions defined by the measurement standard. ICNIRP has set a psSAR limit of SAR_10g = 2 W/kg averaged over a 10 g cubic mass, applicable to the general public for localized exposures of the head and torso at frequencies from 100 kHz to 6 GHz [8]. Some countries, such as Canada, India, and the USA, have adopted a more stringent psSAR limit of SAR_1g = 1.6 W/kg averaged over a 1 g cubic mass in accordance with IEEE Standard C95.1-1999 [9]. Other limits apply to the limbs, and there are different limits for occupational exposure and whole-body exposure.
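The point-SAR relation above is straightforward to express in code. The following helper is purely illustrative (the function name and example values are not taken from the standards); it assumes the rms field magnitude in V/m, conductivity in S/m, and density in kg/m³:

```python
def point_sar(e_rms, sigma, rho=1000.0):
    """Point SAR (W/kg) from the rms induced E-field magnitude (V/m),
    the electrical conductivity sigma (S/m), and the mass density rho (kg/m^3)."""
    return sigma * e_rms ** 2 / rho

# Example: |E| = 100 V/m rms in a medium with sigma = 1 S/m
# point_sar(100.0, 1.0) -> 10.0 W/kg
```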
The two commonly used SAR measurement systems are scanning systems and array systems. Both systems will be studied in this paper using commercially available products: the scanning system DASY8 [10] and the array system cSAR3D [11], both manufactured by Schmid & Partner Engineering AG (Zurich, Switzerland), as shown in Fig. 1. Scanning systems use a robot to mechanically scan a single isotropic E-field or dosimetric probe throughout the human phantom (e.g., head, torso, or wrist phantom) [12]. Scanning systems are versatile in that the robot can position the probe anywhere in the phantom and at the required density, avoiding the need for field reconstruction. Irregular grids are commonly used to locally improve the resolution in the relevant region of highest SAR. The dosimetric E-field probe is calibrated in well-controlled induced fields [13]. Robot positioning repeatability is less than 0.05 mm for DASY8 and thus does not significantly affect the measurement accuracy of the peak spatial-average SAR.
Array systems use a large number of sensors in a fixed grid to electronically scan the field in the phantom without any moving parts and are therefore much faster than scanning systems. However, array systems have higher measurement uncertainty. Mutual coupling between the sensors limits how close the sensors can be to each other, which restricts the measurement resolution. The material inside the phantom (sensors, transmission lines, and supporting structure) causes polarization- and distribution-dependent scattering of the induced fields in the phantom, making it difficult to fully characterise and remove scattering-related effects during calibration. Scattering also restricts the sensors to a single plane conformal to the phantom surface, so that array systems are dependent on field reconstruction algorithms to estimate the SAR at locations that cannot be measured. Having no moving parts allows cSAR3D to be hermetically sealed and to use a gel instead of a liquid to prevent separation of ingredients over time. The flat cSAR3D phantom studied in this paper has 1024 sensors that measure the three orthogonal field components over a 120 mm x 240 mm measurement area.

B. REQUIREMENTS FOR VALIDATION OF SAR MEASUREMENT SYSTEMS
The requirements for scanning systems and array systems are defined in standards IEC/IEEE 62209-1528 [1] and IEC 62209-3 [2], respectively. IEC/IEEE 62209-1528 is based on a standard that was released in 2001 and has been updated several times to account for changes in wireless technology, measurement system technology, and regulatory requirements. It is broadly accepted by national regulatory agencies. IEC 62209-3 was released in 2019, but regulators have faced difficulties adopting it. The lack of and need for formal demonstration of equivalence have been cited as primary concerns regarding adoption of IEC 62209-3.
Both standards include validation requirements. IEC/IEEE 62209-1528 [1] requires validation of each system component separately and has proven to be very robust. However, such an approach is not possible in IEC 62209-3 [2], as array systems are implemented as sealed boxes without access to the individual components, necessitating a different approach. Furthermore, array systems inherently have a much larger number of uncertainty-affecting degrees-of-freedom (and potential sources of failure), due to the large number of sensors that are independently calibrated for the large number of power levels, frequencies, and modulations needed to accurately measure any mobile wireless communication signal. The standards committee was faced with the issue that a comprehensive set of tests would include millions of configurations, which cannot be practically implemented. The reduced set of standardized exposure conditions currently listed in IEC 62209-3 has proven insufficient, as comparison studies using different array systems show large deviations in the measured psSAR (as much as 5 dB [14], [15]). This has prompted the Joint Research Centre of the European Commission to recommend improving the IEC 62209-3 validation [16].
A validation method is needed that satisfies the following requirements, in order to establish equivalence of the standards and drive universal acceptance of SAR measurement systems:
• it is universally applicable to any SAR measurement system (device agnostic),
• it is able to use knowledge of the measurement system to reduce validation effort,
• it ensures that any successfully validated system performs within the reported measurement uncertainty (as prescribed by corresponding standards) for any wireless device,
• it empowers the test lab to confirm the validation independently of the system manufacturer,
• it identifies critical test conditions that maximize the likelihood of detecting inadequate measurement device performance,
• it can be performed with a reasonable effort, to permit re-validation on a periodic basis (or whenever the system is relocated, or hardware or software components of the system change),
• it comprehensively covers the space of all relevant exposure conditions and configurations,
• it is readily extendable as wireless technologies evolve.
To simultaneously satisfy the requirements of comprehensive coverage and reasonable effort, it is necessary to introduce a stochastic component to the approach. Using a subset of all validation conditions selected by a procedure involving stochastic elements each time the validation is performed avoids bias and ensures increasingly comprehensive coverage over time. It also has the added benefit of preventing system manufacturers from primarily calibrating the system in view of a priori known validation configurations such that it would pass the validation but exceed the reported uncertainty.

C. A THREE-STEP VALIDATION APPROACH
This paper presents a three-step approach that satisfies all the above requirements and demonstrates its successful application to both scanning systems and array systems. At its center is the elaboration of a surrogate model that estimates the expected measurement error (and the confidence interval associated with that estimate) of the investigated system for a given exposure configuration. An important note is that it is a model of the measurement system error and not of the measurement output itself. The method consists of three independent steps, which are each described in detail in the corresponding Methods sections:
1) Model creation: elaboration of the surrogate model using a comprehensive set of measurements,
2) Model confirmation: independent confirmation that the surrogate model established in Step 1 is valid; otherwise, the model must be revised,
3) Targeted search for critical cases: search of the configuration space for regions with non-negligible likelihood of exceeding the maximum permissible error (MPE) using the confirmed model to maximize detection probability, and testing the identified configurations to ensure that the measurement system performs with the required accuracy.
In the context of SAR measurement system validation, model creation would typically be performed by the system manufacturer, while independent model confirmation and the targeted search for critical cases are typically performed by a test lab or the system user (at least once per year; the acceptable effort for each party should be within 1 day). Gaussian Process (GP) modeling is a classical surrogate modeling approach that is capable of estimating system responses on a continuous parameter space with zero-centered error values and of providing associated interpolation uncertainties based on any given set of known values at specific locations. Such a model will be referred to as the 'GP model' or 'surrogate model'. Background information on geometric GP models can be found in the Supporting Information section. For extended background we refer to, e.g., [17] and [18].
The application of the developed approach to a scanning system and an array system in the present study demonstrates the feasibility (with acceptable effort), sensitivity, and generality of the proposed approach. Indeed, the method is sufficiently general to be of value in a wide variety of applications beyond SAR systems that require critical, independent, and efficient validation of system performance.
This method has been proposed for adoption in the next revision of IEC 62209-3 and for a future unified standard that incorporates both IEC 62209-3 and the current IEC/IEEE 62209-1528.

D. STUDY GOALS
The principal goal of this study is to identify and demonstrate a practical, robust, trustworthy, efficient, and comprehensive solution 1) to the specific problem of SAR system validation and 2) to the more general one of efficiently and reliably validating systems that are expected to perform within their uncertainty bounds everywhere in the configuration space. It is not the goal to develop novel GP modeling theory (other than the extensions needed for the search algorithms, such as the δ_p(l) function derived in the Supporting Information), nor to identify the best possible approach to GP model creation, validation, and exploitation. The methods applied here were selected based on accessibility and effectiveness for the task at hand, while respecting practical SAR measurement constraints such as: a) being non-iterative (i.e., all measurements are performed in a single session); b) ability to reduce the continuous validation space to a discrete set of measurable configurations (e.g., test labs can only have a finite number of validation antenna sources); and c) respecting the physical parameters of the measurement space (sensor resolution and accessible sensor region).

II. METHODS
This section presents the background of the proposed approach, before introducing the investigated generic application and the real-world applications that build upon the test configuration space of [1], [2].

A. STEP 1: MODEL CREATION
In the first step, a GP model of the deviations from the target values is constructed from a measured sample set. The Supporting Information provides background on GP modeling theory, establishes the employed notations, and defines geometric GP models based on a finite set S of known configurations in the n-dimensional parameter space X ⊂ R^n (whose measured results are denoted S̄ ⊂ X × R, with the added component for the measured values). Many GP model creation approaches exist and can be employed, as long as they are capable of conservatively estimating variances. For this study, we assume a geometric GP model, based on ordinary kriging, with the Matheron estimator for the semivariance, a Gaussian theoretical variogram model, and the default variogram fitting algorithm from the SciKit-GStat package [19]. These choices were based on the analysis of the collected data on SAR measurement systems, but alternative choices are of course possible.
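The study relies on SciKit-GStat for variogram fitting and kriging. Purely to illustrate the underlying mechanics, the following self-contained sketch implements ordinary kriging with a Gaussian theoretical variogram (the function names and parameter values are our own illustrative choices, not the SciKit-GStat API; the variogram-form kriging system is standard):

```python
import numpy as np

def gaussian_variogram(h, r, s, nugget=0.0):
    # Gaussian theoretical variogram with effective range r, sill s, and nugget
    return nugget + (s - nugget) * (1.0 - np.exp(-3.0 * (h / r) ** 2))

def ordinary_kriging(X, y, x0, r, s, nugget=0.0):
    """Ordinary-kriging prediction (y_hat, variance) at x0 from samples (X, y).

    Solves the variogram-form system [[Gamma, 1], [1^T, 0]] [w; mu] = [g0; 1],
    where Gamma holds pairwise semivariances and g0 the semivariances to x0;
    the kriging variance is w . g0 + mu.
    """
    k = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.ones((k + 1, k + 1))
    A[:k, :k] = gaussian_variogram(d, r, s, nugget)
    A[-1, -1] = 0.0
    b = np.ones(k + 1)
    b[:k] = gaussian_variogram(np.linalg.norm(X - x0, axis=-1), r, s, nugget)
    sol = np.linalg.solve(A, b)
    w, mu = sol[:k], sol[-1]
    return float(w @ y), float(max(w @ b[:k] + mu, 0.0))
```

With a zero nugget the predictor interpolates exactly: at a known sample location it returns the measured value with (numerically) zero variance, which is the behavior the model-confirmation step relies on.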

B. STEP 2: MODEL CONFIRMATION
Given a geometric GP model, a statistical validation procedure is performed in the second step, after which the model is either rejected or considered trustworthy. This procedure shall be referred to as the model confirmation procedure; it can be performed as a series of successive tests, where each success leads to the next test, and the model is considered valid if and only if the procedure reaches its end. The procedure can be divided into two main phases: goodness of fit and residuals validation. Statistical model validation is a well studied field and valuable information can be found, e.g., in [20]. The metrics and tests chosen for this study are commonly employed in statistics.

Goodness of Fit
The goal is to assess how well the model's theoretical variogram γ fits the empirical semivariogram γ̂ = {γ̂_j}, j = 1, ..., k, built from the known sample set S̄ = {(x_i, y(x_i))} ⊂ X × R. For each j = 1, ..., k, one defines γ_j to be the value returned by γ at the lag corresponding to γ̂_j. As opposed to the often-used mean absolute error (MAE) or root mean square error (RMSE), the normalized root mean square error (NRMSE) is used, as its magnitude does not depend on the sill.
The GP model passes the goodness of fit test when NRMSE ≤ α for a given acceptance value α (typically in the interval [0.1, 0.3], see Parameter Choices).
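The exact normalization formula is not reproduced above; one common convention, which makes the value independent of the sill, divides the RMSE by the spread of the empirical semivariances. A minimal sketch under that assumption:

```python
import numpy as np

def nrmse(gamma_emp, gamma_model):
    """NRMSE between empirical semivariances and the fitted theoretical
    variogram, normalized by the empirical spread (one common convention;
    the normalization removes the dependence on the sill)."""
    gamma_emp = np.asarray(gamma_emp, dtype=float)
    gamma_model = np.asarray(gamma_model, dtype=float)
    rmse = np.sqrt(np.mean((gamma_model - gamma_emp) ** 2))
    return rmse / (gamma_emp.max() - gamma_emp.min())
```

Because both the RMSE and the normalizing spread scale linearly with the data, rescaling all semivariances (e.g., doubling the sill) leaves the NRMSE unchanged.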

Residuals Validation
The goal is to evaluate how well a fitted model generalizes to other samples T ⊂ X whose measured set is T̄ ⊂ X × R. In practice, such a test sample is obtained either as the test subset of a train/test cross-validation partition of a larger known sample, or as a set of measurements acquired independently from the initial sample that was used to build the model. The residuals of T̄ against the given model are the differences between the measured values y(x) and the corresponding predictions ŷ(x). If the residuals appear to behave randomly and follow the expected distribution, it suggests that the model successfully represents the underlying data. On the other hand, if non-random structure is evident in the residuals, or if their distribution is not the one predicted by the model, it is a sign that the model poorly fits reality.
The validity of the model is evaluated based on a randomly distributed test sample T̄ = {(x, Y(x)) : x ∈ T} ⊂ R^n × R such that:
• T is independent from S, with S ∩ T = ∅,
• T is locally (stratified) randomly uniformly distributed, in the sense that an element of T must be randomly picked within a predefined neighborhood (for example within its square if T is generated via Latin hypercube sampling (LHS) [21], [22]).
This last point differs from S, where global evenness (as opposed to local randomness) is a necessity. The evenness of T is less important, while its randomness is crucial. The independence between T and S is equally crucial. In the extremely unlikely case where some configurations x belong to S ∩ T after generating T, they should not be used, or T should be resampled, as zero residuals are not allowed in the next phase.
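A stratified-random sample of the kind required for T can be generated along the following lines (a minimal Latin hypercube sketch on the unit cube; in practice the coordinates would be mapped onto the physical parameter ranges, and dedicated LHS libraries may be preferable):

```python
import numpy as np

def latin_hypercube(n, dims, seed=None):
    """n stratified-random points in [0, 1)^dims: each dimension is split
    into n equal strata, one point is drawn uniformly inside each stratum,
    and the strata are randomly permuted per dimension (LHS)."""
    rng = np.random.default_rng(seed)
    u = (np.arange(n)[:, None] + rng.random((n, dims))) / n
    for d in range(dims):
        u[:, d] = rng.permutation(u[:, d])
    return u
```

By construction, every dimension contains exactly one point per stratum, which provides the global evenness, while the uniform draw inside each stratum provides the local randomness demanded of T.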
For every x ∈ X, the kriging function returns a pair (ŷ(x), ê(x)), where Y(x) knowing Y(S) follows a normal distribution of mean ŷ(x) and variance ê(x)². Because ŷ(x) is an unbiased linear combination of the elements of y(S), with ê(x)² the variance Var(Ŷ(x) − Y(x)), the random variable (Y(x) − ŷ(x)) / ê(x) follows the standard normal distribution. Consequently, many well known statistical tests are applicable. The suggested procedure is as follows:
1) Choose a set T such that T ∩ S = ∅ and T contains 50 elements that are evenly distributed across X with a local randomly generated independent and identically distributed (iid) component (i.e., T is obtained through stratified random sampling),
2) Measure the set T and build T̄ by recording the measured values,
3) Use the Shapiro-Wilk test (see [23]) to determine the normality of Ȳ(T) with acceptance criterion α = 0.05,
4) Make a QQ-plot of Ȳ(T) versus N(0, 1) on the [0.025, 0.975] inter-quantile range, and compute the slope and location of its linear regression fit,
5) As the QQ-plot is expected to be linear based on the above step, use the QQ-location µ (the QQ fit value at zero) and the QQ-scale σ (the QQ fit slope) to determine standard normality with acceptance |µ| ≤ 1 and 0.5 ≤ σ ≤ 1.5 (see Sec. II-E).
A bad QQ-location indicates a poorly calibrated model. The QQ-scale is the standard deviation of the residuals relative to the ideally expected standard deviation of N(0, 1) (i.e., 1). Slope values below the ideal 1 are an indication of model conservativeness (overestimation) and thus are more acceptable than values above 1 (hence the 1.5 tolerance factor above vs. the factor of 2 below). In practice, a non-negligible NRMSE can be the cause of slope deviation and, therefore, the interpolation error tolerance should increase with increasing NRMSE. By studying different failure scenarios (e.g., artificially constructed, non-conservative GP models), the formula ē(x) = ê(x) · (1 + NRMSE) has been heuristically found to be suitable.
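Steps 3)-5) above can be sketched compactly with SciPy (an illustrative implementation; the helper name `confirm_residuals` and the plotting-position convention are our own choices, while the acceptance thresholds mirror those stated above):

```python
import numpy as np
from scipy import stats

def confirm_residuals(y_meas, y_pred, e_pred, alpha=0.05):
    """Normalize residuals by the kriging standard deviations, test normality
    (Shapiro-Wilk), and fit a line to the QQ-plot restricted to the
    [0.025, 0.975] inter-quantile range; return (ok, p_value, qq_loc, qq_scale)."""
    z = (np.asarray(y_meas) - np.asarray(y_pred)) / np.asarray(e_pred)
    _, p_value = stats.shapiro(z)                 # normality of the residuals
    n = len(z)
    probs = (np.arange(n) + 0.5) / n              # plotting positions
    keep = (probs >= 0.025) & (probs <= 0.975)
    q_theory = stats.norm.ppf(probs[keep])        # N(0, 1) quantiles
    qq_scale, qq_loc = np.polyfit(q_theory, np.sort(z)[keep], 1)
    ok = (p_value > alpha) and abs(qq_loc) <= 1.0 and 0.5 <= qq_scale <= 1.5
    return ok, p_value, qq_loc, qq_scale
```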

C. STEP 3: TARGETED SEARCH FOR CRITICAL CASES
Given a fully confirmed GP model, the goal of the third step is to search the n-dimensional domain X for regions containing points x that have non-negligible probabilities for Y(x) to cross a given threshold (see MPE definition below) and must be added to the test measurements. These regions are referred to as critical regions and the contained points as critical data points. As X might have a relatively high number of dimensions, a good GP model might need a relatively large initial sample S, which can lead to substantially slower kriging computations. It is therefore essential to establish an algorithm that identifies the critical regions of X efficiently. This requires an algorithm that exploits the available information, as contained in the GP model, to perform its task. The proposed algorithm relies on the delta function δ_p (estimate of the shortest distance from a measured point to points where the conditionally estimated value has a probability p of exceeding a given level; introduced in the Supporting Information).
The maximum permissible error (MPE) is defined in IEC 62209-3 [2] for each validation configuration j as

MPE_j = u_system,j + u_source, (4)

where u_system,j is the manufacturer-declared expanded uncertainty (95 % confidence interval) of the SAR measurement system for the validation configuration j (which may not be larger than 30 % according to [2]), and u_source is the expanded uncertainty of the target psSAR value of the validation antenna (which is set to 15 % in [2] based on a tolerance evaluation of the antennas). By adding u_system,j and u_source directly in (4), IEC 62209-3 treats them as correlated terms, which lowers the validation burden in comparison to treating them as uncorrelated and combining them as root-sum-squared [24]. Each validation configuration j must satisfy

|∆SAR_j| ≤ MPE_j, with ∆SAR_j = (SAR_meas,j − SAR_target,j) / SAR_target,j, (5)

where ∆SAR_j is the error in the measured psSAR, and SAR_meas,j and SAR_target,j are the measured psSAR and the verified numerical target psSAR value, respectively. If |∆SAR_j| > MPE_j for any configuration j, the reported system uncertainty u_system,j is not met. The declared u_system,j can then either be increased (but not beyond 30 %), or the measurement system is declared to have failed the validation. Since the measurement system is expected to perform within its uncertainty bounds everywhere in the configuration space, the inequality in (5) is enforced for all steps: model creation (where ∆SAR_j ≡ S̄), model confirmation (where ∆SAR_j ≡ T̄), and the search for critical cases (here).
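With the linear combination of (4) and the pass criterion of (5), the per-configuration check reduces to a few lines (uncertainties and errors expressed in percent; the helper names are illustrative):

```python
def mpe(u_system, u_source=15.0):
    # Eq. (4): linear (correlated) combination of the expanded uncertainties,
    # both in percent; u_system may not exceed 30 % per IEC 62209-3
    return u_system + u_source

def passes_validation(sar_meas, sar_target, u_system, u_source=15.0):
    # Eq. (5): relative psSAR error in percent, compared against the MPE
    delta_sar = 100.0 * (sar_meas - sar_target) / sar_target
    return abs(delta_sar) <= mpe(u_system, u_source)

# Example: a 10 % error against MPE = 30 % + 15 % = 45 % passes
# passes_validation(2.2, 2.0, 30.0) -> True
```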

Search Algorithm
The proposed search algorithm is motivated by heuristic optimization methods, with an added uncertainty component. Its goal is to identify critical configurations to be remeasured (search for global extrema), rather than to estimate the global probability of exceeding the MPE or to improve the surrogate model. The considerations that went into designing the search algorithm were that it must:
• maximize the chance of detecting configurations with a high likelihood of exceeding the MPE, while minimising the number of required measurements;
• balance the need to spread coverage of the search space against the need to increase the sampling density if either the response surface fluctuates strongly, or the predicted measurement error is close to the MPE such that even small fluctuations could result in a violation of the MPE;
• adhere to the general philosophy of established validation procedures from the currently binding standard (point-by-point evaluation according to a pass-fail criterion);
• converge rapidly and be amenable to efficient implementation.
The function δ_p can be used to build an efficient search algorithm for identifying critical regions in which Y is likely to cross given threshold(s). Let S_0 ⊂ X be a finite, evenly distributed sample over X. Let f be a realization of Y over its domain X = D(f); it can, for example, be a kriging function built from the elements of S_0. Let T−, T+ ∈ R denote two thresholds such that T− < T+ (for SAR measurement system validation, T− = −MPE_j and T+ = +MPE_j, as given in (4)). Given a constant of repulsion q ∈ [0, 1], we have the m-iteration search algorithm 1.
Even though the search algorithm is not strictly speaking an optimization, it can be interpreted as a multi-search variant of a heuristic global optimization algorithm with an uncertainty component that is simultaneously population and trajectory based: at the ith iteration, the search algorithm produces a new sample S_i ⊂ X in such a way that the series (S_i) converges towards a sample that is evenly distributed over (non-connected) critical regions. An element of {S_i} is denoted S*, and T* denotes an element of {T−, T+}, where * acts as a general placeholder. Programmatically, S* can as well be seen as a mutable set equal to S_i at the end of iteration i, and implemented as a sequence of elements S*[j] for j = 0, ..., |S*| − 1. For a small number of iterations m ≥ 1, algorithm 1 returns a new sample S_m in which points of S* have been moved towards local extrema of f in such a way that they are evenly spread throughout the regions that are likely to be close to or beyond the upper and lower thresholds T±. The elements of population S* are moved according to two kinds of forces:
• an attractive force that draws each point towards T*, the closest threshold;
• a mutually repulsive force with coefficient q that serves to spread the elements of S* and quickly decreases with distance. The force on any x_0 is a function of its distance to the set S*\{x_0}. This prevents points from converging towards the same configuration: they should cover regions of interest in a way that maximizes their minimal separation.

Algorithm 1 (Search algorithm): Starting from a sample set S* = S_0, the algorithm iteratively constructs sets S* ∈ (S_k), k = 1, ..., m, by moving the elements of S* so as to evenly cover critical regions with significant probability of exceeding the MPE. [Pseudocode listing of procedure search(S* ⊂ X, f : X → R, T−, T+, δ_p) omitted.]

The choice of p is empirical. It reflects effort-reliability-balance considerations and ensures that the algorithm detects threshold violations with a user-defined sensitivity level. The parameter p must be set according to the degree of smoothness and the rate of variation of f over its domain X: if p is chosen too small, points are overly likely to be classified as potentially crossing the thresholds T±, resulting in too large a number of requested measurements during the targeted search for critical cases. On the other hand, if p is overly large, the crossing condition might not trigger rapidly enough to detect sharp peaks in f, potentially leading to undetected regions.
The parameter q is the constant of repulsion. It is often reasonable to set q = p: with a lower p, a finer search is performed, and thus a repulsive force that decreases rapidly with distance is needed. For coarser searches over wider regions, points should not be too close to each other, and a higher q ensures that the repulsion acts at longer ranges. Setting q = 0 will remove any repulsion between points.
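The behavior of the two forces can be illustrated with a simplified, hypothetical sketch of one iteration on the unit cube: a coordinate-probing move towards the closest threshold, followed by a distance-decaying mutual repulsion of strength q. This is not the actual algorithm 1 (which additionally uses δ_p and the model uncertainty); all names and step sizes below are our own illustrative choices:

```python
import numpy as np

def repulsion(S, q, eps=1e-9):
    # pairwise repulsive displacements, decaying quickly with distance
    disp = np.zeros_like(S)
    for i in range(len(S)):
        diff = S[i] - np.delete(S, i, axis=0)
        dist = np.linalg.norm(diff, axis=1, keepdims=True) + eps
        disp[i] = (q * diff * np.exp(-dist / max(q, eps)) / dist).sum(axis=0)
    return disp

def search_step(S, f, t_minus, t_plus, q, step=0.1):
    """One iteration: move each point towards its closest threshold via
    coordinate probing, then spread the population by mutual repulsion."""
    S_new = S.copy()
    for i, x in enumerate(S):
        t = t_plus if abs(f(x) - t_plus) <= abs(f(x) - t_minus) else t_minus
        best = x.copy()
        for d in range(len(x)):
            for sgn in (-1.0, 1.0):
                cand = x.copy()
                cand[d] = np.clip(cand[d] + sgn * step, 0.0, 1.0)
                if abs(f(cand) - t) < abs(f(best) - t):
                    best = cand
        S_new[i] = best
    return np.clip(S_new + repulsion(S_new, q), 0.0, 1.0)
```

With q = 0 the repulsion vanishes and each point moves monotonically closer to its nearest threshold; a positive q trades some of that progress for better spatial separation of the population.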
A simple, model-agnostic space-filling design (such as LHS) is sufficient to initiate the algorithm. The initial sample S_0 used for this study is described in the Supporting Information.
The algorithm can be adapted in various ways. In the present application, knowing that S* starts as a Latin hypercube, the choice was made to search along each dimension of X, reducing the chances of multiple points colliding too quickly. If S* is instead chosen to be a lattice with high regularity, one should rather use search trajectories that incorporate a random element.
Assuming that f also returns its estimation uncertainty for x ∈ S*, this uncertainty can likewise be exploited during the search.

D. APPLICATIONS
The developed validation methodology has been applied to (1) an analytic example where the underlying response is known, for which we assess the ability of the proposed method to successfully identify critical configurations that exceed given thresholds; (2) a subspace of the system validation of cSAR3D, to gain an understanding of the performance of the proposed approach; and full system validations of (3) cSAR3D and (4) DASY8. In applications (3) and (4), care was taken to ensure that GP model creation was performed independently of the model confirmation and the targeted search for critical cases by involving two different measurement laboratories, LAB1 (BNN, India) and LAB2 (IT'IS Foundation, Switzerland), with different operators and different sets of equipment. GP model creation was performed by LAB1, and GP model confirmation and targeted search for critical cases were performed by LAB2.
The set of equipment used in applications (2)-(4) includes a set of validation antennas (see below), equipment for delivery and monitoring of stable power to the antennas (signal generators, amplifiers, directional couplers, power sensors, power meters, cables, and adapters), and hardware for accurate positioning of the antennas while minimizing influence on the near field distribution (spacers, holders, and masks). These are further described, with minimum specifications to ensure high quality results, in the SAR standards [1], [2].
It is important to emphasize that the goal is not to predict psSAR, but to predict the psSAR measurement error, ∆SAR, and to identify configurations where that error is likely to exceed the MPE tolerance (i.e., to identify validation configurations j where (5) is not met).

Analytic Sine Wave Example
Let S ⊂ X = [0, 1]² be the 2-dimensional LHS sample shown in Fig. 2, and let Y : X → R be the analytic sine-wave process from which the GP model is built. This process is isotropic, and the semivariogram model γ is Gaussian with parameters r_γ = 0.97, s_γ = 0.22, n_γ = 0. Fig. 2 shows this model on the segment line {λ(1, 1) : λ ∈ [0, 1]}. The confidence interval decreases the closer one gets to a known value, and increases where the sampling is scarcer and towards the domain's border.
The search algorithm was applied with varying sensitivities p and iteration counts m, while fixing the thresholds T± and reusing S as the initial sample set, even though a new (and frequently larger) sample set would typically be used.

Dimensions and Range of Parameter Space for cSAR3D and DASY8 Validation
Validating a SAR measurement system requires testing the system using validation field sources placed in close proximity to the phantom surface. A set of standardized validation sources is available to the test lab and the system manufacturer lab. The dimensions of the sampling space must cover all the relevant variables, and each dimension must cover a range sufficient to validate the system for all foreseeable exposure conditions. The measurement accuracy of a SAR measurement system depends on the following SAR parameters (with relevant validation variables in parentheses):

Frequency (f ):
The calibration of the SAR measurement system is frequency-dependent, due to the dispersion of the tissue-simulating medium dielectric parameters and the frequency response of the system components (e.g., amplifiers, filters). Typically, the sensors are calibrated at discrete frequencies and the calibration is interpolated to cover entire bands. To cover the frequency range of IEC 62209-3 (f = 300 MHz - 6 GHz), the validation uses 15 dipole antennas covering 18 frequencies. These are 14 dipoles each operating at a single frequency, labeled D300 (300 MHz), D450, D750, D835, D900, D1450, D1750, D1950, D2300, D2450, D2600, D3700, D4200, and D4600, plus a broadband dipole labeled D5000 that operates at 4 frequencies (5200, 5500, 5600, and 5800 MHz).

Polarization (θ): The system must be able to measure the induced E-field for any polarization. A measurement system might do this using multiple, e.g., orthogonally-oriented, E-field sensors inside the probe that can be combined to isotropically sense arbitrary field orientations. Imperfect isotropy, i.e., angle-of-incidence dependence, affects the measurement uncertainty and needs to be considered in the validation. To cover the combined space, different sources and varying orientations are defined: 1) sources with a dominant polarization parallel to the phantom surface, to test the measurement accuracy of the x and y components; the aforementioned dipole antennas are used for this purpose (see Fig. 3); 2) sources with a dominant polarization normal to the phantom surface, to test the z-component measurement accuracy. Four VPIFAs (vertically-polarized inverted-F antennas, see Fig. 3), operating at a different frequency each and labeled V750 (750 MHz), V835, V1950, and V3700, are designed for this purpose. A PIFA structure [25] is oriented such that the open end is close to the phantom and the short-circuited end is on the opposite end, away from the phantom. This provides capacitive coupling at the phantom surface, resulting in a dominant normal E-field polarization.
The evanescent fields decay rapidly, resulting in a sharp SAR distribution and thus providing a good test of the capability of the measurement system. Antennas should be rotated through all angles θ ∈ [0°, 360°). These n = 8 dimensions (f, θ, s, P_in, PAR, BW, x, and y) are considered a sufficient set for the system validation of general SAR measurement systems, while six dimensions suffice for scanning systems due to their independence of the device location (x, y). Knowledge about the SAR measurement system implementation could be used to further reduce the number of dimensions and of required validation measurements (improper reduction will result in failure at the model validation step). For example, if the scanning system implementation is independent of the rotation of the SAR pattern, the sampling of the rotation angle θ could be reduced or removed. Or, the dimensions of frequency f and distance s could be reduced and/or combined for a high-resolution scanning system that is less sensitive to the SAR pattern. Still, it must be kept in mind that the characteristics of both the measurement system and its reconstruction algorithms must be considered. For this study, the authors chose to maintain the full dimension space to demonstrate device dependence on all parameters. The chosen validation antennas (dipoles, VPIFAs and CPIFA) are described in detail in [2].
Note that each of these dimensions is continuous by nature. Reduction to a finite number of measurable configurations is the result of practical limitations in a) validation hardware (only specific validation antennas and spacers available to test labs, for which target values have been determined, can be used); b) measurement setup (modulated signals must be loaded onto the signal generator); and c) operator ease-of-use (need for manual positioning and orientation of the validation antennas). We examine the case of psSAR measurement error on the (x, y, θ)-subdomain of a cSAR3D device (fixing the other parameters) to assess the ability of the proposed approach to detect systematic measurement deviation patterns despite being agnostic about the measurement system implementation. A single dipole antenna is used at f = 2450 MHz. It is fed at a power level of P_in = 29.2 dBm with a continuous-wave signal (modulation parameters: PAR = 0 dB, BW = 0 Hz). The dipole is placed at a fixed distance of s = 10 mm at 25 randomly selected phantom surface locations L = {(x_i, y_i)}, i ∈ [1, 25]. The rotation dimension was covered using 12 rotation angles A = {j · 15° : j ∈ [0, 11]} at each location (x_i, y_i). Therefore, the (x, y, θ)-subcube consists of 300 points.

For the full cSAR3D system validation, the entire validation procedure was performed, this time covering the complete configuration space described above. A GP model for the full configuration space was constructed at LAB1 from a set of cSAR3D flat phantom measurements.
Initial sample generation: An initial sample S of 400 data points (see Parameter Choices) was LHS-generated with maximization of the minimal distance between any two points, to ensure that S was well spread within X. For the chosen configurations, the 1g-averaged psSAR (SAR_1g) was measured using a cSAR3D device (flat phantom), and the corresponding deviations ΔSAR_1g from the published target values were computed, resulting in the valued sample S.
Outlier detection: Potential outliers were detected and their measurement values double-checked to eliminate any operator errors. An outlier is defined to be any value outside the interval [q_1 − r·q, q_3 + r·q], where q = q_3 − q_1 for q_1, q_3 the first and third quartiles of y(S), and where r is a positive predetermined interquartile-range multiplier, chosen equal to 2 in this case. With r = 2, a few valid values might still be classified as outliers. Outliers are not ignored when applying (5), nor are they excluded from the linear system of equations used for interpolation, but they are ignored during the construction of the final isotropic variogram.
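The interquartile-range rule above can be sketched in a few lines; the deviation values and the flagged point below are illustrative, not measured data:

```python
import numpy as np

def detect_outliers(values, r=2.0):
    """Flag values outside [q1 - r*iqr, q3 + r*iqr].

    r is the interquartile-range multiplier (r = 2 in the text).
    Returns a boolean mask that is True for suspected outliers.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - r * iqr, q3 + r * iqr
    return (values < lower) | (values > upper)

# hypothetical deviation values; only the extreme 2.5 is flagged
y = np.array([0.1, 0.2, 0.15, 0.18, 0.12, 2.5])
mask = detect_outliers(y)
```

Flagged values would then be re-measured before deciding whether to exclude them from the isotropic variogram construction.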
Model creation: The data space was prescaled based on the standard deviations of the known values y(S) = Y(S) along each dimension; i.e., for Y(S)_i the projection of Y(S) on the 1-dimensional R-subvector space generated by the i-th variable, and for s_i the standard deviation of Y(S)_i, the invertible linear map ι_0 is pre-constructed as the corresponding diagonal matrix. Working from ι_0(X) not only normalizes the arbitrary choice of units, but also reduces the conical uncertainties in the construction of the 1-dimensional directional variograms along each dimension (lag pairs are chosen with an angular tolerance, set to 45° in our implementation, around the direction of interest to improve the statistics). A Gaussian theoretical variogram is used for the fitting and, due to the very different characteristic length of each variable's variation, only the range, sill, and nugget need to be determined along the eight directions of the canonical base vectors of the surrounding space R^n. The sill was found to be similar along all directions, so that ι was defined as the composition of ι_0 with the diagonal rescaling by 1/r_i, where r_i is the range of the directional variogram along dimension i. Now that ι(X) is isotropic, an 8-dimensional semivariogram γ on ι(X) can be estimated.
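The two-stage rescaling can be sketched as follows. This is a minimal illustration, assuming the per-dimension scales `stds` (for ι_0) and directional variogram ranges `ranges` have already been estimated; the exact form of the paper's maps may differ:

```python
import numpy as np

def isotropize(X, stds, ranges):
    """Sketch of the anisotropy-removing map: a diagonal prescaling
    (iota_0) by per-dimension scales, followed by division by the
    per-dimension variogram ranges r_i, so that all directional
    variograms in the image space share a common range.
    X holds one configuration per row."""
    X = np.asarray(X, dtype=float)
    return X / (np.asarray(stds) * np.asarray(ranges))
```

After this transformation, a single isotropic semivariogram can be fitted on distances in the rescaled space.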

Model confirmation:
To confirm the model, we first quantify its NRMSE value to assert that the fitted isotropic semivariogram fits the empirical semivariogram sufficiently well, within an acceptance threshold of 25 % (see Parameter Choices). Next, the GP model is tested using measurements taken at LAB2, an independent laboratory with a different operator and different equipment than those involved in the model creation performed at LAB1. An appropriate random test sample T of 50 points is generated and the residuals are obtained from the values in T using (3). Provided T has good randomness, the residuals should be standard normally distributed. The Shapiro-Wilk test, with a p-value well above 0.05, did assert normality, and the QQ-plot of the order statistics of the residuals versus the theoretical standard normal distribution was assessed.
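The residuals check can be sketched with standard scipy routines; the synthetic residuals below stand in for the normalized residuals of (3), and the tolerance values are those given in Parameter Choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.standard_normal(50)  # stand-in for the 50 normalized residuals

# Normality: Shapiro-Wilk test on the test-sample residuals.
p_value = stats.shapiro(residuals).pvalue

# QQ-check: regress the order statistics against the theoretical
# standard-normal quantiles; slope ~ scale, intercept ~ location.
(osm, osr), (slope, intercept, _) = stats.probplot(residuals, dist="norm")

passed = (p_value > 0.05) and (abs(intercept) < 1) and (0.5 <= slope <= 1.5)
```

A model passes confirmation only if both the normality test and the QQ location/scale bounds are satisfied.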
Targeted search for critical cases: The confirmed GP model was used to search the data space for critical regions where psSAR deviations from the target values (ΔSAR_1g) are likely to exceed the MPE. A sample of appropriate size and distribution was LHS-generated and used to initiate the search (Algorithm 1). Eight iterations were performed, and the configurations were returned for which the probability that ΔSAR_1g > MPE is at least 5 %. After the last iteration, snapping to the closest meaningful neighbour was performed (see Methodology and Implementation). These critical cases were then measured to check whether or not the values exceed the MPE.
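The overall shape of such a model-guided search can be sketched as below. This is an illustrative stand-in for Algorithm 1, not the paper's implementation: `predict` is assumed to return the GP mean deviation and standard deviation at each configuration, the acceptance rule is a simple hill-climb on failure probability, and the repulsion and δ-function mechanics of the actual algorithm are omitted:

```python
import numpy as np
from scipy.stats import norm

def targeted_search(predict, bounds, mpe, m=8, n_particles=200, step=0.1, seed=1):
    """Move a random population toward regions of high estimated failure
    probability for m iterations; return the points whose probability of
    exceeding `mpe` is at least 5 %."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    X = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    mu, sigma = predict(X)
    p_fail = 1.0 - norm.cdf((mpe - mu) / np.maximum(sigma, 1e-12))
    for _ in range(m):
        # random proposal, accepted where it increases the failure probability
        cand = np.clip(X + step * (hi - lo) * rng.standard_normal(X.shape), lo, hi)
        mu_c, sigma_c = predict(cand)
        p_c = 1.0 - norm.cdf((mpe - mu_c) / np.maximum(sigma_c, 1e-12))
        better = p_c > p_fail
        X[better] = cand[better]
        p_fail = np.where(better, p_c, p_fail)
    return X[p_fail >= 0.05]
```

With a GP whose mean deviation grows toward one edge of the domain, the surviving particles concentrate near that edge after a few iterations.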
Measurement of critical cases: The identified critical configurations were measured on a cSAR3D flat phantom to determine whether the measurement system passed the validation (i.e., ΔSAR_1g ≤ MPE). This was performed by LAB2 with different equipment and a different operator than those used by LAB1 for the model creation measurements.

DASY8 Validation
For the full DASY8 system validation, as in the case of cSAR3D, a GP model was constructed based on system knowledge and an LHS-generated initial set S. This use of system knowledge does not compromise the validation, as the model confirmation step, which requires the measurement of a test set T, is carried out independently. The model creation step was identical to that of the cSAR3D case, except that a non-zero nugget was introduced as a model parameter, since the nugget was found to be non-negligible compared to the sill. The model confirmation and the targeted search for critical cases were conducted for DASY8 in the same way as for cSAR3D.

E. METHODOLOGY AND IMPLEMENTATION
Selected implementation-specific aspects, such as parameter choices, are discussed here.

Parameter Choices
The proposed procedure involves a number of tuneable parameters. To establish the GP model based on an LHS-distributed initial sample, a size of 400 is usually needed to ensure that 75 % of the 50 bins used in the construction of the empirical semivariogram contain at least 40 lag values. The Gaussian semivariogram model was chosen for the smoothness of the underlying process (a result of the smooth dependence of the measurement physics on variations of the underlying parameters). It also has the advantage of being less sensitive to variance changes at the smallest lags, where the LHS-generated initial sample provides few to no values. A value in the range 10-30 % is typically chosen for the NRMSE tolerance in the fit validation of the model confirmation step; data analysis has shown that a tolerance below that range is too severe. As explained in [23], the Shapiro-Wilk test is best applied to samples that have 20-50 elements, while at least 50 points are recommended for the QQ-plot to be meaningful. The usual tolerance of 5 % is applied; this could easily be increased for more severity; however, normality itself is less important than the scale and location factors of the QQ-test. The QQ-test location and scale tolerances are set at |µ| < 1 and 0.5 ≤ σ ≤ 1.5 after normalization to the standard deviations (3). In order to accept more conservative models, the tolerance for σ is more permissive below than above the ideal 1. The constant of repulsion q in the search algorithm should typically be chosen in the range 0.05-0.2, depending on the expected global smoothness of ΔSAR over the configuration space. A default value of 0.1 has been found to work for the types of measurement systems studied. The number of iterations m in the search algorithm is set to 8, as this proved sufficient for the configuration samples to converge to the critical regions. Finally, the initial search population size is in the range 50-10000; the search algorithm can efficiently handle such a potentially large number of trajectories.

Latin Hypercube Sampling Implementation
In order to efficiently sample S for the initial GP model creation, the choice made was to use Latin Hypercube Sampling (LHS) (see [21] and [22]). For a unified LHS procedure to generate suitable sample sets S of size k for both the model creation and the model confirmation step, the following conditions need to be satisfied:
• the cube I^n = [0, 1]^n is partitioned into the canonical grid of k^n equally sized n-dimensional sub-cubes ('cases'); the elements of S are placed each in their own 'case', such that each row along each of the n dimensions contains exactly one element of S, and such that the minimal Euclidean distance between two occupied 'cases' is maximized;
• each element of S is placed within its 'case' according to a uniform probability distribution.
While the first condition guarantees a good initial sample, the second condition is essential for it to constitute an appropriate test sample. A custom-developed implementation of LHS based on pyDOE [26] was used for this purpose.
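The two conditions can be sketched with a plain LHS generator; the maximin spreading used in the paper would be applied on top of this (e.g., by regenerating and keeping the candidate with the largest minimal pairwise distance):

```python
import numpy as np

def latin_hypercube(k, n, rng=None):
    """Plain Latin Hypercube Sample of k points in [0, 1]^n.

    Along every dimension, each of the k cells contains exactly one point
    (condition 1, without the maximin refinement), and each point is placed
    uniformly at random within its cell (condition 2)."""
    rng = np.random.default_rng(rng)
    sample = np.empty((k, n))
    for j in range(n):
        # one random cell permutation per dimension, uniform jitter inside cells
        cells = rng.permutation(k)
        sample[:, j] = (cells + rng.uniform(size=k)) / k
    return sample
```

Scaling each column to the physical range of its validation variable then yields a candidate initial sample over the configuration space.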

Data Snapping
Not all configurations in domain X are valid (i.e., physically measurable). The general approach is to treat X as a continuous connected subset of R^n, and to derive meaningful values a posteriori (after the last iteration) through snapping to the closest meaningful neighbour. In this way the search algorithm does all its work on X, after which a final kriging round is performed on the snapped configurations. In rare instances, if an element is too far from a meaningful location, it is removed from the resulting critical set. This approach avoids having to treat categorical, discrete, and continuous variables differently in the validation procedure, and greatly simplifies all statistical operations without significantly affecting the results. As for initial sample generation, the source selection is based on the (frequency, distance)-pair. If a frequency is valid for different source types, distance is used as a secondary criterion. When no meaningful source exists, the sample point is ignored.
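A snapping step of this kind can be sketched as below. The source names, weighting constants, and drop threshold are illustrative assumptions, not the paper's actual tables; only the logic (frequency as primary criterion, distance as secondary, drop if too far) follows the text:

```python
import numpy as np

def snap_to_valid(point, valid_sources, max_dist):
    """Snap a continuous (frequency, distance) pair to the nearest valid source.

    `valid_sources` holds (name, frequency_MHz, distance_mm) triples for the
    available antennas. Frequency dominates the weighting and distance breaks
    ties; points farther than `max_dist` from any source return None (dropped)."""
    f, s = point
    best, best_d = None, np.inf
    for name, vf, vs in valid_sources:
        # frequency-dominant weighting; constants are illustrative
        d = abs(f - vf) / 100.0 + 1e-3 * abs(s - vs)
        if d < best_d:
            best, best_d = (name, vf, vs), d
    return best if best_d <= max_dist else None

sources = [("D750", 750, 10), ("D835", 835, 10), ("V750", 750, 7)]
snapped = snap_to_valid((760, 9), sources, max_dist=1.0)  # nearest: D750
```

A final kriging round is then performed at the snapped configurations, as described above.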

Variogram Modeling
All semivariograms were generated using the scikit-gstat 1.0 package from the PyPI repository; all details on the methods used are provided in [19]. The Matheron semivariance estimator given in (7) is used for the empirical variogram construction. The Gaussian semivariogram model defined in (9) was found to be the most effective model. The convex nature of the model at short ranges reduces the fitting uncertainty associated with the typically sparse sampling data at the shortest lags. Most importantly, the monotonicity of the semivariogram model ensures that the δ-function is applicable. The domain is binned such that 75 % of its diameter is partitioned into 25 equally-sized bins for the 1-dimensional directional variograms and into 50 bins for the final isotropic variogram. While zero-nugget models worked for cSAR3D, nugget inclusion proved necessary in the DASY8 case.
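One common parameterization of the Gaussian semivariogram model can be sketched as follows; the exact scaling constants used by scikit-gstat (e.g., its effective-range convention) may differ, so this is an illustration of the model's shape, not a reproduction of (9):

```python
import numpy as np

def gaussian_semivariogram(h, nugget, sill, r):
    """Gaussian semivariogram: nugget + (sill - nugget) * (1 - exp(-(h/r)^2)).

    Monotonically increasing in h and convex near the origin, the two
    properties relied on in the text; gamma(0) = 0 by definition."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + (sill - nugget) * (1.0 - np.exp(-(h / r) ** 2))
    return np.where(h > 0, gamma, 0.0)
```

Fitting amounts to choosing nugget, sill, and range so that this curve tracks the binned empirical semivariances.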

A. ANALYTIC SINE WAVE
The application of the targeted search for critical cases (using T+ = 0.20 and T− = −0.75) for various sensitivities p and iterations m moves the elements of S as illustrated in Fig. 4 and Fig. 5, respectively. The impact of the p value is illustrated in Fig. 4. The fact that δ_p incorporates the relevant semivariogram characteristics allows the search algorithm to remain efficient with a minimal number of iterations (provided the semivariogram model is of sufficient quality). Each of these points comes with its own probability of crossing the thresholds. Fig. 6 shows these probabilities for the lower threshold of T− = −0.75, which is identical to the infimum of f on X.

B. CSAR3D (X , Y , θ)-SUBCUBE
The results from applying the developed procedure to the (x, y, θ)-subdomain of a cSAR3D device provide valuable insights into how regions can be identified that, as a result of the measurement system design (in this case, the arrangement of sensors in the array), have an increased likelihood of exceeding the accuracy limits. The set of measured configurations S = {(x_k, y_k, θ_k, ΔSAR_k)} can be represented by projecting all measurement errors ΔSAR_k onto each dimension; their mean and standard deviation are shown in Fig. 7.
Note that the locations are not as optimally distributed (i.e., locally random, but globally homogeneous) as if they had been LHS-generated. This is compensated by the high number of elements in S and the high regularity of all standard deviations across the board. The data show a smooth geometric anisotropy with low noise. By definition of geometric GP models as given in (11), one can compute an isomorphism ι via which the underlying data space X can be rescaled into an isotropic space ι(X). An isotropic semivariogram γ built on ι(X) is shown in Fig. 8. The NRMSE of less than 12 %, well below the 25 % acceptance limit, indicates that the semivariogram is well suited for probing X. Applying the search algorithm with a very low MPE threshold of 0.3 dB (well below that allowed by the standards; see (4)) using different sensitivity values p results, after only 8 iterations, in the critical samples shown in Fig. 9. Color clustering is apparent in Fig. 9, corresponding to the local extrema along the θ dimension.

C. CSAR3D VALIDATION
The isotropic semivariogram fits the empirical semivariogram with an NRMSE value of about 10 %, well below the acceptance threshold of 25 %. Typically, a good model has an NRMSE in the range of 8 % to 15 %, which is the case here, as shown in Fig. 11. The Shapiro-Wilk test confirms that the residuals are indeed normally distributed, with a p-value of 0.29 > 0.05. As shown in Fig. 12, the linear regression of the QQ-plot of the residuals' order statistics versus the theoretical standard normal distribution has its location and scale well within [−1, 1] and [0.5, 1.5], respectively. The confirmed GP model was then used to search the data space. The critical configurations returned after 8 iterations of the search algorithm with an attributed probability of at least 5 % of exceeding the MPE value are listed in Table 1. These 44 configurations are sorted by decreasing probability of exceeding the MPE (from 18.8 % to 5.1 %). A cluster of cases using the D5000 dipole at a distance s = 25 mm with input power levels around P_in = 10 dBm is evident, as well as two cases using the V750 antenna. This is unsurprising, as the D5000 and V750 antennas have the sharpest SAR distributions among the sets of dipole antennas and VPIFAs, respectively. A sharper SAR distribution results in a larger measurement variability for an array system with a fixed sensor resolution.

Table 1 lists the configurations for which there is at least a 5 % probability (column labeled Fail prob.) that the inequality in (5) is not met. Each configuration is identified by its 8-dimensional coordinate in the validation parameter space (f, P_in, PAR, BW, s, θ, x, y), while Ant. name is the validation antenna operating at the listed f and s combination. The three right-most columns are estimated from the GP model: the GP model error (ΔSAR_1g), the standard deviation of the GP model error (Model error), and the failure probability (Fail prob.). Since the MPE of (4) and (5) is only known after the measurement is performed, it was conservatively estimated from the MPE values reported during model creation.
The 44 identified critical cases were measured in LAB2. The results are shown in Fig. 13. It is observed that (5) is met for all measurements. Thus, this cSAR3D flat phantom has passed all of the validation criteria and is considered to be fully validated for measurement use. Note that this only validates the individual cSAR3D flat phantom and not the class of all cSAR3D flat phantoms, since measurement accuracy is dependent on manufacturing tolerances and calibration quality. Each phantom must be individually validated.

D. DASY8 VALIDATION
The main differences in the results obtained with the DASY8 compared to cSAR3D stem from the fact that the deterministic component of the isotropic variogram is now small compared to the nugget (unlike in the model obtained in the cSAR3D case; see Discussion). The obtained isotropic semivariogram and good fit validation are summarized in Fig. 11. As expected, given the small sill, the NRMSE is larger than in the cSAR3D case, but it is still comfortably within the 25 % acceptance threshold. The residuals validation test results (see Fig. 12) further confirm the GP model. As a result of the prominent noise level (compared to the systematic deviations), the Shapiro-Wilk test p-value is excellent. The slope of the QQ-plot regression is slightly below 1, which means that the model is on the conservative side (the estimated interpolation errors are on average slightly larger than those obtained through measurements); this pushes the search algorithm to use a larger initial sample and is likely to increase the validation effort. No configuration with a probability above 5 % of exceeding the MPE value was found, which did not come as a surprise, considering the precision of the system under validation. In other words, the targeted search for critical cases was performed and no critical points were returned. Therefore, the evaluated DASY8 system (including the specific probe and phantom used) is considered to be validated.

Figure 13: Deviation (dots) of the SAR_1g measured on cSAR3D from the published SAR_1g numerical target for the 44 critical configurations identified in Table 1. All of the results are within the MPE_j limits (black lines), such that the measurement system passes the validation.

A. REVEALING DEVICE-SPECIFIC FAILURE RISK HETEROGENEITY
When analyzing the search algorithm performance for the cSAR3D (x, y, θ)-subcube GP model using intentionally lowered, i.e., stricter, MPEs, clear clustering of the revealed configurations at risk of exceeding the MPE is evident in Fig. 9 and Fig. 10. The clustering around specific spatial locations and angular orientations reveals how a combination of sensor calibration variability and field reconstruction methodology results in a heterogeneous distribution of the measurement error throughout the configuration space. Despite using a validation approach that is device-agnostic, the GP model-based approach is capable of revealing such variations, and of assessing the probability of measurement errors exceeding the tolerances.

B. BEHAVIOR COMPARISON BETWEEN CSAR3D AND DASY8
For the DASY8 system, the deterministic component of the isotropic variogram is almost negligible compared to the nugget. The latter results from thermal noise, amplifier instability, stochastic placement, and other operator inaccuracies. The cSAR3D and DASY8 system nuggets have nearly the same magnitude, indicating that their noise levels are comparable. However, in the cSAR3D case, the nugget represents only a small portion of the sill, indicating that systematic (i.e., reproducible) deviations from the target value cannot be neglected compared to the measurement noise. Importantly, the GP model approach performs correctly in both situations and remains as effective in the presence of a high relative amount of noise as in the almost deterministic cases. This further strengthens the confidence in the general applicability of the developed validation approach.
The possibility of augmenting the GP model quality using manufacturer knowledge, without compromising the trustworthiness of the validation (thanks to the independently performed model confirmation step), could also be demonstrated for the DASY8 system.

C. BENEFITS OF THE DEVELOPED METHODOLOGIES
The developed validation approach offers a range of important and valuable features:
• Validation can readily be performed by an independent party; in fact, different parts of the validation can be performed by different parties, thus maximizing trust in the validation;
• it maximizes the likelihood of detecting potential configurations that violate the MPE limit;
• it incorporates stochastic elements, which ensures comprehensive coverage over time, avoids bias, and prevents preferential calibration for known validation benchmark configurations;
• it is a device-agnostic approach in which no device-specific knowledge is assumed, such that it is ideally suited for harmonizing the divergent standards of scanning- and array-based systems; at the same time, manufacturers are free to use their knowledge to reduce the GP model generation effort;
• it can be performed and repeated with reasonable effort (one day of measurements) by a suitable test lab or the device user;
• it can be implemented in a software tool that can easily be used without any knowledge of the underlying mathematics (see the implementation accessible at [4]).
The proposed validation method has been demonstrated on two different SAR measurement systems. Beyond the benefits for validation purposes, the developed methodologies also offer:
• an efficient approach for exploring high-dimensional parameter spaces with a very small training set that, furthermore, is generated in a non-iterative manner (important in the given application context, as a single measurement session is desired; this is, e.g., not the case for the sophisticated approaches from [27] or [28]);
• a search algorithm that exploits knowledge of the variogram at each iteration for increased computational efficiency; this implies the ability to use large search populations, with no need for elaborate initial designs based on refined models. Increasing the search population allows the detection of even highly local anomalies.

D. COMPARISON TO OTHER APPROACHES
Comparison to reliability theory: The present approach addresses a need that is not typically covered by reliability theory approaches (such as [29], [30], or [31]), where the determination of the reliability of a system relies on the overall probability of failure, expressed as P_f = ∫_X I_F(x) p(x) dx, for F ⊂ X the region of failure, I_F : X → {0, 1} the indicator function of F, and p some probability density function on X.
In our application, we are not interested in the overall failure probability over a space of events. Instead, we want to find those x ∈ X where the deviation y(x) has a high probability q(x) of exceeding the MPE. In an optimal scenario, the failure probability would be expressible as P_f = ∫_X q(x) p(x) dx, for q(x) the probability of the value y(x) violating the MPE.
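Under the Gaussian assumption of the kriging prediction, the pointwise failure probability q(x) follows directly from the GP mean and standard deviation. A minimal one-sided sketch (a two-sided MPE on |y(x)| would add the symmetric term):

```python
from scipy.stats import norm

def failure_probability(mu, sigma, mpe):
    """q(x): probability that the GP-modeled deviation at x exceeds the MPE,
    treating the kriging prediction as Gaussian with mean mu and standard
    deviation sigma: q = 1 - Phi((MPE - mu) / sigma)."""
    return 1.0 - norm.cdf((mpe - mu) / sigma)

q = failure_probability(mu=0.8, sigma=0.4, mpe=1.0)
# (1.0 - 0.8) / 0.4 = 0.5, so q = 1 - Phi(0.5), roughly 0.309
```

Configurations with q(x) ≥ 0.05 are the ones reported by the targeted search.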
In this situation, the failure detection problem could be formulated as an optimization problem, e.g., finding arg max_{x∈X} q(x). This, however, is not suitable for our purpose: as not all configurations x are physically measurable (only a non-dense countable subset of X is), we are interested in regions of failure rather than individual points of failure. We therefore need an algorithm that outputs the subset F ⊂ X whose individual elements have a significant probability of violating the MPE. As explained above, F is typically disconnected, and its connected components can be arbitrarily small. In general, no GP surrogate is guaranteed to successfully model all potential outliers; our geometric GP model assumptions are meant to provide a context in which it is reasonable to assume that a large part of F can be found. In the SAR validation case, data analysis has shown smoothness and good geometricity of the underlying physics, and the present method offers a major improvement over the published IEC 62209-3 standard, which only requires a predefined set of configurations to be validated.
Comparison to Bayesian methods: Bayesian methods typically rely on an acquisition function that is based on the current information provided by the surrogate model. The search process is then guided either to improve the surrogate model or to find more optimal function values. In our application, the modeler is free to use any Bayesian method to iteratively build the GP model within Step 1 of the proposed approach, but from the point of view of a test lab independently executing Step 3, model improvement is not allowed. Because of that, a separate acquisition function such as expected improvement is not necessary; it suffices to apply the search directly to the fixed GP model.
The iterative model improvement of Bayesian methods allows the search to neglect the uncertainty of the covariance function (i.e., variogram fitting) as a component of the interpolation error. Since the surrogate model is fixed and we cannot impose an optimal level of accuracy (unlike iterative approaches), the covariance error may lead to overestimated confidence in the predicted value distributions. With the proposed approach, the incorporation of the variogram RMSE into the resulting estimated error alleviates this problem. The approach embraces the fact that the fitted variogram is an imperfect approximation of the empirical one, accepts a larger fitting tolerance, and adds a safety factor to avoid underestimation of the uncertainty. This manifests in improved capture of the QQ-plot slope.
Since the goal of the method is conservative estimation of failure risk, rather than generation of a highly accurate model of device performance, it strikes a different trade-off between efficiency and accuracy. While Bayesian optimization methods typically rely on reducing uncertainty by optimizing an expected improvement function at each iteration, our approach accepts the estimated model uncertainty as is, and locates the parameter value regions most at risk of failing without re-evaluating the underlying truth or adapting the error prediction. As a result, it scales essentially linearly with the number of trajectories. Even though the required number of trajectories is affected by the dimensionality of the problem, it primarily depends on the global proximity of the model to the failure thresholds. Thus, it will remain small for systems that are unlikely to fail, which allows the present approach to scale favorably with dimension.
Comparison to Monte-Carlo methods: The search algorithm should be fast and efficient, notably to allow for a large initial search population when the system performance is known to be close to the MPE. On the other hand, when performance is far from the MPE, critical regions can become very small, in which case Monte-Carlo methods often suffer from the curse of dimensionality. The δ-function (which plays a role similar to the δ in the classical analytical definitions of continuity and convergence) naturally permits the use of particle velocities that adapt to the likelihood of exceeding the MPE, thus accelerating evolution towards critical regions while reducing the risk of missing (or not converging to) small critical regions in the configuration space.

E. APPLICABILITY TO MODEL VALIDATION
The elements of the presented approach can be applied to diverse forms of model validation (e.g., in data-driven modeling and machine learning), where the GP surrogate Y is used to model an error function e : X → R, knowing the errors at locations S between the true values and the predicted values returned by the model to be tested. In this context, assuming a continuous and ''smooth'' response curve, the residuals validation step (notably the QQ-plot) can be applied to locate the outliers that break the assumption of smoothness and fail to be predicted by the GP surrogate, and the search algorithm returns the regions of the parameter space with critical errors. Outliers and critical cases are then used either to validate the model or to locate the regions on which the original model is to be improved.
In the case of SAR measurement systems, the variance anisotropy is relatively mild, and the linear transformation to an isotropic version of the parameter space does not disfigure Latin hypercube samples into hyperrectangles whose support lengths vary substantially along the different dimensions.
In general, the model uncertainty will increase along overstretched dimensions; the decrease of information along these dimensions increases the uncertainty of the model and, by bringing it closer to the failure threshold, increases the computational time of the search algorithm by requiring a potentially significant increase in the number of trajectories. This may affect the performance of the method, but should not affect the outcome of the validation. The issue can be alleviated by applying sampling techniques that distribute points according to an a priori estimated isotropization.

F. LIMITATIONS AND EXTENSIBILITY
The present approach can readily be extended in view of future evolutions of the measurement standards, e.g., by adapting the configuration space to include other frequencies, modulation schemes, or antennas. Also, if future devices prove to break certain assumptions of the model, or show the need to refine the search, the proposed validation approach remains open for any of its components to be generalized or improved.
Assumptions: The space is geometrically anisotropic and can be made isotropic through a linear transformation; if that is not the case, nested models, multi-fidelity models, or multiple models valid in subspaces can be used. In addition, a zero-drift process is assumed throughout the space (otherwise, universal kriging is needed). Stationarity, monotonic semivariograms, separability, and continuity are assumed (see above). Various generalizations are possible at the cost of increased complexity, which include replacing ordinary kriging by universal kriging, replacing geometric anisotropy by zonal anisotropy, and treating continuous domains with categorical variables.
Operator variability: It is included in the noise and nugget, but is likely operator-specific (operator error might in any case trigger remeasurement if outliers are apparent). Poor device consistency (manufacturing or calibration) will result in failure of the confirmation step, which could be remedied by creating a specific GP model for that device.
Measurement efficiency: The LHS nature of S makes the measurement process slow, as each sample point requires a completely different measurement configuration. However, the user can sort S by antenna type (switching antennas is the most time-consuming task) to improve measurement efficiency. This is an advantage of our approach over an iterative approach.
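As a concrete illustration of sorting S by antenna type, the following sketch (with hypothetical configuration records) counts how many antenna swaps a measurement sequence requires before and after sorting:

```python
# Hypothetical LHS configurations; only the "antenna" field matters here.
S = [
    {"antenna": "D2450", "power_dBm": 10},
    {"antenna": "V750", "power_dBm": 13},
    {"antenna": "D2450", "power_dBm": 7},
    {"antenna": "C2450", "power_dBm": 5},
]

def antenna_switches(seq):
    """Number of times consecutive measurements require swapping antennas."""
    return sum(a["antenna"] != b["antenna"] for a, b in zip(seq, seq[1:]))

# Sorting groups all configurations of one antenna together, so the
# time-consuming swap happens at most once per antenna type.
S_sorted = sorted(S, key=lambda cfg: cfg["antenna"])
```

Because the LHS sample is fixed in advance, reordering it does not bias the coverage of the configuration space, which is what makes this optimization possible.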
Other application-related aspects: The current study focuses on flat phantoms and still needs to be applied to the slightly more complicated situation of head phantoms, where the lack of translational symmetry enlarges the configuration space. Other aspects of system validation are handled outside the GP model of this paper. Simultaneous transmission of multiple signals (as done with 5G signals using carrier aggregation to improve bandwidth, for example) is validated separately. Another example is the validation of the dynamic power control systems of mobile phones, where the signal amplitude is adjusted over time to keep the psSAR below the regulatory limit.
Surrogate modeling approach: GP modeling is a wide field in which more sophisticated approaches for model creation, validation, and exploitation exist than the ones employed here. For instance, it might be preferable to use maximum likelihood or cross-validation to select the covariance parameters, or the negative log predictive density (NLPD) could complement the RMSE in the quality assessment. In fact, other non-GP surrogate modeling approaches could be employed, as long as they are capable of providing conservative variance estimates (the validation step would need to be adapted if ordinary kriging is not used). However, in view of the purpose of this study, namely identifying a practical solution to the specific problem of SAR system validation and to the more general one of efficiently and reliably validating systems with a large configuration space in an unbiased, implementation-agnostic, and independently verifiable manner, the choices in the current study were dictated by the benefits listed above, as well as by the simplicity and ready availability of suitable routines, and are justified by the successful application.

V. CONCLUSION
A general, robust, trustworthy, efficient, and comprehensive validation approach has been developed that can operate agnostically of the technical implementation of the tested device, prevents calibration that favors known validation benchmarks, and is applicable to large configuration spaces. Its applicability, suitability, and strength have been demonstrated through rigorous validation of two technologically completely different SAR measurement system types, a scanning system and an array system, involving two different laboratories for separate steps of the process to illustrate how the approach permits independent verification of the validation.
The proposed approach resolves the current problem of unifying and demonstrating the equivalence of the IEC standards [1] and [2]. At the same time, it establishes a process by which any laboratory or user can verify at any time, and with reasonable effort (less than one day), that a system performs within its reported uncertainty bounds using the commonly available, standardized set of antennas for which target values are specified by the standard. Moreover, the generality of the method ensures that the set of validation antennas can readily be extended or adapted without affecting the method, as long as the exposure space is comprehensively covered.
The proposed validation approach is general (i.e., not specific to SAR measurement systems) and has been made available as a simple software tool [3], [4], so that it can easily be followed despite the complexity of the underlying mathematics. Overall, the procedure will improve the quality, reliability, and reproducibility of the assessment of wireless devices' conformity with safety regulations, which benefits the public, government agencies, and industry alike.
We can now define a geometric GP model to be a commutative diagram endowed with a monotonic isotropic semivariogram γ : R+ → R+ that fits Y ∘ ι on S, where:
• S = {x_i} is a finite subset of X ⊂ R^n;
• Y = {y : X → R} is a geometric, separable, stationary, and weakly continuous Gaussian process with probability space Ω;
• Y ∘ ι is isotropic, with ι an invertible linear map in GL_n(R).
The commutativity of the diagram implies that the process is fully known on S, with Y(x_i, ω) = y(x_i) =: y_i for all x_i ∈ S, ω ∈ Ω, y ∈ Y. Writing S̃ = {(x, Y(x)) : x ∈ S} ⊂ R^n × R, a geometric GP model is characterized by the triple (S̃, ι, γ), which can be used to fully represent the model in a software implementation. Under the geometric GP model assumptions, S̃ is sufficient to build a full model. The approach is as follows:
1) Given S̃, one can build ι by probing spatial dependencies along all possible (1-dimensional) directions in the data space. Under the geometric assumption of a GP model, an iso-variance contour forms an (n − 1)-dimensional ellipsoid in X. This ellipsoid can be mapped to an (n − 1)-sphere via a series of rotations and axis rescalings. The composition of these invertible linear maps on R^n forms ι : R^n → R^n, which can be represented as a matrix in the canonical basis of the data space.
2) Given S̃ and ι, one can compute an empirical semivariogram γ̃ on the isotropic space ι(X), to which a semivariogram γ can be fitted. Most often, the model (e.g., the spherical model) and the binning are set a priori, as they depend on the application field, while the range, sill, and nugget are the result of a fitting algorithm.
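Step 2 above, computing an empirical semivariogram and fitting a parametric model to it, can be sketched as follows. The spherical model is one common a priori choice; the synthetic data, bin count, and least-squares fitting routine are illustrative assumptions rather than the paper's actual settings:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial.distance import pdist

def spherical(h, nugget, sill, rng):
    """Spherical semivariogram: rises from the nugget and saturates
    at the sill once the lag h reaches the range rng."""
    h = np.asarray(h, dtype=float)
    g = nugget + (sill - nugget) * (1.5 * h / rng - 0.5 * (h / rng) ** 3)
    return np.where(h >= rng, sill, g)

def empirical_semivariogram(X, y, n_bins=8):
    """Average half squared differences of y over binned pairwise lags."""
    lags = pdist(X)
    semi = 0.5 * pdist(y.reshape(-1, 1), metric="sqeuclidean")
    edges = np.linspace(0.0, lags.max(), n_bins + 1)
    idx = np.digitize(lags, edges[1:-1])
    h, g = [], []
    for k in range(n_bins):
        m = idx == k
        if m.any():  # skip empty bins
            h.append(lags[m].mean())
            g.append(semi[m].mean())
    return np.array(h), np.array(g)

# Illustrative smooth field sampled at random locations in [0, 1]^2
r = np.random.default_rng(0)
X = r.uniform(size=(60, 2))
y = np.sin(3.0 * X[:, 0]) + 0.05 * r.normal(size=60)
h, g = empirical_semivariogram(X, y)
params, _ = curve_fit(spherical, h, g, p0=[0.0, g.max(), h.max()],
                      bounds=([0.0, 0.0, 1e-6], [np.inf] * 3), maxfev=10000)
nugget, sill, rng_fit = params
```

In practice the fit would be performed on ι-transformed (isotropized) coordinates, as described in step 2.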
A geometric GP model can then be used for spatial inference at any unobserved location via ordinary kriging. The unknown value Y(x) and the (unbiased) ordinary kriging estimator Ŷ(x) are interpreted as random variables located at x. The difference ε(x) := Ŷ(x) − Y(x) is a Gaussian random variable with zero mean, and, for σ²(x) the kriging estimator variance, the normalized variable ε(x)/σ(x) follows the standard normal distribution.
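The ordinary kriging prediction and its variance σ²(x) can be illustrated with a minimal sketch that solves the standard kriging system in semivariogram form; the toy 1-D data and the linear semivariogram are assumptions for demonstration only:

```python
import numpy as np

def ordinary_kriging(X, y, x0, gamma):
    """Ordinary-kriging estimate and variance at x0 from observations
    (X, y) and a fitted semivariogram gamma(h). The extra row/column
    of ones carries the Lagrange multiplier enforcing sum(w) = 1,
    which makes the estimator unbiased."""
    X, y, x0 = map(np.asarray, (X, y, x0))
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = gamma(D)
    A[n, :n] = A[:n, n] = 1.0
    b = np.ones(n + 1)
    b[:n] = gamma(np.linalg.norm(X - x0, axis=-1))
    sol = np.linalg.solve(A, b)
    w, mu = sol[:n], sol[n]
    return float(w @ y), float(w @ b[:n] + mu)  # estimate, variance

# Toy 1-D example with a linear semivariogram gamma(h) = h:
y_hat, var = ordinary_kriging([[0.0], [2.0]], [0.0, 2.0], [1.0], lambda h: h)
```

At an observed location the estimator interpolates exactly and the kriging variance vanishes, which is the property exploited by the model-confirmation step.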

B. THE DELTA FUNCTION
Let T ∈ R be a fixed threshold and M = (S̃, ι, γ) a geometric GP model. For {Y(x) : X → R} the underlying geometric stationary Gaussian process over X ⊂ R^n, known on S = {x_i}, i = 1, …, k, and for x_0 a point in X, a measure needs to be established of how far from x_0 the next candidate point x must be to have a reasonable likelihood that Y(x) crosses the threshold T, expressed in terms of the lag h = x − x_0 ∈ R^n. Along a given direction, γ is invertible on the interval (0, r_γ], with s_γ := γ(r_γ) for r_γ the (effective) range of γ. A continuous monotonic function δ_p : R+ → R+ can then be defined in terms of γ^{-1} and the nugget n_γ of γ. Knowing the values T and y_0, one can choose a parameter p and use δ_p(T − y_0) to find the closest points to x_0 with a probability ≥ p of crossing the threshold T. By construction, the function δ_p encodes all the covariance characteristics carried by the variogram: the parameter p expresses the sensitivity of the model in the assessment of whether or not the true value y(x) is considered to have crossed T.
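Since the displayed definition of δ_p is not reproduced here, the following is only a hedged sketch of the underlying idea: under the intrinsic hypothesis, the increment Y(x) − Y(x_0) has variance 2γ(h), so inverting γ yields the smallest lag at which a crossing of T = y_0 + d reaches probability p. The function name, the argument list, and the omission of the nugget term n_γ are simplifying assumptions; the paper's exact δ_p differs in detail:

```python
from statistics import NormalDist

def delta_p(d, p, gamma_inv, sill, effective_range):
    """Sketch: smallest lag at which a process known to equal y0 at x0
    has probability >= p of crossing the threshold T = y0 + d, assuming
    Y(x) - Y(x0) ~ N(0, 2*gamma(h)). gamma_inv is the inverse of the
    fitted semivariogram on (0, r_gamma]."""
    z = NormalDist().inv_cdf(1.0 - p)   # normal quantile z_{1-p}
    needed = d * d / (2.0 * z * z)      # required semivariance gamma(h)
    if needed >= sill:
        return effective_range          # gamma saturates at the sill
    return gamma_inv(needed)

# With a linear semivariogram gamma(h) = h, gamma_inv is the identity:
lag = delta_p(d=1.0, p=0.1, gamma_inv=lambda t: t,
              sill=10.0, effective_range=10.0)
```

As expected of a δ_p-style measure, the returned lag grows monotonically with the distance d to the threshold and shrinks as the sensitivity p increases.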

Algorithm 1 (excerpt): returns S*, Y*. Each sample point x is driven toward the nearest threshold T±. Once Y(x) crosses that threshold, the velocity of x decreases and the locations of the elements of S* start to accumulate in the regions surrounding nearby extrema of Y. The force is a function of δ_p(|y(x) − T|).

Algorithm 2
Given a sample set S* covering the critical regions and a GP model f, one can use Algorithm 2 to return the two subsets L*, U* ⊂ S* of points that have a probability of at least p of crossing the corresponding thresholds. Filter algorithm: starting from S* and f, the algorithm partitions S* = L* ∪ U* ∪ M* according to each point's p-likelihood of exceeding T−, exceeding T+, or neither, respectively.
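The filter step can be sketched as follows; the candidate labels, posterior moments, and thresholds are hypothetical, and the classification into L*, U*, and M* follows the partition described above:

```python
from statistics import NormalDist

def filter_candidates(points, mean, std, t_lo, t_hi, p):
    """Sketch of the filter: classify each candidate by its
    p-likelihood, under the GP posterior N(mean_i, std_i^2), of
    falling below t_lo (-> L) or above t_hi (-> U); all remaining
    points go to M."""
    Phi = NormalDist().cdf
    L, U, M = [], [], []
    for x, m, s in zip(points, mean, std):
        p_lo = Phi((t_lo - m) / s)        # P(Y < t_lo)
        p_hi = 1.0 - Phi((t_hi - m) / s)  # P(Y > t_hi)
        if p_lo >= p:
            L.append(x)
        elif p_hi >= p:
            U.append(x)
        else:
            M.append(x)
    return L, U, M

# Three hypothetical candidates with posterior means straddling the
# thresholds T- = -0.75 and T+ = +0.75:
L, U, M = filter_candidates(["a", "b", "c"], [-1.0, 0.0, 1.0],
                            [0.1, 0.1, 0.1], -0.75, 0.75, p=0.1)
```

Only the points in L* and U* are carried forward as candidates for remeasurement; M* is discarded as uncritical at sensitivity p.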

FIGURE 2. Latin square sample S ∈ [0, 1]² of size 50. The estimated interpolation Ŷ(x) along the indicated diagonal is shown below: the red curve is the exact noiseless function f, and the blue curve is the predicted interpolation, shown along with the associated 99 % confidence interval.
Orientation (θ): The antenna orientation θ covers [0°, 360°) with respect to the phantom surface normal. This can be reduced to [0°, 180°) or even [0°, 90°) without loss of rigor for antennas having SAR patterns with one-fold (VPIFAs) or two-fold (dipoles) reflectional symmetry, respectively. The range θ ∈ [0°, 180°) with steps of ∆θ = 15° has been used for both antennas for simplicity.
SAR pattern (s): The measurement accuracy can depend strongly on the spatial distribution of the SAR, due to the field reconstruction algorithms, sensor design, sensor spatial resolution, and sensor distance to the phantom surface (which affects the capability to reliably measure rapidly decaying fields). The measurement system must be validated with sources that cover the range of potential SAR distributions from wireless devices. This is validated by using different antenna types and frequencies, as explained earlier. Additionally, dipole antennas are tested at distances from the dipole axis to the phantom lossy medium of s = 5, 10, 15, and 25 mm using spacers (s = 5 mm is excluded at 300, 450, and 750 MHz because of the thicker dipole arms at these frequencies). To test SAR patterns from wireless devices having multiple SAR peaks, a dual-peak antenna is used; it is known as the CPIFA (centrally fed inverted-F antenna) and operates at 2450 MHz (labeled C2450).
SAR level (P_in): The measurement accuracy depends on the SAR level, due to nonlinearities of the SAR measurement system that are compensated through calibration. Therefore, the system needs to be validated across the system's dynamic range. The input power to each antenna, P_in, is varied over a 20 dB range in 1 dB steps such that SAR_1g varies from 1 to 100 W/kg for CW (continuous wave) signals, and from 0.1 to 10 W/kg for modulated signals. Reliability should be tested up to an upper bound of 100 W/kg for the local SAR, which corresponds to the extremity 10 g-averaged SAR limit of 4 W/kg for any SAR distribution. The P_in values depend on the source type, frequency, and distance.
Modulation (PAR, BW): Sensors are calibrated using different modulated signals commonly employed in wireless devices (e.g., 3G, 4G, 5G, and WLAN signals). The signal bandwidth BW (in MHz) is an important signal parameter in the frequency domain, while the peak-to-average ratio (PAR, in dB) describes signal aspects in the time domain. To cover the ranges PAR = 0 to 12 dB and BW = 0 to 100 MHz, a set of 22 common modulations has been selected that includes two 3G signals, ten 4G signals, six 5G signals, and four WLAN signals. An unmodulated (CW) carrier has been included to test the upper end of the dynamic range, and a pulsed CW signal with a 10 ms period and 10 % duty cycle has been included to test pure pulsed signals.
Location (x, y): Array systems must be validated at any x and y locations within the measurement boundaries, because the measured SAR varies due to a) mechanical tolerances and calibration errors of the different sensors in the array, b) measurement variations related to the discrete field sampling by the sensors, and c) the influence of field scattering across the sensor array, which may differ between the center of the array and the array boundaries. During validation, x and y are varied over the full measurement area using a 1 mm resolution to include locations between the sensors.
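A Latin hypercube sample over such a mixed configuration space can be sketched with SciPy's quasi-Monte Carlo module. The selected dimensions and the measurement-area bounds (120 mm by 240 mm) are illustrative assumptions; only the 1 mm location grid and the 15° orientation steps come from the text above:

```python
import numpy as np
from scipy.stats import qmc

# Illustrative dimensions: x (mm), y (mm), theta (deg), P_in offset (dB).
lower = [0.0, 0.0, 0.0, 0.0]
upper = [120.0, 240.0, 180.0, 20.0]

sampler = qmc.LatinHypercube(d=4, seed=0)
S = qmc.scale(sampler.random(n=400), lower, upper)
S[:, :2] = np.round(S[:, :2])                        # snap x, y to 1 mm
S[:, 2] = (15.0 * np.round(S[:, 2] / 15.0)) % 180.0  # snap theta to 15 deg
```

In the actual procedure the sample would also span the categorical dimensions (antenna type, modulation), for which each level is mapped to a sub-interval of the corresponding LHS axis.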

FIGURE 3. Dipole (left), VPIFA (center), and CPIFA (right) validation antennas used in the SAR measurement standards [1], [2]. The dipole and VPIFA test the measurement of parallel and normal polarization with respect to the phantom surface, respectively, while the CPIFA tests the ability to measure multiple peaks.
Fig. 5 illustrates how, after only two iterations, the points have already converged to cover the regions of interest. These figures show the remarkable regularity with which the points are set apart from each other.

FIGURE 4. Results obtained using Algorithm 1 for the targeted search for critical cases. Identified candidate configurations are shown for various values of the sensitivity p (always with m = 8 iterations). A higher p reflects the expectation of a smoother response surface, which permits increasing the spacing between sampling points and searching the configuration space more sparsely.

FIGURE 5. Impact of the iteration number m on the results returned by Algorithm 1. The sensitivity parameter is set to p = 0.1 throughout, with an increasing number of iterations m. Quick convergence up to m = 8, and stabilization of the identified failure candidates beyond m = 8, is observed.

FIGURE 6. Configurations for remeasurement, identified by the complete search procedure (Algorithms 1 and 2). The probability of the samples falling below the lower threshold of T− = −0.75 is indicated.

FIGURE 7. The set S̃ of the measured cSAR3D deviations from the target values, ∆SAR_1g, projected along the different dimensions of the (x, y, θ)-subcube of the configuration space. The bars denote the standard deviations of all elements of S̃ with the same projected parameter value.

FIGURE 8. Empirical (dots) and fitted (line) isotropic semivariograms for the cSAR3D system in the (x, y, θ)-subcube of the configuration space, shown along with the histogram (bars) of the sample lags in the underlying bins.

FIGURE 9. Results obtained for the reduced cSAR3D dataset when applying Algorithm 1 in the targeted search for critical cases. The outcomes for various values of the sensitivity parameter p are shown on the (x, y)-surface, with θ encoded in color.

FIGURE 10. Results obtained for the reduced cSAR3D dataset when applying both algorithms of the targeted search for critical cases. Shown are the outcomes of Algorithm 1 (top; see also Fig. 9) and of Algorithm 2 (middle and bottom) for an error threshold (0.15, 0.2, or 0.25 dB) that is well below the typical MPE of 1.5 dB or more.

FIGURE 11. Isotropic semivariograms of the cSAR3D (blue) and DASY8 (red) GP models, showing the empirical (dots) and fitted (lines) semivariances along with the histograms (bars) of the binned sample lags. The fit quality for the cSAR3D and DASY8 models returned an NRMSE of about 10 % and 15 %, respectively.

FIGURE 12. QQ-plot of the model residuals (dots) and linear fits (lines) for cSAR3D (blue) and DASY8 (red), shown in comparison to the ideal fit (black line). The Shapiro-Wilk test p-value, as well as the location and scale of the linear regression, are within their acceptance ranges, confirming the GP model.

TABLE 1. The 44 identified critical configurations for the cSAR3D system.