A Robustness-Based Confidence Measure for Hybrid System Falsification

Verification of hybrid systems is very challenging, if not impossible, due to their continuous dynamics, which lead to an infinite state space. As a countermeasure, falsification is usually applied to show that a specification does not hold, by searching for a falsifying input, i.e., a counterexample that refutes the specification. A falsification algorithm exploits the quantitative robust semantics of temporal specifications, which provides a numerical robustness value telling how robustly a specification holds or is violated, and uses it to guide the exploration of the input space in the direction of robustness descent; once negative robustness is observed, a falsifying input has been found. However, if a falsification algorithm does not return any falsifying input, the user cannot tell whether the specification does indeed hold, or whether counterexamples exist that the algorithm did not manage to reach. In this case, a measure of how likely it is that no counterexample exists in the input space is necessary for better understanding the safety of the system and for deciding whether more budget should be allocated to falsification. To this end, we propose a confidence measure that assesses the likelihood that the system is not falsifiable, i.e., how confident a user should be that a specification holds, given that an algorithm has sampled a set of inputs but has not found any falsifying one. The confidence measure is defined in terms of a coverage criterion of the input space that assesses to which extent the whole input space is explored and to which extent the local areas where low robustness is observed are exploited. Experiments on commonly used falsification benchmarks show that our proposed confidence measure is reasonable and can distinguish different specifications.


I. INTRODUCTION
Cyber-physical systems (CPSs) combine physical systems that have continuous dynamics with digital controllers that have discrete dynamics: such a combination is referred to as a hybrid system. Formal verification of hybrid systems is challenging, if not impossible, due to their continuous dynamics, which lead to infinite search spaces. Thus, their verification is often turned into a testing problem in terms of falsification, i.e., trying to find a counterexample that violates the specification. Given a model M that takes an input signal u and outputs a signal M(u), and a specification ϕ written as a temporal formula, the falsification problem consists in searching for a falsifying input, i.e., an input signal u such that the corresponding output M(u) violates ϕ.
When a falsification algorithm terminates, it either returns an input violating the specification or a failure result if no such input has been found. In this latter case, we do not know if it is because the specification is really not falsifiable, or the input space (which is our search space) has not been explored enough. Although it is practically impossible to witness the validity of specifications, it would still be helpful for practitioners if there were a surrogate measure that quantifies the likelihood that the specification is indeed unfalsifiable. Let us call such a notion a confidence measure.
One idea to define such a measure is to rely on the notion of coverage, which is widely used in testing as a proxy measure of test sufficiency. Specifically, given a test suite (i.e., the set of test inputs sampled by a falsification algorithm) that does not contain a falsifying input, the coverage ratio can be seen as the degree of confidence provided by the test suite (e.g., the coverage of a uniform partition of the input space [17]). One limitation of such a coverage-based confidence measure is that it is agnostic to the quantitative information carried by robustness values. The following example illustrates the necessity of incorporating such information: imagine we are testing an automotive system [as shown in Fig. 1(a)] through test runs with 15 fixed-length (say 60 s) input signals (i.e., test cases), and the statistical information about the speed in each of the 15 runs is given in Fig. 1(b). As we see, there is only one test case for which the highest speed of the system reaches 79 km/h, and we may be fairly confident in the validity of the specification "for any input, the speed at any moment within the time duration is below the bar of 100 km/h." However, if the bar is 80 km/h instead of 100 km/h, then we will be less confident in its validity, because the highest observed speed is now very close to the bar. Also, suppose we instead require: "after the first 10 s, the speed at any moment within the time duration is above 30 km/h," and, as we see in Fig. 1(b), there are ten different input signals for which the lowest speed during this time period is ≤ 31 km/h. Then, we will be even less confident in the validity of the specification, because we observe "dangerous behaviors" of the system in more test cases.¹ Table I summarizes the situation of the example in terms of robustness. The situation illustrates the case in which we should have different confidence levels for different specifications, even for a fixed test suite.
In other words, a confidence measure should distinguish these specifications in terms of robustness: for example, the implication of the confidence values in Table I would be that the first specification seems indeed unfalsifiable, while the second and the third are potentially falsifiable. To the best of our knowledge, no existing coverage criterion can evaluate these specifications in such a way, because it is agnostic to the robustness of tests.

¹In this argument, we tacitly assume that the size of the test suite is small enough compared to that of the input space. This assumption is necessary: for example, if we perform exhaustive testing and find no falsifying input, then we are absolutely sure of the specification validity no matter how low the minimum robustness is. The assumption is also reasonable, as otherwise the input space would be so small that exhaustive testing is feasible; in that case, we would not need any surrogate measure of confidence.
Contributions: In this article, we propose a coverage-based confidence measure that incorporates the robustness of the test inputs. We claim that the following intuition underpins the argument in the above example: if we observe low robustness for a given input, then we are not confident in the specification validity until we examine the neighborhood of that input in more detail. We integrate this idea into a conventional coverage notion as a nonuniform partition policy of the input space, guided by the robustness values of the inputs in the test suite: inputs with lower robustness require a finer partition of the surrounding area, while inputs with higher robustness allow a coarser one. The confidence level is then given by the proportion of the covered subspaces over the input space, similarly to conventional coverage criteria.
The adequacy of our confidence measure is assessed by a series of experiments. Specifically, we examine the following desired properties of the measure.
1) Ability to Distinguish Specifications: As demonstrated in the example above, our confidence measure is expected to distinguish specifications according to their robustness. Table I suggests that, in order for the confidence measure to achieve such an ability, the following relationships between confidence and robustness are expected for a given test suite: a) if the test suite contains an input with low robustness, then the confidence provided by that test suite is also low; and b) if the test suite contains a large number of inputs with robustness below a given threshold, then the confidence provided by that test suite is low. By comparing 21 different specifications, we verify that our measure exhibits these desired properties.

2) Monotonicity: In conventional notions of coverage, monotonicity is recognized as one of the fundamental desired features [18]. Here, monotonicity of coverage means the following: given two test suites Γ1 and Γ2 such that Γ1 ⊂ Γ2, the coverage ratio provided by Γ2 is higher than that provided by Γ1. We expect our confidence measure to possess this property because it is intended to measure test sufficiency. While there are cases where this property does not hold (e.g., when significantly low robustness is observed in Γ2 \ Γ1), we empirically show that such behavior is exceptional, and our confidence measure exhibits monotonicity in most cases.

In addition to examining these properties, we assess the adequacy of the parameter heuristic we use in the experiments. We also compare the level of confidence provided by several different falsification algorithms.
Paper Structure: Section II reviews related work and Section III introduces the necessary background. Section IV motivates the need for the confidence measure, and Section V formally presents the proposed confidence measure. Then, Section VI describes the design of the experiments conducted to assess the effectiveness of the measure, and Section VII presents the experimental results. Section VIII discusses threats that may affect the validity of the approach and how these have been addressed. Finally, Section IX concludes this article.

II. RELATED WORK
Our confidence measure is computed by partitioning the input space into finitely many subspaces. Covering arrays [19], [20], [21], [22] are a well-known testing notion that allows selecting a small subset of the input space that effectively represents the whole space. Several quantitative extensions of covering arrays are known [20], [21], but they treat the weights of variables as fixed constants that are expected to be given by domain experts. To the best of our knowledge, our notion is the first that dynamically updates the measure over the input space according to the knowledge gained from what has been tested so far.
Following the same line of research as this article, in [23] we proposed a confidence measure based on Gaussian processes (GPs). That approach consists in building a GP as a surrogate model of the robustness function, and using it to estimate the probability that no falsifying input exists. Note that that confidence measure is not directly comparable with the one proposed in this article, so a comparison of the two measures is challenging; indeed, the absence of a ground truth makes it difficult to assess which measure is more trustworthy.
Some coverage criteria have been proposed for STL, which we use as a specification language. Bartocci et al. [24] proposed a technique that transforms the specification into a symbolic automaton and then computes the coverage of the automaton, specifically location and transition coverage. On top of this, they propose a test generation approach aiming to maximize the coverage of the specification. Our confidence measure is based on the coverage of a different set, namely, the input space (see Algorithm 1 and Definition 8). If a test goal (i.e., a location or a transition of the symbolic automaton) is not covered, specification coverage does not tell whether the input space has been covered enough (and so the test goal is infeasible), or some areas have not been explored enough (and so may contain inputs covering the test goal). So, specification coverage does not subsume our confidence measure. Note that our confidence measure does not subsume specification coverage either. Consider this example: we may cover a partition P with a specific input i, while there is another input j in P that is necessary for covering a test goal of specification coverage; in this case, our criterion does not require input j (as it would be redundant), and so it does not guarantee to cover the test goal of the specification.
Matinnejad et al. [25] proposed two diversity metrics for the output signals of a Simulink model, which consider either the whole signals or some specific features. They then propose a test generation algorithm that maximizes the diversity of the outputs. The underlying idea is that observing diverse outputs should guarantee better fault detection; as such, the proposed approach could be used for falsification. However, it is not easy to directly use output diversity to define a coverage criterion, as the possible set of output signals is not known in advance. Note that our confidence measure should correlate with output diversity: indeed, the confidence should be high when the test suite generated for falsification has covered different types of output signals without finding a violation. In a similar way, Dokhanchi et al. [26] proposed coverage metrics on the output signals and use their coverage to guide falsification. So, also in this case, the main difference with our approach is that we consider the input space. Similarly, Dreossi et al. [27] defined a notion of coverage on the output space for guiding falsification.

Algorithm 1 Computation of the Robustness-Based Confidence Measure
Input: a d-dimensional input space X0
Input: a test suite Γ0 generated with a falsification algorithm
Input: the robustness function R : Γ0 → R
Input: a weight function W
 1: if min_{p∈Γ0} R(p) ≤ 0 then
 2:   return 0                                  ▷ a falsifying input exists
 3: end if
 4: 𝕏0 ← COVEREDSUBS(X0, Γ0, R, W)
 5: return vol(∪𝕏0) / vol(X0)
 6: function COVEREDSUBS(X, Γ, R, W)
 7:   𝕏 ← ∅                                     ▷ set of covered subspaces
 8:   if Γ ≠ ∅ then
 9:     w ← max_{p∈Γ} W(R(p))                   ▷ required depth
10:     if w > depth(X) then                    ▷ X needs further partition
11:       Xset ← P(X)                           ▷ new subspaces
12:       for X′ ∈ Xset do
13:         Γ′ ← Γ ∩ X′
14:         𝕏 ← 𝕏 ∪ COVEREDSUBS(X′, Γ′, R, W)
15:     else                                    ▷ X needs no further partition
16:       𝕏 ← {X}
17:   return 𝕏
Other works use coverage in falsification to formalize the degree of exploration. Tuncali et al. [28] generated a combinatorial test suite covering all the combinations (of a given strength) of some discrete partitions of the input space; then, they performed falsification starting from the most promising partition. While their approach also partitions the input space, it differs from ours in several aspects. First of all, the partition is uniform in their approach, while ours is adaptive, depending on the observed robustness values (see the function COVEREDSUBS in Algorithm 1). Since the approach in [28] is meant for test generation, it uses t-wise coverage, which requires few tests for covering all t-wise combinations; in our measure, instead, since we want to provide a trustworthy measure, we check the coverage of all the n-wise partitions, where n is the number of dimensions of the input space.
Adimoolam et al. [7] proposed a sampling guidance technique based on a nonuniform input space partition via robustness. They iteratively split the input space so that the inputs with high and low robustness are separated into different subspaces. Their technique of nonuniform partition is different from ours because it is not intended as a coverage criterion, and it is not clear how it can be used as a coverage criterion.
In our approach, we had to evaluate the effectiveness of the proposed measure, defined in terms of a coverage criterion. In software testing, a classical approach for evaluating the effectiveness of a coverage criterion is to assess its ability to capture faults [29]. Given a test suite T guaranteeing a coverage criterion C for a program P, different faults are injected in P; the mutation score measures the percentage of faults that are killed by T, i.e., the faults for which at least one test in T fails [30]. While the classical mutation score is computed on artificially seeded faults, repositories of real faults have recently been made available. For example, Hemmati [31] evaluated the effectiveness of statement, branch, MC/DC, loop, and all-control-flow coverage on five projects extracted from Defects4J [32], a benchmark of real faults for Java. Different works have tried to assess whether artificial faults are a good approximation of real faults, e.g., [33]. Unlike all these works, we do not need to inject faults, because we can rely on the minimum obtained robustness (which tells how close a specification is to being falsified) and check whether it correlates with confidence.

III. PRELIMINARIES
This section reviews the background on optimization-based falsification.

A. System Model and Specification
We review signal temporal logic (STL) [34] and its robust semantics [1], [2]. Let Var be the set of variables; variables stand for physical quantities, control modes, etc.
Definition 2 (STL Syntax): In STL, atomic propositions and formulas are defined as follows, respectively:

  α ::≡ f(x1, . . . , xn) > 0    and    ϕ ::≡ α | ⊥ | ¬ϕ | ϕ1 ∧ ϕ2 | ϕ1 U_I ϕ2.

Here, f is an n-ary function f : R^n → R, x1, . . . , xn ∈ Var, and I is a closed nonsingular interval in R≥0, i.e., I = [a, b] or [a, ∞), where a < b and a, b ∈ R≥0. Other connectives, like ∨, →, ⊤, □_I (always), and ◇_I (eventually), are introduced as abbreviations: for example, ◇_I ϕ ≡ ⊤ U_I ϕ and □_I ϕ ≡ ¬◇_I ¬ϕ. Atomic formulas of the form f(x1, . . . , xn) < c, where c ∈ R is a constant, are also accommodated by using negation and the function f′(x1, . . . , xn) := c − f(x1, . . . , xn).

Definition 3 (STL Robust Semantics [2]): For an unbounded n-dimensional signal w : R≥0 → R^n and t ∈ R≥0, w^t denotes the t-shift of w, that is, w^t(t′) := w(t + t′). Let w : R≥0 → R^|Var| be a signal, and ϕ be an STL formula. We define the robustness ⟦w, ϕ⟧ ∈ R ∪ {∞, −∞} as follows, by induction; here, ⊓ and ⊔ denote infimums and supremums of real numbers, respectively:

  ⟦w, f(x1, . . . , xn) > 0⟧ := f(w(0)(x1), . . . , w(0)(xn))
  ⟦w, ⊥⟧ := −∞
  ⟦w, ¬ϕ⟧ := −⟦w, ϕ⟧
  ⟦w, ϕ1 ∧ ϕ2⟧ := ⟦w, ϕ1⟧ ⊓ ⟦w, ϕ2⟧
  ⟦w, ϕ1 U_I ϕ2⟧ := ⊔_{t∈I} (⟦w^t, ϕ2⟧ ⊓ ⊓_{t′∈[0,t)} ⟦w^{t′}, ϕ1⟧).

The robustness ⟦w, f(x⃗) > c⟧ stands for the vertical margin f(x⃗) − c for the signal w at time 0. A negative robustness value indicates how far the formula is from being true. The original semantics of STL is Boolean, given by a binary relation |= between signals and formulas. The robust semantics refines the Boolean one, in the sense that ⟦w, ϕ⟧ > 0 implies w |= ϕ, and ⟦w, ϕ⟧ < 0 implies w ⊭ ϕ. Optimization-based falsification via robust semantics hinges on this refinement, as explained in Section III-B.
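For intuition, on a time-discretized signal, the robust semantics of a simple "always" specification reduces to a minimum of pointwise margins. Below is a minimal sketch with hypothetical speed samples; a full STL monitor (e.g., the one in Breach) handles dense time and the complete syntax.

```python
# Robustness of "always_{[0,T]} (speed < 100)" over a sampled signal,
# following the robust semantics: the atomic proposition 100 - speed > 0
# evaluates to its margin at each time point, and "always" (a derived
# connective) takes the infimum of those margins over the horizon.

def rob_atomic(margin):
    """Robustness of an atomic proposition f(x) > 0 at one time point."""
    return margin

def rob_always(margins):
    """Robustness of always(alpha): infimum over the time horizon."""
    return min(rob_atomic(m) for m in margins)

# Speed samples of one simulation run (hypothetical data).
speed = [20.0, 55.0, 79.0, 62.0]

# Spec: speed(t) < 100, i.e., the atomic 100 - speed(t) > 0.
rob = rob_always([100.0 - s for s in speed])
print(rob)  # 21.0: the closest the run comes to violating the bound
```

A positive value (21.0 here) witnesses satisfaction with that margin; a negative value would witness a violation, which is exactly what falsification searches for.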

B. Optimization-Based Falsification
The falsification problem is a search problem to find an input signal u that violates a given STL formula ϕ. In practice, generating such a continuous signal is infeasible. Hence, practitioners adopt parametrized representations that use a finite number of parameters to identify a signal. In this article, we adopt piecewise constant signals as such a representation, defined as follows.
Definition 4 (Piecewise Constant Signal): Given a time horizon T ∈ R>0 and k ∈ N, a piecewise constant signal with k control points is a signal u : [0, T] → R^m such that u(t) = u_i for each t ∈ [(i − 1)T/k, iT/k), where u_1, . . . , u_k ∈ R^m. The hyperparameter k is referred to as the number of control points.
By Definition 4, a piecewise constant signal u is identified by m · k real parameters. As a typical setting, we also assume that the range of u is in a hyperrectangle H = D1 × · · · × Dm, where D1, . . . , Dm ⊆ R are closed intervals. By abuse of notation, we canonically identify the set of piecewise constant signals with k control points and range H with the following (mk)-dimensional hyperrectangle X, which we call an input space:

  X := (D1 × · · · × Dm)^k = H^k.    (1)

The falsification problem is transformed into an optimization problem accordingly: given a system model M, whose input signals are bounded by X, and an STL specification ϕ, the falsification problem can be solved via the following optimization:

  minimize_{u∈X} ⟦M(u), ϕ⟧.

Success in solving the falsification problem is witnessed by the existence of an input u such that ⟦M(u), ϕ⟧ < 0.
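In practice, a falsification tool searches over the m·k parameters and decodes each parameter vector into a signal before simulation. A minimal sketch of this decoding, assuming equal-length control intervals over [0, T] (one common choice; names are ours):

```python
# Decode a parameter vector p (length m*k) into a piecewise constant
# input signal u : [0, T] -> R^m with k control points, where the i-th
# control point holds on the interval [(i-1)T/k, iT/k).

def piecewise_constant(p, m, k, T):
    assert len(p) == m * k
    # control_points[i] is the value of u on the i-th interval
    control_points = [p[i * m:(i + 1) * m] for i in range(k)]

    def u(t):
        i = min(int(t * k / T), k - 1)  # clamp t = T into the last interval
        return control_points[i]
    return u

# Example: m = 2 (throttle, brake), k = 3 control points, T = 30 s
u = piecewise_constant([50, 0, 80, 0, 0, 325], m=2, k=3, T=30)
print(u(5))   # [50, 0]  (first interval, t in [0, 10))
print(u(25))  # [0, 325] (third interval, t in [20, 30])
```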
Many solvers can be used to solve this optimization problem. The most naïve one is uniform random sampling. Improved samplers include metaheuristics-based algorithms, such as hill-climbing optimization algorithms; they suggest a new sample based on the sampling history, with the principle of driving the objective value downward, as in Fig. 2. State-of-the-art falsification tools, such as Breach [15] and S-TaLiRo [6], are mainly built on these samplers (see [14] for a performance comparison). Besides that, a hierarchical sampler that combines Monte Carlo tree search (MCTS) and hill-climbing optimization algorithms is proposed in [16]. MCTS [35] is a reinforcement-learning-based tree search framework: it explores the search space initially, and then maintains an asymmetric tree by biasing search budgets toward "promising" branches. In the following, we call the set of inputs sampled by a given falsification algorithm a test suite. We compare the different algorithms in terms of the confidence measure in Section VII (Fig. 7).
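The naïve sampler above can be sketched in a few lines. The robustness function here is a stand-in; in a real setup each evaluation would run a Simulink simulation and an STL monitor.

```python
import random

def falsify_random(robustness, space, budget, seed=0):
    """Uniform random falsification: sample points of the input space and
    stop as soon as a negative robustness (a falsifying input) is seen.
    Returns (falsifying_input_or_None, test_suite)."""
    rng = random.Random(seed)
    suite = []
    for _ in range(budget):
        p = [rng.uniform(lo, hi) for (lo, hi) in space]
        suite.append(p)
        if robustness(p) < 0:
            return p, suite
    return None, suite

# Stand-in robustness: margin of "output stays below 100" for a toy
# model whose peak output is the mean of the two inputs.
rob = lambda p: 100.0 - (p[0] + p[1]) / 2.0

space = [(0.0, 100.0), (0.0, 100.0)]   # a 2-D input space
falsifier, suite = falsify_random(rob, space, budget=100)
if falsifier is None:
    print("no falsifying input; minimum robustness:",
          min(rob(p) for p in suite))
```

Here the toy specification is in fact unfalsifiable (the margin is never negative on this space), which is precisely the situation the rest of the article addresses: the loop ends with a test suite, a minimum robustness, and no verdict.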

IV. PROBLEM DEFINITION AND MOTIVATING EXAMPLE
In this work, we consider the confidence estimation problem of hybrid system falsification, i.e., when no counterexample is observed by falsification, how likely it is that there exists no such counterexample in the whole input space. Concretely, our problem is the following.

Definition 5 (Confidence Estimation Problem): The confidence estimation problem is defined as follows.
1) Given: a system model M, a test suite Γ for M, and an STL formula ϕ. 2) Return: an estimate of the likelihood that ⟦M(u), ϕ⟧ > 0 holds for any input signal u (i.e., ϕ is unfalsifiable). An answer to the problem is called an (estimated) confidence in the validity of ϕ provided by Γ.
In software testing, such confidence is usually estimated by defining a coverage notion: the higher the coverage a test suite achieves, the higher the confidence a user can obtain. Various coverage definitions have been proposed, e.g., structural coverage criteria, such as statement coverage, branch coverage, and MC/DC, or input space coverage criteria, such as star discrepancy [27]. In our context, the former approach is not feasible due to the black-box nature of our system model; therefore, we focus on coverage of the input space, specifically an input-space-partition-based approach [7]. There, the input space is partitioned into finitely many subspaces, and the coverage ratio of a test suite is defined as the total relative volume of the subspaces that intersect with the test suite.
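For intuition, this baseline, coverage of a uniform grid partition of the input space, can be sketched as follows (illustrative names and data): split each dimension into g cells and count the cells hit by the test suite.

```python
def uniform_coverage(suite, space, g):
    """Coverage ratio of a uniform grid partition: the fraction of the
    g^d cells of the input space containing at least one test input."""
    d = len(space)

    def cell(p):
        # Index of the grid cell containing point p (clamped at the border).
        return tuple(min(int((x - lo) / (hi - lo) * g), g - 1)
                     for x, (lo, hi) in zip(p, space))

    covered = {cell(p) for p in suite}
    return len(covered) / g ** d

space = [(0.0, 40.0), (0.0, 40.0)]
suite = [(5.0, 5.0), (15.0, 5.0), (35.0, 35.0), (36.0, 36.0)]
print(uniform_coverage(suite, space, g=2))  # 0.5: 2 of the 4 cells are hit
```

Note that this ratio depends only on where the inputs lie, never on their robustness; this is the limitation that the nonuniform partition of Section V removes.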
A partition of an input space provides guidance on how to sample from the different subspaces, i.e., it requires sampling an input from each of its subspaces so that full coverage is achieved. Conventionally, the size of the subspaces is uniform, which means that the contribution of any input to covering the input space is the same. However, for confidence estimation in the presence of robustness, it should not be: those regions of the input space where only high robustness values have been observed are not likely to contain counterexamples, while those regions where low robustness values have been observed are suspicious. Therefore, the suspicious regions require more falsification effort (i.e., samples) to ensure that they are sufficiently explored. We formalize such a biased sampling requirement by a nonuniform partition of the input space; the formal definition is introduced in Section V, and an intuitive illustration is given in Fig. 3(e).
The figure shows an example of our nonuniform partition approach. In Fig. 3(a), the x- and y-axes together form an input space X, and the z-axis denotes the robustness of points in X. In Fig. 3(c), the black dots denote a set of samples in X. In Fig. 3(e), the space X is partitioned, and the shadowed areas are those covered by the samples. Note that the partition of X is not uniform: the subspace [20, 40] × [0, 20] (the bottom-right quarter) is partitioned with a finer granularity than [20, 40] × [20, 40] (the top-right quarter), because lower robustness values are observed in the former. As a consequence, [20, 40] × [0, 20] requires more samples than [20, 40] × [20, 40] to be considered covered.

V. ROBUSTNESS-BASED CONFIDENCE MEASURE
We here give the technical details of our confidence measure. We first introduce necessary notions, and then give an algorithm that computes the confidence provided by a given test suite (Algorithm 1).
We fix the following through this section: a system model M that takes input signals u : [0, T] → R m ; the number k of control points and the parameter bound D j (j ∈ {1, . . . , m}) of piecewise constant signals; and an STL formula ϕ over the output space of M. The input space, as defined in (1), is denoted by X 0 . By abuse of notation, an element p of X 0 is also used to denote the corresponding piecewise constant signal u : [0, T] → D 1 × · · · × D m with k control points, and vice versa.

A. Binary Partition
We first introduce the notion of binary partition as the basis of our nonuniform partition. Here, a (finite) partition of a set X is a family of pairwise disjoint subsets of X whose union is X. The binary partition P(X) of a hyperrectangle X is the partition obtained by halving X along each of its dimensions; P^n(X) denotes the partition obtained by applying P iteratively n times.

B. Weight Function
We next introduce a way to construct a nonuniform partition according to a given test suite Γ, i.e., a finite subset of the input space X0 produced by a falsification algorithm. Observe that each point p in Γ is associated with the robustness value R(p) = ⟦M(p), ϕ⟧, i.e., the robustness of the input signal p with respect to the STL specification ϕ. We translate this information into a partition of X0 via a nonincreasing function W : R>0 → N, which we call a weight function.
The intuition behind the definition of W is explained in terms of the depths that subsets of X0 belong to: we say a set X ⊆ X0 is at depth n ∈ N if X ∈ P^n(X0). The weight function W represents the following partition policy of X0: for a given p ∈ Γ, if we observe R(p) = r ∈ R>0, then the subspace in the partition that contains p should be at depth W(r) (or possibly deeper, depending on the robustness of other elements in Γ). The nonincreasing condition is required of W so that it embodies the intuition discussed in Section IV: the closer to zero the observed robustness is, the finer we would like to partition the neighborhood of the corresponding input. Naturally, we define the following inversely proportional weight function as an instance of W.
Definition 7 (Inversely Proportional Weight Function): The function W : R>0 → N, which we call an inversely proportional weight function, is defined by W(x) = ⌊c/x⌋; here, c ∈ R>0 is a positive real constant, and ⌊·⌋ stands for the floor operation. We call c the weight parameter.
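In code, Definition 7 is a floor division; the sketch below (the helper name `make_weight` is ours) makes the nonincreasing behavior explicit.

```python
import math

def make_weight(c):
    """Inversely proportional weight function W(x) = floor(c / x),
    mapping a positive robustness value to a required partition depth."""
    def W(x):
        assert x > 0, "W is defined on R_{>0} only"
        return math.floor(c / x)
    return W

W = make_weight(c=10)
print(W(4))    # 2: low robustness -> finer partition required
print(W(6))    # 1
print(W(100))  # 0: high robustness -> no refinement required
```

Since W is nonincreasing, inputs that come close to violating the specification demand deeper (finer) partitions of their surroundings, while comfortably robust inputs demand none.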
Example 2: Let W be the inversely proportional weight function with weight parameter c = 10. Then, in Fig. 3, we observe that the sample in the subspace X3 = [10, 20) × [20, 30), say p, has robustness R(p) = 4; the weight function W tells us that the subspace in the partition that contains p should be at depth W(4) = ⌊10/4⌋ = 2 or deeper, and indeed, the subspace X3 is at depth 2. Similarly, the sample in the subspace X4 = [0, 10) × [30, 40], say q, has robustness R(q) = 6; W tells us that the subspace in the partition that contains q should be at depth W(6) = ⌊10/6⌋ = 1 or deeper. This time, X4 is at depth 2, which is deeper than the depth required by W; this is due to the stronger requirement posed by another input in the test suite, namely, p.
Remark 1: Users of our confidence measure can determine the weight parameter c according to the size of the system, available computation resources, and/or their domain knowledge. Concretely, users can fix the following parameters.
1) The default partition P n (X 0 ) of the input space, i.e., a partition that looks reasonable for users when the robustness values of the tests are always "high enough." 2) The robustness threshold r ∈ R >0 for the default partition, i.e., the lower bound of robustness that users feel "high enough." Then, they can let c = n · r. By this, the input space is partitioned finer than the default one when the minimum robustness is significantly below r, while the partition remains the default one when the robustness of inputs is always around r.
The parameter n can be decided according to the size of the system and the available computation resources, similarly to other partition-based coverage notions [7], [17]; r can be decided according to the users' domain knowledge. For example, if M models an autonomous driving system and ϕ requires avoiding hitting obstacles, then letting r be around 1-5 m may be sensible. In our experiments, for the sake of a fair comparison, we determine r by the following heuristic: for each model and each specification, we randomly sample a small number of inputs and let r be their average robustness.
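The parameter heuristic of Remark 1 can be sketched as follows. The robustness function here is a stand-in; in practice each of its evaluations would involve a simulation of M.

```python
import random

def weight_parameter(robustness, space, n, n_samples=10, seed=0):
    """Heuristic choice of the weight parameter c = n * r, where r is the
    average robustness of a few randomly sampled inputs and n is the
    depth of the default partition P^n(X0)."""
    rng = random.Random(seed)
    samples = [[rng.uniform(lo, hi) for (lo, hi) in space]
               for _ in range(n_samples)]
    r = sum(robustness(p) for p in samples) / n_samples
    return n * r

# Stand-in robustness: distance of the single input coordinate from 100.
rob = lambda p: abs(100.0 - p[0])
c = weight_parameter(rob, space=[(0.0, 80.0)], n=2)
# With r the mean of |100 - x| for x in [0, 80], c = 2*r lies in [40, 200].
print(c)
```

By construction, if the algorithm later observes robustness values around r, the required depth W(r) = ⌊c/r⌋ = ⌊n·r/r⌋ equals the default depth n, and only significantly lower robustness triggers a finer partition.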

C. Confidence Computation
Now, we introduce an algorithm (see Algorithm 1) to estimate the confidence C(Γ0) in the validity of the specification ϕ provided by the test suite Γ0 generated with some falsification algorithm. Recall that in the confidence estimation problem (Definition 5) we are given the following elements: a system model M, a test suite Γ0, and an STL formula ϕ. From these, we obtain the robustness function R : Γ0 → R defined by R(p) = ⟦M(p), ϕ⟧ for each p ∈ Γ0. We additionally fix: 1) the input space X0 of M that contains Γ0 and 2) a weight function W (e.g., the inversely proportional one in Definition 7). Given these, Algorithm 1 computes our robustness-based confidence measure.
Definition 8 (Robustness-Based Confidence Measure): Let M, Γ0, ϕ, R, X0, and W be as above. The robustness-based confidence (with respect to X0 and W) in the validity of ϕ provided by Γ0 is the value C(Γ0) computed by Algorithm 1 with the inputs X0, Γ0, R, and W.
The high-level idea of Algorithm 1 is to first compute a nonuniform partition of the input space X0 that satisfies the requirement posed by the weight function W, and then let C(Γ0) be the total relative volume of the subspaces in the partition that intersect with (i.e., that are covered by) Γ0. In practice, one can directly compute the set of covered subspaces, as done by the function COVEREDSUBS. A step-by-step explanation of the algorithm is as follows.
1) The algorithm first checks if Γ0 contains a falsifying input (line 1); if there is one, it returns zero confidence (line 2). Otherwise, the partitioning process is invoked by a call to the function COVEREDSUBS with arguments X0, Γ0, R, and W in line 4. 2) The function COVEREDSUBS starts by initializing an empty set 𝕏 (line 7) to collect the covered subspaces. Then, the function checks if Γ (i.e., the set of tests included in X) is empty (line 8): 𝕏 is updated only if Γ is not empty; otherwise, the empty set is returned.

3) If Γ is not empty, the algorithm checks whether the weight function requires a further partition of X or not. Specifically, if the maximum depth w required by W for the elements of Γ (line 9) is larger than the current depth of X (line 10), then X is not sufficiently partitioned, and the algorithm performs further partitioning. To this aim, it identifies the set Xset of new subspaces to be covered (line 11); for each X′ of these subspaces (line 12), it identifies the tests Γ′ in Γ covering X′ (line 13) and performs a recursive invocation of COVEREDSUBS with arguments X′, Γ′, R, and W (line 14). 4) When the space partition is fine-grained enough (i.e., the guard at line 10 is false), no further action is needed, and the recursion stops (line 16). 5) As a result of line 4, a set 𝕏0 of Γ0-covered subspaces of X0 is generated. Lastly, the algorithm lets the confidence C(Γ0) be the relative volume of ∪𝕏0 (line 5).

Example 3: Let the partition in Fig. 3 be the nonuniform partition produced by the following elements: [0, 40] × [0, 40] as the input space, the set of samples in Fig. 3(c) as the test suite, the robustness landscape in Fig. 3(b) as the robustness function, and the inversely proportional weight function with c = 10 as the weight function. Then the confidence is computed as (1/4) + ⋯, i.e., the total relative volume of the covered subspaces.
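A runnable sketch of Algorithm 1 for hyperrectangular input spaces is given below. It follows the structure described above (zero confidence on a falsifying input, recursive partitioning driven by the weight function, confidence as the covered relative volume), under the assumption that the binary partition P(X) halves every dimension; the function and variable names are ours.

```python
import math

def binary_partition(X):
    """P(X): halve every dimension of the hyperrectangle X,
    yielding 2^d subspaces (d = number of dimensions)."""
    halves = [((lo, (lo + hi) / 2), ((lo + hi) / 2, hi)) for (lo, hi) in X]
    subs = [[]]
    for h in halves:
        subs = [s + [piece] for s in subs for piece in h]
    return subs

def covered_subs(X, tests, R, W, depth):
    """Recursively collect the subspaces of X covered by `tests`.
    (A point exactly on a cell boundary may fall into two halves;
    this is harmless for the sketch.)"""
    if not tests:
        return []                        # uncovered: contributes nothing
    w = max(W(R(p)) for p in tests)      # required depth
    if w > depth:                        # X needs further partition
        out = []
        for Xs in binary_partition(X):
            inside = [p for p in tests
                      if all(lo <= x <= hi for x, (lo, hi) in zip(p, Xs))]
            out += covered_subs(Xs, inside, R, W, depth + 1)
        return out
    return [X]                           # deep enough: X is covered

def confidence(X0, suite, R, W):
    """Robustness-based confidence C(Gamma_0), following Algorithm 1."""
    if min(R(p) for p in suite) <= 0:
        return 0.0                       # a falsifying input was found
    vol = lambda X: math.prod(hi - lo for (lo, hi) in X)
    covered = covered_subs(X0, suite, R, W, depth=0)
    return sum(vol(X) for X in covered) / vol(X0)

# Toy run: 2-D input space [0,40]^2, weight function W(x) = floor(10/x).
# The point with robustness 4 forces its quarter down to depth 2 (10x10
# cells), while the point with robustness 20 leaves its quarter at depth 1.
X0 = [(0.0, 40.0), (0.0, 40.0)]
R = lambda p: {(5.0, 5.0): 4.0, (35.0, 35.0): 20.0}[p]
W = lambda x: math.floor(10 / x)
print(confidence(X0, [(5.0, 5.0), (35.0, 35.0)], R, W))  # 0.3125
```

In the toy run, the low-robustness point covers only a 10 × 10 cell (relative volume 1/16) while the high-robustness point covers a full 20 × 20 quarter (1/4), so the confidence is 1/16 + 1/4 = 0.3125, which illustrates how observed low robustness shrinks the credit each sample earns.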

VI. EXPERIMENTS DESIGN
We validate our proposed confidence measure through a series of experiments. In the following, we introduce the research questions and the setup of the experiments conducted to answer them.

A. Research Questions
Below, we report the research questions we identified to assess the proposed measure.
RQ1: Can the confidence measure distinguish different specifications according to their robustness? Table I has shown that, for a given test suite, a confidence measure should give different confidence levels for specifications that have different robustness statistics. In other words, a confidence measure should distinguish different specifications according to robustness. In particular, we argued in Section I that the following relationships between confidence and robustness should be observed: for a given test suite, the confidence in the validity of a specification is lower if a) the minimum robustness among the inputs in the test suite is smaller, or b) there are more inputs in the test suite that have low robustness. To answer this question, we compute the confidence on various specifications provided by randomly sampled, fixed-size test suites and check if the desired relationships are observed.

RQ2: Is the confidence measure monotone? One of the properties we naturally expect our measure to have is monotonicity in the number of test cases: the more nonfalsifying inputs we observe, the higher our measure should be, i.e., the more confident we should be in the validity of the specification. To answer the question, we compute the confidence provided by test suites of different sizes.

RQ3: Does our heuristic give us a reasonable value of the parameter for the weight function? In our experiments, we set the weight parameter of the weight function with reference to the mean robustness of a few random inputs (see Remark 1). This heuristic should provide a reasonable choice of the parameter. To answer the question, we compute confidence using different reference values in addition to the mean value, as described in Section VI-C.

RQ4: What is the confidence level achieved by existing falsification algorithms? Test suites of the same size can give different confidence levels depending on how they are distributed in the input space.
Meanwhile, existing falsification algorithms have several different emphases on exploration and exploitation. We would like to know how such a difference affects the confidence provided by the generated test suites.
To answer this question, we generated test suites by three different algorithms listed in Section VI-D, namely, Random, CMAES, and MCTS-CMAES.

B. Models and Specifications
We selected four Simulink models that are commonly used as benchmarks in the falsification community [14], [36], [37], and the code project F16 aircraft available on GitHub [38]. In total, these five benchmarks introduce 21 STL specifications. Below we provide the details of the benchmarks and the specifications.
1) Automatic transmission (AT) was proposed as a benchmark for falsification in [39]. It has two input signals, throttle ∈ [0, 100] and brake ∈ [0, 325], and several output signals including speed, rpm, gear, etc. Specifications AT{1∼6} state desired safety properties of the system in different aspects.
2) Abstract fuel control (AFC) [40] takes the input signals Pedal_Angle ∈ [8.8, 90] and Engine_Speed ∈ [900, 1100], and outputs the controller mode subject to Pedal_Angle and a ratio μ reflecting the deviation of the air-fuel ratio from its reference value. The ratio μ is desired to be small so that the safety of the car is guaranteed. These properties are formalized by the specifications AFC{1,2}.
3) Neural network controller (NN) [14] is a magnetic levitation system from MathWorks. The safety specification NN1 concerns the position of the magnet with respect to its reference value Ref ∈ [1, 3], with Ref being the only input signal.
4) Free floating robot (FFR) [41] is a robot vehicle powered by four boosters, i.e., the four input signals u_1, u_2, u_3, u_4 ∈ [−10, 10]. It moves in two spatial dimensions x and y, and the goal is to avoid reaching the area surrounding the point (4, 4) at a certain velocity, as stated by the specification FFR1.

5) F16 [38] is inspired by a real F16 Ground Collision Avoidance System, and is now a code project of about 8 400 lines of code (of which more than 3 500 lines concern the main functionality of the system). The system has three initial parameters, namely, roll ∈ [0.2π, 0.2833π], pitch ∈ [−0.4π, −0.35π], and yaw ∈ [−0.375π, −0.125π], and the system requirement is that the aircraft should never hit the ground during the simulation, expressed as F16* in Table II(b).
For AT, AFC, NN, and FFR, we consider two versions of each specification, namely falsifiable and unfalsifiable ones; falsifiable ones are marked with *. The two versions are identical except for the values of some parameters in the specification. F16* is known to be falsifiable, but hard to falsify [14]; note that there is no canonical unfalsifiable modification of it, so we only have its falsifiable version. Concrete descriptions of the specifications are given in Table II. For AT, AFC, NN, and FFR, we set the number of control points to 3; F16 has system parameters rather than time-variant signals as inputs, so there is no need to set its control points.

C. Weight Function and Weight Parameter
We use the inversely proportional weight function (Definition 7) to compute the confidence. We decide the weight parameter c by specifying the default partition P(X_0)^n and the robustness threshold r, as explained in Remark 1 (recall that we let c = n · r). In our experiment, n is chosen such that |P(X_0)|^n = 2^6, and the number of random inputs is 10.
In RQ1, due to the limited numbers of specifications for AFC, NN, FFR, and F16, it is difficult to examine the desired relationships by only looking at the results for a fixed system model. To address this issue, we examine them across all 21 specifications for the five models through a certain normalization. Concretely, for a fixed specification and parameters n and r, we define the normalized robustness R̃(p) of p by R̃(p) = r⁻¹ · R(p). Also, for a given test suite and an input p in it, we say the robustness R(p) of p is low when R̃(p) ≤ 0.5 holds; in such a case, we call p a low robustness witness. In RQ1, we examine the desired relationships between confidence and these normalized statistics. This normalization is justified by observing that r is the threshold of robustness for the default partition P(X_0)^n. In particular, we have W(R(p)) ≥ n + 1 when R(p) is low, and thus the neighborhood of p is partitioned finer than the default partition.
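The normalization above can be sketched as follows. This is our illustrative code, not the paper's implementation; the robustness values and the threshold r are made up.

```python
# Classify low-robustness witnesses via the normalized robustness
# R~(p) = r^-1 * R(p): an input p is a low-robustness witness when
# R~(p) <= 0.5 holds.

def normalized_robustness(robustness: float, r: float) -> float:
    """R~(p) = r^-1 * R(p), with r the robustness threshold."""
    return robustness / r

def low_robustness_witnesses(robustness_values, r):
    """Indices of inputs whose normalized robustness is at most 0.5."""
    return [i for i, rb in enumerate(robustness_values)
            if normalized_robustness(rb, r) <= 0.5]

r = 2.0                       # illustrative threshold (cf. Remark 1)
suite = [3.1, 0.9, 2.5, 0.4]  # illustrative robustness of four sampled inputs
print(low_robustness_witnesses(suite, r))  # [1, 3]: 0.45 and 0.2 are <= 0.5
```

Here the second and fourth inputs are low robustness witnesses, since their normalized robustness values (0.45 and 0.2) fall below the 0.5 threshold.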
In RQ3, the minimum robustness r_min and the maximum robustness r_max of these inputs are also used as values of r, in addition to the mean r_mean. Concrete values of r_min, r_mean, r_max, and n are listed in Table III.
Remark 2: As explained above, we let the parameter n be such that |P(X_0)|^n = 2^6, intending to equalize the size of the default partition across models. However, for the model FFR, no natural number n satisfies the equation: indeed, the coarsest nontrivial partition of its input space, namely P(X_0), already has size 2^12. While we let n = 0.5 for FFR as a provisional solution, we were still unable to make a fair comparison between the results for FFR and the other models. Overcoming this problem is one of our future works; a possible extension of our confidence measure is discussed in Section IX.
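The parameter heuristic can be sketched as follows. Solving |P(X_0)|^n = 2^6 for n gives n = 6 / log2(|P(X_0)|); this is our reading of the setup, and the assumption that the coarsest nontrivial partition P(X_0) bisects each input dimension (so that, e.g., FFR's 12 input dimensions give |P(X_0)| = 2^12, as stated in Remark 2) is ours.

```python
import math

# Choose n so that |P(X0)|^n = 2^6, then set the weight parameter
# c = n * r, with r the mean robustness of a few random inputs
# (cf. Remark 1: c = n * r).

def choose_n(coarsest_partition_size: int, target_size: int = 2**6) -> float:
    """Solve |P(X0)|^n = target_size for n."""
    return math.log2(target_size) / math.log2(coarsest_partition_size)

def weight_parameter(n: float, r: float) -> float:
    """c = n * r."""
    return n * r

# A model with 6 input dimensions (e.g., 2 signals x 3 control points)
# has |P(X0)| = 2^6 under our bisection assumption, so n = 1.
print(choose_n(2**6))   # 1.0

# FFR has 12 input dimensions, so |P(X0)| = 2^12 and no natural number n
# satisfies the equation; the formula yields n = 0.5, as in Remark 2.
print(choose_n(2**12))  # 0.5
```

This makes explicit why FFR is problematic: any input space whose coarsest partition already exceeds 2^6 forces a fractional n.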

D. Falsification Algorithms and the Size of Test Suites
We use the following algorithms, which put different emphasis on exploration and exploitation.
1) Random sampling (Random) is the naïve strategy that generates samples randomly distributed in the search space; it is known as a strategy of pure exploration.
2) Covariance matrix adaptation evolution strategy (CMAES) [42] is a well-known evolution strategy-based stochastic algorithm. It employs a method named covariance matrix adaptation to approximate the objective function and suggest new samples. It is known as exploitation-based.
3) MCTS with CMAES (MCTS-CMAES) [16] aims at escaping the local optimum traps of hill-climbing optimization algorithms such as CMAES. It adds a high-level layer of MCTS to explore the space and suggest promising directions, and employs hill-climbing optimization at the low level for exploitation.
Random is used in all experiments as the default algorithm; the other two are used in RQ4, where we compare the confidence provided by different falsification algorithms. (In our experiment, CMAES and MCTS-CMAES found an input with a lower robustness than Random 85.5% and 92.7% of the time, respectively.) In RQ1, RQ3, and RQ4, the size of the test suites is 500; in RQ2, we generate test suites of different sizes (from 25 to 500, with increments of 25 tests). To account for the randomness of the sampling algorithms, we repeated each experiment five times.
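The pure-exploration baseline can be sketched as follows. This is our minimal illustration, not the paper's implementation; `simulate_and_robustness` is a hypothetical stand-in for the model plus an STL robustness monitor.

```python
import random

# Falsification by random sampling: repeatedly sample an input,
# simulate, compute the robustness of the specification, and stop
# as soon as negative robustness is observed (falsifying input found).

def falsify_random(simulate_and_robustness, bounds, budget, seed=0):
    rng = random.Random(seed)
    best_input, best_rob = None, float("inf")
    for _ in range(budget):
        u = [rng.uniform(lo, hi) for lo, hi in bounds]  # one random input
        rob = simulate_and_robustness(u)
        if rob < best_rob:
            best_input, best_rob = u, rob
        if rob < 0:              # negative robustness refutes the specification
            return u, rob
    return best_input, best_rob  # no counterexample within the budget

# Toy usage with a made-up specification "u stays below 80" for a constant
# input u in [0, 100]; its robustness is 80 - u, and random sampling will
# very likely draw some u > 80 within 100 samples.
u, rob = falsify_random(lambda u: 80.0 - u[0], bounds=[(0.0, 100.0)], budget=100)
```

When no falsifying input is found, the loop returns the lowest-robustness input seen, which is exactly the kind of nonfalsifying test suite statistic our confidence measure consumes.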

E. Experiment Platform
Regarding the implementation of the falsification algorithms, in the experiments we use Breach (version 1.2.13) [15] for CMAES, FalStar [16] for MCTS-CMAES, and a random approach implemented by us [43]. We use MATLAB R2021b for the benchmark simulations. Experiments were executed on a Windows machine with 8.00 GB RAM and an Intel Core i5-8350U 1.70 GHz processor. The code and all the experimental data are available in the online repository [43].

VII. EXPERIMENT RESULTS
In this section, we present the experimental results and answer the research questions introduced in Section VI-A.

RQ1: Can the Confidence Measure Distinguish Different Specifications According To Their Robustness?
The results for this RQ are shown in Table IV and Fig. 4. Table IV shows, for each specification, the mean value of confidence, normalized minimum robustness, and the number of low robustness witnesses. Fig. 4 plots the same data, where each specification is given a different combination of the color and the symbol.
In Fig. 4(b), we clearly observe our expected relationship, namely a negative correlation between confidence and the number of low robustness witnesses, when we exclude FFR (whose singularity is due to the problem explained in Remark 2). As the number of low robustness witnesses increases from 0 to around 230, the confidence decreases from 1 to around 0.35 almost linearly. In Fig. 4(a), we observe the moderate tendency we expected: as the minimum robustness of the inputs in the test suite decreases, the confidence also decreases. Unlike the linear correlation in the case of low robustness witnesses, what we observe here is a transition of the range of the confidence; roughly, the following holds.
1) Confidence is 1 when the minimum robustness is 0.5 or higher.
2) It is around 0.7-0.95 when the minimum robustness is in the interval [0.125, 0.5].
3) It is about 0.7 or lower when the minimum robustness is less than 0.125.
Here, we note that a decrease in robustness affects confidence only when it makes the value of the weight function larger. This observation is coherent with what we witness in Fig. 4(a): 0.5 and 0.125 are the thresholds of robustness at which the value of the inversely proportional weight function changes from 1 to 2 and from 3 to 4, respectively. In particular, when the minimum robustness is above 0.5, the input space is never partitioned finer than the default partition, which in our experiment is a uniform partition of size 2^6 = 64. Almost full confidence is achieved in this case, as 500 random samples are usually more than enough to cover it.
In terms of actual falsifiability, what we observe in Fig. 4 can be understood as follows: if the specification is falsifiable, and there is evidence to doubt its validity, then our measure should tell the user so by giving low confidence. This is yet another desired property of a confidence measure, as otherwise the users would falsely believe that the specification is valid. We note that the falsifiable specifications in our benchmark are not easy to falsify for random sampling; indeed, it was unsuccessful in finding a falsifying input after 2 500 samples in total (5 instances of 500 random samplings). Nevertheless, we observe that the confidence on 5 falsifiable specifications (excluding FFR1*) is below or close to the lowest confidence on unfalsifiable specifications, which could be seen as an estimated threshold for confidence being low. The corresponding data for these specifications are labeled Falsifiable-1 in Fig. 4. For these specifications, we observe that the minimum robustness is below about 0.125 and the number of low robustness witnesses is above about 50. These numbers could be understood as a sufficient level of evidence to doubt the specification validity.
Answer to RQ1: Our confidence measure can distinguish different specifications according to the minimum robustness and the number of low robustness inputs. By this property, our measure gave a low confidence value for 50% of the falsifiable specifications (excluding FFR1*).
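The empirical evidence thresholds above can be distilled into a simple decision rule. This rule is ours, for illustration only; the concrete cutoffs (0.125 and 50) are the approximate values reported for the Falsifiable-1 specifications, not a definition from the paper.

```python
# Flag a test suite as giving reason to doubt the specification's validity
# when the minimum normalized robustness falls below about 0.125, or when
# more than about 50 low-robustness witnesses were observed.

def doubt_validity(min_norm_robustness: float, num_witnesses: int) -> bool:
    return min_norm_robustness < 0.125 or num_witnesses > 50

print(doubt_validity(0.10, 3))    # True: very low minimum robustness
print(doubt_validity(0.40, 120))  # True: many low-robustness witnesses
print(doubt_validity(0.60, 0))    # False: no evidence against validity
```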

RQ2: Is the Confidence Measure Monotone?
The result is shown in Fig. 5. There, blue and green lines show the value for unfalsifiable and falsifiable specifications, respectively. The thin dotted lines show the values for 5 individual test suites, and the thick solid line shows their mean.
We observe that the confidence increases with the test suite size in most cases (again, except for FFR, which is exceptional as explained in Remark 2). In some cases, such as AT1*, AT6, and AT6*, we see a slightly decreasing tendency; this is due to a high frequency of witnessing low robustness (see Table IV), in which case we can be more skeptical about the specification validity as we test more inputs.
Answer to RQ2: Our confidence measure is monotone most of the time. A slightly decreasing behavior is occasionally observed when we frequently witness low robustness.
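The monotonicity check performed for RQ2 can be sketched as follows; the confidence series below is made up for illustration (in the experiment, the suites grow from 25 to 500 tests in steps of 25).

```python
# Check whether a series of confidence values, computed for test suites of
# increasing size, is (approximately) nondecreasing.

def is_monotone_nondecreasing(confidences, tolerance=0.0):
    return all(b >= a - tolerance
               for a, b in zip(confidences, confidences[1:]))

series = [0.62, 0.71, 0.78, 0.78, 0.84, 0.90]  # illustrative values
print(is_monotone_nondecreasing(series))        # True
print(is_monotone_nondecreasing([0.90, 0.85]))  # False: a slight decrease,
                                                # as observed for, e.g., AT6*
```

A small `tolerance` can absorb the slight decreases caused by frequent low-robustness witnesses, if one only wants to flag substantial violations of monotonicity.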

RQ3: Does Our Heuristic Give Us a Reasonable Value of the Parameter for the Weight Function?
The result is given in Fig. 6(a) and (b), where the mean value over the five repetitions is shown by the bars and their distributions by the error bars. Recall that the robustness thresholds given by the minimum, mean, and maximum robustness of ten random inputs are denoted by r_min, r_mean, and r_max, respectively (Table III). First, we observe the tendency that the confidence measure is more "cautious" when a larger parameter is used. This is a natural consequence of the definition of our weight function.
When we use r_min, we observe that the confidence is uniformly close to 1 for all specifications (except for FFR). This means that the confidence measure cannot distinguish test results according to robustness. Also, notice that the difference between r_min and r_mean is moderate: r_min is always in the range of 33%-100% of r_mean. These results suggest that, for the adequacy of our confidence measure, the parameter r should be around r_mean or larger.
By using r_max, we obtain a confidence measure that is more cautious while maintaining its adequacy. This suggests that one can also use a value between r_mean and r_max, depending on the desired quality of assurance; still, we note that using a larger r also increases the cost of computing confidence, whose growth rate is, in the worst case, exponential in r.
Answer to RQ3: Our heuristic estimates a lower bound of the parameter for an adequate confidence measure. Larger values can also be used when the user wishes to obtain stronger assurance.

RQ4: What Is the Confidence Level Achieved by Existing Falsification Algorithms?
The result is given in Fig. 7, where a bar shows the mean value and the error bar shows the distribution over the five repetitions. For this RQ, we only show the results for unfalsifiable specifications, because some of the falsifiable specifications are falsified by CMAES or MCTS-CMAES, in which case the comparison is not possible.
We observe that, in terms of confidence, algorithms with more emphasis on exploration tend to perform better. Random, which is purely exploratory, performs the best in most cases; CMAES, which is exploitation-focused, performs the worst; and the performance of MCTS-CMAES, which balances exploration and exploitation, is between the others.
Answer to RQ4: In terms of confidence, algorithms with more emphasis on exploration tend to perform better.

VIII. THREATS TO VALIDITY
We here discuss the threats that may affect the validity of our approach, following the classical classification into construct, conclusion, internal, and external validity [44].
Construct Validity: One such threat could be that the metrics used to evaluate the confidence measure are not appropriate. We expect the confidence to be lower when we observe low robustness values and, in particular, when several low robustness values are observed; therefore, RQ1 assesses whether the confidence measure inversely correlates with low robustness and with the number of low robustness witnesses. Another expected property is monotonicity, which is assessed in RQ2. The confidence measure depends on a parameter, for whose selection we provide a heuristic; RQ3 assesses whether this heuristic is suitable.
Conclusion Validity: The randomness of the falsification algorithms could affect the results, and so it must be considered in the evaluation. To account for this, we repeated each experiment (i.e., the execution of a given falsification algorithm with a given budget) five times.
Internal Validity: One such threat is that the obtained results are due to other factors. For example, implementation errors could affect the computation of the confidence measure. To mitigate this threat, we have carefully inspected and tested the implementation.
External Validity: A threat of this type could be that the conclusions obtained for the current benchmarks do not generalize to other models and specifications. To mitigate this threat, we have used the benchmarks (four Simulink models and a real code project) that are used in falsification competitions (see Section VI-B) and that are considered representative in the falsification community.

IX. CONCLUSION
This article presented a confidence measure that utilizes the quantitative information provided by the robustness of test inputs. Experiments demonstrated that our measure, which is defined upon a nonuniform partition of the input space, suitably distinguishes different specifications and test suites.
As in Remark 2, we had difficulty applying our confidence measure when the system had a high-dimensional input space. One of our future works is to extend our confidence measure so that it can properly handle such systems. One possibility is to combine our measure with the known extensions of coverage for handling such input spaces, e.g., projection coverage [20].
Measures other than ours could be used to define confidence for falsification, e.g., the recent one based on GP [23], or coverage metrics that do not take robustness into account [7], [27]. However, the absence of a ground truth makes the comparison of these measures difficult. As another future work, we plan to devise some criteria to compare the trustworthiness of these different measures.
In this article, our robustness-based partition policy in Algorithm 1 is only used to compute the confidence. However, as in e.g., [26], coverage criteria can be seen as sampling policies that require sampling uncovered subspaces. It is yet another future work to construct a falsification algorithm out of our proposed partition policy.