Bayesian Experimental Design for Finding Reliable Level Set under Input Uncertainty

In the manufacturing industry, it is often necessary to repeat expensive operational testing of a machine in order to identify the range of input conditions under which the machine operates properly. Since it is often difficult to accurately control the input conditions during actual usage of the machine, there is a need to guarantee the performance of the machine after properly incorporating the possible variation in input conditions. In this paper, we formulate this practical manufacturing scenario as an Input Uncertain Reliable Level Set Estimation (IU-rLSE) problem, and provide an efficient algorithm for solving it. The goal of IU-rLSE is to identify the input range in which outputs smaller/greater than a desired threshold can be obtained with high probability when the input uncertainty is properly taken into consideration. We propose an active learning method to solve the IU-rLSE problem efficiently, theoretically analyze its accuracy and convergence, and illustrate its empirical performance through numerical experiments on artificial and real data.


Introduction
In the manufacturing industry, it is often necessary to repeat operational testing of a machine in order to identify the range of input conditions under which the machine operates properly. When an operational test is expensive, it is desirable to be able to identify the region of appropriate input conditions in as few operational tests as possible. If we regard the operational conditions as inputs and the results of the operational tests as outputs of a black-box function, this problem can be viewed as a type of active learning (AL) problem called Level Set Estimation (LSE). LSE is defined as the problem of identifying the input region in which the outputs of a function are smaller/greater than a certain threshold. In the statistics and machine learning literature, many methods for the LSE problem have been proposed [Bryan et al., 2006, Gotovos et al., 2013, Zanette et al., 2018].
* Nagoya Institute of Technology † RIKEN Center for Advanced Intelligence Project ‡ National Institute for Materials Sciences § email:takeuchi.ichiro@nitech.ac.jp
In practical manufacturing applications, since it is often difficult to accurately control the input conditions during actual usage of the machine, there is a need to guarantee the performance of the machine after properly incorporating the possible variation of input conditions. In this paper, we formulate this practical manufacturing problem as an Input Uncertain Reliable Level Set Estimation (IU-rLSE) problem, and provide an efficient algorithm for solving it. The goal of IU-rLSE is to identify the input region in which the probability of observing an output smaller than a specified threshold is sufficiently large when the input uncertainty is taken into account. Figure 1 illustrates the basic idea of the IU-rLSE problem.
We define the reliability of an input point as the probability of observing outputs smaller than a specified threshold, and the reliable input region as the subset of the input region in which the reliability is greater than a certain probability threshold (e.g., 0.95). Under the assumption that the prior distribution of the true function follows a Gaussian Process (GP), we propose a novel Bayesian experimental design (cf. active learning) method to identify the reliable input region in as few function evaluations as possible, and call the method the IU-rLSE method (with slight abuse of terminology). Specifically, we extend an acquisition function (AF) from the ordinary LSE problem so that the input uncertainty is properly taken into account, and develop a reasonable approximation of the AF, which would otherwise require expensive integral calculations. We theoretically analyze the accuracy and convergence of the proposed IU-rLSE method, and illustrate its numerical performance by applying the method to both synthetic and real datasets.
Related Work. Machine learning problems for black-box functions with high evaluation cost have been studied in the context of active learning (AL) [Settles, 2009]. The problem of finding the global optimal solution of a black-box function is called Bayesian Optimization (BO) [Shahriari et al., 2016]. In BO and related AL problems, a Gaussian Process (GP) model is often used as a nonparametric and flexible model of the black-box function. The GP model was first used for the LSE problem in [Bryan et al., 2006], where the authors proposed an AF based on the Straddle heuristic. Then, [Gotovos et al., 2013] proposed a new AF based on the GP-UCB framework [Srinivas et al., 2010], and proved the convergence of the algorithm. Recently, [Zanette et al., 2018] proposed another AF for the LSE problem based on the expected improvement of classification accuracy. LSE problems also appear in the context of safe BO [Sui et al., 2015, Sui et al., 2018]. Furthermore, [Bogunovic et al., 2016] introduced a unified framework for BO and LSE problems. In order to obtain the predictive distribution of a GP model under input uncertainty, integrals of the GP model over the input distribution are necessary. Integral calculations on GP models have been studied in various contexts [Girard et al., 2003, O'Hagan, 1991, Xi et al., 2018, Gessner et al., 2019]. In the context of AL such as BO, there are some studies dealing with input uncertainty [Beland and Nair, 2017, Oliveira et al., 2019, Inatsu et al., 2019], but none of them consider the same problem setup as ours.
Contribution. Our main contributions in this paper are as follows:
• Assuming a GP model as the prior distribution of the true function f, we formulate the IU-rLSE problem, i.e., the problem of identifying the set of input points at which the probability of observing a response smaller/greater than a certain threshold is sufficiently high under input uncertainty.
• We propose an AL method for IU-rLSE problems. Specifically, we propose a novel AF which can be interpreted as an expected improvement for the IU-rLSE problem. Although a naive implementation of this AF requires huge computational cost, we propose a computational trick to reasonably approximate the expected improvement.
• We show the advantage of the proposed IU-rLSE method both theoretically and empirically. Under reasonable assumptions, we analyze the accuracy and the convergence of the IU-rLSE method, and show that it has desirable properties. Furthermore, we demonstrate the effectiveness of the IU-rLSE method by performing numerical experiments on both synthetic and real data.

Preliminaries
Let $f : D \to \mathbb{R}$ be a black-box function whose function values are expensive to evaluate, where $D$ is a compact subset of $\mathbb{R}^d$. For each input $x \in D$, assume that a function value is observed as $y = f(x) + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ is independent Gaussian noise. Let $\mathcal{X}$ be a finite set of points in $D$. Given a threshold $h \in \mathbb{R}$, the goal of the ordinary level set estimation (LSE) problem [Gotovos et al., 2013] is to identify the set of points $x \in \mathcal{X}$ such that $f(x) \le h$.
In this paper, we consider LSE problems under input uncertainty, which we call Input Uncertain Reliable LSE (IU-rLSE). In IU-rLSE problems, when one aims to evaluate the function $f$ at an input point $x \in \mathcal{X}$, one cannot observe $f(x)$ itself, but instead observes the function value $f(s)$ at a slightly different input point $s \in D$, where $s$ is a realization of a random variable $S(x)$ whose density function is written as $g(s \mid \theta_x)$. We first assume that the density function $g(s \mid \cdot)$ and the parameters $\theta_x$ are both known, and later consider the case where $\theta_x$ is unknown. The goal of IU-rLSE problems is to identify the set of points $x \in \mathcal{X}$ such that the probability $P_{s \sim g(s \mid \theta_x)}(f(s) \le h)$ is sufficiently high. Specifically, for each $x \in \mathcal{X}$ this probability is written as
$$p^*_x = \int_D \mathbb{1}[f(s) \le h]\, g(s \mid \theta_x)\, ds.$$
For a given probability threshold $\alpha \in (0, 1)$, we define an upper set $H$ and a lower set $L$ on a subset $\mathcal{X}$ of $D$ as
$$H = \{x \in \mathcal{X} \mid p^*_x > \alpha\}, \qquad L = \{x \in \mathcal{X} \mid p^*_x \le \alpha\}.$$
The goal of the IU-rLSE problem is to identify $H$ with as few function evaluations as possible. Figure 2 illustrates the basic idea of the reliable input region.

Figure 1: An illustrative example of the IU-rLSE problem. (a) An example of the ordinary LSE problem. The two input points (blue stars) are considered appropriate input points because the corresponding outputs are smaller than the desired threshold h. (b) and (c) Examples of the IU-rLSE problem. In IU-rLSE problems, when a user specifies input points as indicated by the blue stars, the actual inputs are perturbed due to the input uncertainty, and hence the observed outputs also vary, as indicated by the red crosses. In (b), the probability of observing outputs smaller than the threshold h (66%) is not sufficiently high, and thus the input point (blue star) is not considered an appropriate input point when the variability is taken into consideration. On the other hand, in (c), the probability of observing outputs smaller than the threshold h (97%) is sufficiently high, and thus the input point (blue star) is considered an appropriate input point even when the variability is taken into consideration.
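As a rough illustration of the reliability defined above, the following sketch estimates the probability P(f(s) ≤ h) by plain Monte Carlo for a known toy function. The function `f`, the Gaussian input noise in `noisy_input`, the candidate grid, and all numerical values are hypothetical choices made only for this example; in the actual problem f is an expensive black box and cannot be sampled this freely.

```python
import random

def reliability_mc(f, x, sample_input, h, n_samples=5000, seed=0):
    """Monte Carlo estimate of the reliability P(f(S(x)) <= h) at input x."""
    rng = random.Random(seed)
    hits = sum(f(sample_input(x, rng)) <= h for _ in range(n_samples))
    return hits / n_samples

# Hypothetical toy problem (not from the paper): f(s) = s^2, Gaussian input noise.
def f(s):
    return s ** 2

def noisy_input(x, rng):
    return x + rng.gauss(0.0, 0.1)   # one realization of S(x)

candidates = [round(-2.0 + 0.1 * i, 1) for i in range(41)]   # finite candidate set
p_star = {x: reliability_mc(f, x, noisy_input, h=1.0) for x in candidates}

alpha = 0.95
H = [x for x in candidates if p_star[x] > alpha]    # reliable input region
L = [x for x in candidates if p_star[x] <= alpha]   # everything else
```

In this toy setup the noise-free level set contains every candidate with |x| ≤ 1, whereas the reliable region `H` shrinks to the candidates with |x| ≤ 0.8, because points close to the boundary violate the threshold too often once the input is perturbed.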

Gaussian Process
In this paper, to model the unknown function $f$, we assume a Gaussian process (GP) prior $GP(0, k(s, s'))$ for $f$, where $k(s, s') : D \times D \to \mathbb{R}$ is a positive definite kernel. Thus, for any finite set of points $s_1, \ldots, s_t$, the joint distribution of the function values $f(s_1), \ldots, f(s_t)$ is given by $(f(s_1), \ldots, f(s_t))^\top \sim N_t(\mu_t, K_t)$, where $N_t(\mu_t, K_t)$ is a $t$-dimensional normal distribution with mean vector $\mu_t = (0, \ldots, 0)^\top \equiv 0_t$ and covariance matrix $K_t$ whose $(i, j)$th element is $k(s_i, s_j)$. From the properties of GP, the posterior distribution of $f$ after adding the current data $\{(s_j(x_j), y_j)\}_{j=1}^t$ is also a GP. The mean, variance and covariance of the posterior are respectively given by
$$\mu_t(x) = k_t(x)^\top C_t^{-1} y_t, \qquad \sigma_t^2(x) = k_t(x, x), \qquad k_t(x, x') = k(x, x') - k_t(x)^\top C_t^{-1} k_t(x'),$$
where $k_t(x) = (k(s_1(x_1), x), \ldots, k(s_t(x_t), x))^\top$, $C_t = K_t + \sigma^2 I_t$, $y_t = (y_1, \ldots, y_t)^\top$ and $I_t$ is the $t$-dimensional identity matrix.

Figure 2: Three examples of input points and their uncertainties. At each input point, the reliability $p^*_x$ is defined as the probability of observing outputs smaller than the threshold h when the input uncertainty is taken into account. (c) The reliable input region with reliability threshold α is defined as the subset of the input region in which the reliability $p^*_x$ is greater than α (e.g., α = 0.95). The goal of the IU-rLSE problem is to identify the reliable input region in as few function evaluations as possible.
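The standard GP posterior formulas (mean k_t(x)ᵀ C_t⁻¹ y_t and covariance k(x, x') − k_t(x)ᵀ C_t⁻¹ k_t(x')) can be sketched for scalar inputs as follows. This is a minimal standard-library illustration, not the authors' implementation: the helper names `gaussian_kernel`, `solve` and `gp_posterior`, the kernel hyperparameters, and the use of Gaussian elimination instead of a Cholesky factorization are all assumptions made for brevity.

```python
import math

def gaussian_kernel(a, b, sigma_f2=1.0, length=0.5):
    """Gaussian kernel sigma_f2 * exp(-(a - b)^2 / length) for scalar inputs."""
    return sigma_f2 * math.exp(-((a - b) ** 2) / length)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def gp_posterior(s_obs, y_obs, x, x2, noise_var, kernel=gaussian_kernel):
    """Posterior mean at x and posterior covariance between x and x2."""
    # C_t = K_t + sigma^2 I_t, built from the actually observed inputs s_obs
    C = [[kernel(si, sj) + (noise_var if i == j else 0.0)
          for j, sj in enumerate(s_obs)] for i, si in enumerate(s_obs)]
    k_x = [kernel(si, x) for si in s_obs]
    k_x2 = [kernel(si, x2) for si in s_obs]
    mean = sum(k * a for k, a in zip(k_x, solve(C, y_obs)))
    cov = kernel(x, x2) - sum(k * w for k, w in zip(k_x, solve(C, k_x2)))
    return mean, cov
```

Calling `gp_posterior(s_obs, y_obs, x, x, noise_var)` returns the posterior mean and variance at a single point; far from all observations the posterior falls back to the zero-mean prior with variance `sigma_f2`.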

Proposed Method
In this section, we propose an efficient active learning method for IU-rLSE. First of all, we explain the difference between ordinary LSE and IU-rLSE. Figure 3 shows a conceptual diagram comparing LSE and IU-rLSE. In LSE, the purpose is to classify values of the function f . On the other hand, the purpose of IU-rLSE is to classify probabilities that f falls below the threshold h under input uncertainty.
In ordinary LSE, f is modeled by a GP and classified using a credible interval of f(x) [Bryan et al., 2006, Gotovos et al., 2013]. On the other hand, the classification target in our setting is the probability $p^*_x$, so it is inappropriate to assume a GP for it as in previous studies. Furthermore, acquisition functions proposed in previous studies, such as Straddle [Bryan et al., 2006], LSE [Gotovos et al., 2013] and MILE [Zanette et al., 2018], cannot be used directly in our setting. In the following subsections, we propose a modeling method for $p^*_x$ and an efficient acquisition function.

Estimation of H and IU-rLSE
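In this subsection, we propose an estimation method for $H$. The basic idea is to construct a credible interval $Q_t(x)$ for $p^*_x$ and perform classification based on it. First, we assume a GP as the prior distribution of $f$. Then, for each $x \in \mathcal{X}$, we define the random variable $p_{t,x}$, which takes a value in the interval $[0, 1]$, as
$$p_{t,x} = \int_D \mathbb{1}[f(s) \le h]\, g(s \mid \theta_x)\, ds,$$
where $f$ follows the GP posterior at the $t$th trial. The credible interval is defined as
$$Q_t(x) = \left[\mu^{(p)}_t(x) - \beta^{1/2}\gamma_t(x),\ \mu^{(p)}_t(x) + \beta^{1/2}\gamma_t(x)\right],$$
where $\mu^{(p)}_t(x)$ and $\gamma^2_t(x)$ are given by
$$\mu^{(p)}_t(x) = \int_D \Phi\!\left(\frac{h - \mu_t(s)}{\sigma_t(s)}\right) g(s \mid \theta_x)\, ds, \qquad \gamma^2_t(x) = \int_D \Phi\!\left(\frac{h - \mu_t(s)}{\sigma_t(s)}\right)\left\{1 - \Phi\!\left(\frac{h - \mu_t(s)}{\sigma_t(s)}\right)\right\} g(s \mid \theta_x)\, ds.$$
Here, $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. By using the interval $Q_t(x)$, we define the estimated sets $H_t$ and $L_t$ of $H$ and $L$ at the $t$th trial as
$$H_t = \{x \in \mathcal{X} \mid \mu^{(p)}_t(x) - \beta^{1/2}\gamma_t(x) > \alpha - \epsilon\}, \qquad L_t = \{x \in \mathcal{X} \mid \mu^{(p)}_t(x) + \beta^{1/2}\gamma_t(x) < \alpha + \epsilon\},$$
where $\epsilon \ge 0$ is a margin parameter. Moreover, we define the unclassified set $U_t = \mathcal{X} \setminus (H_t \cup L_t)$.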
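The classification step can be sketched as follows, under explicitly assumed forms: the centre of the interval is taken as the average of Φ((h − μ_t(s))/σ_t(s)) over samples s drawn from g(s | θ_x), the squared half-width scale as the average of Φ(1 − Φ) over the same samples, and a point is assigned to H_t when the lower end exceeds α − ε and to L_t when the upper end is below α + ε. The function names, the Monte Carlo averaging, and the tie-breaking order (H before L) are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def credible_interval(post_mean, post_std, h, beta_sqrt):
    """Interval for the reliability at one candidate x.

    post_mean, post_std: GP posterior mean and standard deviation of f at
    M input samples s_1, ..., s_M drawn from g(s | theta_x) (assumed given).
    """
    phi = [STD_NORMAL.cdf((h - m) / sd) for m, sd in zip(post_mean, post_std)]
    mu_p = sum(phi) / len(phi)                                   # centre
    gamma = math.sqrt(sum(p * (1.0 - p) for p in phi) / len(phi))
    return mu_p - beta_sqrt * gamma, mu_p + beta_sqrt * gamma

def classify(lower, upper, alpha, eps=0.0):
    """Return 'H', 'L', or 'U' (unclassified) for one candidate point."""
    if lower > alpha - eps:
        return "H"
    if upper < alpha + eps:
        return "L"
    return "U"
```

A point stays unclassified exactly when the interval still straddles the band around α, which is what drives further sampling near the boundary of the reliable region.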

Acquisition function
In this subsection, we propose an acquisition function to determine the next evaluation point. Our proposed AF is based on the Maximum Improvement for Level-set Estimation (MILE) introduced by [Zanette et al., 2018]. In MILE, the point that maximizes the expected classification improvement after adding one point is taken as the next evaluation point. However, MILE cannot be directly applied under input uncertainty. Therefore, we extend MILE to the setting of this paper, and propose a reasonable approximation. In addition, by combining the proposed AF with random sampling, we show that our proposed algorithm converges with probability 1.

AF based on expected classification improvement and its approximation
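Let $s^*$ be the actually entered point, and let $y^* = f(s^*) + \varepsilon$ be the observed value corresponding to $s^*$. Moreover, let $H_t(s^*, y^*)$ denote the estimated set of $H$ when $(s^*, y^*)$ is added. Then, the expected classification improvement $a_t(x)$ when considering input uncertainty for the point $x \in \mathcal{X}$ is given by
$$a_t(x) = \int_D E_{y^*}\big[|H_t(s^*, y^*)|\big]\, g(s^* \mid \theta_x)\, ds^* - |H_t|, \qquad (5)$$
where the expected value in (5) can be expressed as
$$E_{y^*}\big[|H_t(s^*, y^*)|\big] = \sum_{x' \in \mathcal{X}} \int \mathbb{1}\big[\mu^{(p)}_t(x' \mid s^*, y^*) - \beta^{1/2}\gamma_t(x' \mid s^*, y^*) > \alpha - \epsilon\big]\, p(y^* \mid s^*)\, dy^*, \qquad (6)$$
$$\mu^{(p)}_t(x' \mid s^*, y^*) = \int_D \Phi_{s \mid y^*}\, g(s \mid \theta_{x'})\, ds, \qquad (7)$$
$$\gamma^2_t(x' \mid s^*, y^*) = \int_D \Phi_{s \mid y^*}\,(1 - \Phi_{s \mid y^*})\, g(s \mid \theta_{x'})\, ds. \qquad (8)$$
Here, $\mathbb{1}[\cdot]$ denotes the indicator function, $p(y^* \mid s^*)$ is the density function of $y^*$ corresponding to $s^*$, and we use the notation $\Phi_{s \mid y^*} = \Phi\!\left(\frac{h - \mu_t(s \mid s^*, y^*)}{\sigma_t(s \mid s^*)}\right)$. Furthermore, $\mu_t(s \mid s^*, y^*)$ and $\sigma^2_t(s \mid s^*)$ are the posterior mean and variance of $f(s)$ after adding $(s^*, y^*)$.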
Next, we consider the computational cost of $a_t(x)$. From (5)-(8), in order to calculate $a_t(x)$, it is necessary to perform integration three times. When each integral is approximated by $M$ samples, the computational cost of $a_t(x)$ is $O(|\mathcal{X}| M^3)$. Since this cost is not realistic, we propose a reasonable approximation of $a_t(x)$. To this end, we approximate (7) and (8) using a single representative point $\bar{s}$, where $\bar{s}$ is the expected value of $s$ with respect to $g(s \mid \theta_x)$, and we use the notation $\Phi_{\bar{s}} = \Phi\!\left(\frac{h - \mu_t(\bar{s} \mid s^*, y^*)}{\sigma_t(\bar{s} \mid s^*)}\right)$. Hence, (6) can be approximated by (9), and the inequality in the indicator function in (9) can be rewritten as a condition on $\Phi_{\bar{s}}$ alone (details are given in Appendix A.3).

Moreover, the posterior mean $\mu_t(s \mid s^*, y^*)$ can be written as follows (see, e.g., [Rasmussen and Williams, 2006]):
$$\mu_t(s \mid s^*, y^*) = \mu_t(s) + \frac{k_t(s, s^*)}{\sigma^2_t(s^*) + \sigma^2}\,\big(y^* - \mu_t(s^*)\big).$$
Thus, since $\mu_t(s \mid s^*, y^*)$ is a linear function of $y^*$, the inequality in the indicator function in (9) can also be written as a linear inequality in $y^*$. Hence, because $p(y^* \mid s^*)$ is the density function of a normal distribution, the integral in (9) can be evaluated analytically by using the cdf of the standard normal distribution (details are given in Appendix A.4). From the above discussion, we propose the approximate AF $\hat{a}_t(x)$ given by (10) and (11). Since (10) has only one integral, the computational cost of (10) is $O(|\mathcal{X}| M)$. However, the approximation accuracy of $\hat{a}_t(x)$ is not necessarily good, because $\hat{a}_t(x)$ considers only the classification of $S$. As IU-rLSE progresses and the posterior variances of $f$ at the points in $S$ are reduced sufficiently, all points in $S$ are classified, and it is expected that $\hat{a}_t(x)$ will not work well after this. To avoid this problem, we determine $S$ adaptively at each trial.

Figure 3: Conceptual comparison of ordinary LSE and IU-rLSE. Ordinary LSE identifies points where the function f is below the threshold h, but IU-rLSE identifies points where the probability $p^*_x$ introduced by input uncertainty is above the threshold α. As a result, the points classified by IU-rLSE (green area) differ from those of ordinary LSE due to input uncertainty. Moreover, as shown in the right-hand panels of the upper row, in ordinary LSE, f is modeled by a GP, and classification is performed based on credible intervals of f. On the other hand, in IU-rLSE, it is necessary to construct credible intervals of $p^*_x$ appropriately.
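The reason the integral over y* can be done in closed form can be checked numerically. Because the one-point posterior mean update is linear in y*, and y* is normal with mean μ_t(s*) and variance σ_t²(s*) + σ², the updated mean is itself normally distributed, so the probability that it falls below any fixed level c is a single normal cdf. The sketch below illustrates only this generic fact; the level `c` stands in for whatever threshold the rewritten indicator produces, and the function names and argument names are hypothetical.

```python
import math
from statistics import NormalDist

def updated_mean(mu_s, mu_star, k_cross, var_star, noise_var, y_star):
    """Posterior mean of f(s) after additionally observing (s*, y*)."""
    return mu_s + k_cross / (var_star + noise_var) * (y_star - mu_star)

def prob_updated_mean_below(c, mu_s, k_cross, var_star, noise_var):
    """P(updated mean <= c) when y* ~ N(mu_star, var_star + noise_var).

    mu_s:     current posterior mean of f(s)
    k_cross:  current posterior covariance between f(s) and f(s*)
    var_star: current posterior variance of f(s*)
    """
    if k_cross == 0.0:
        # The new observation carries no information about f(s).
        return 1.0 if mu_s <= c else 0.0
    # Standard deviation of the updated mean, viewed as a function of y*.
    scale = abs(k_cross) / math.sqrt(var_star + noise_var)
    return NormalDist().cdf((c - mu_s) / scale)
```

Only the magnitude of `k_cross` matters here because the normal distribution is symmetric, which is the same symmetry argument invoked in the appendix.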
For each trial $t$, the set $S_t$ is defined through the points $\bar{s}_x$, where $\bar{s}_x$ is the point which maximizes the integrand in $\gamma^2_t(x \mid s^*, y^*)$. The pseudo code of our proposed method is shown in Algorithm 1 (Proposed LSE), which takes the initial training data and the GP prior as input. In the proposed method, for each trial $t$, with probability $1 - p_t$ we select $x \in \mathcal{X}$ based on $\hat{a}_t(x)$, and otherwise we select $x \in \mathcal{X}$ uniformly at random. Here, $B(p_t)$ in Algorithm 1 is the Bernoulli distribution with parameter $p_t$.
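The selection rule described above (a Bernoulli draw deciding between the acquisition function and uniform sampling) can be sketched as follows. The function name `select_next_point` and the callable `acquisition` are hypothetical; the acquisition values would come from the approximate AF in the actual method.

```python
import random

def select_next_point(candidates, acquisition, p_t, rng):
    """Pick the next input to query.

    With probability p_t a candidate is drawn uniformly at random;
    otherwise the candidate maximizing the acquisition function is chosen.
    """
    if rng.random() < p_t:            # Bernoulli(p_t) draw
        return rng.choice(candidates)
    return max(candidates, key=acquisition)
```

Setting `p_t = 0` recovers a purely greedy rule; the random branch is what guarantees that every candidate keeps a chance of being revisited.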

Unknown input distribution
In this subsection, we consider the case where the density function $g(s \mid \theta_x)$ is unknown. In this case, it is necessary to estimate it during the trials. One natural approach is to assume a certain functional form for $g(s \mid \theta_x)$ and estimate the unknown parameters $\theta_x$. Nonetheless, parameter estimation is still difficult if we assume a different $\theta_x$ for each point $x \in \mathcal{X}$. For this reason, we assume that $\theta_x$ can be separated as $\theta_x = (\tilde{\theta}_x, \xi)$, where $\tilde{\theta}_x$ and $\xi$ are respectively known and unknown parameters. Then, assuming a prior distribution $\pi(\xi)$ for $\xi$, $g(s \mid \theta_x)$ can be estimated using the posterior distribution $\pi_t(\xi)$ after data observation as follows:
$$g_t(s \mid \theta_x) = \int g(s \mid \tilde{\theta}_x, \xi)\, \pi_t(\xi)\, d\xi. \qquad (12)$$
Therefore, based on (12), we can compute (4), (10), and (11).

Theoretical Result
In this section, we present two theorems on accuracy and convergence. First, for each point $x \in \mathcal{X}$, we define the misclassification loss $e_\alpha(x)$ as
$$e_\alpha(x) = \begin{cases} \max\{0,\ p^*_x - \alpha\} & \text{if } x \in L_T, \\ \max\{0,\ \alpha - p^*_x\} & \text{if } x \in H_T, \end{cases}$$
where $T$ denotes the trial at which the algorithm finishes. Then, the following theorem holds for classification accuracy:
Theorem 4.1. For any $\alpha \in (0, 1)$, $\delta \in (0, 1)$ and $\epsilon > 0$, if $\beta^{1/2} = (\delta/|\mathcal{X}|)^{-1/2}$, then with probability at least $1 - \delta$ the misclassification loss is less than $\epsilon$ when the algorithm has finished. That is, the following inequality holds:
$$P\Big(\max_{x \in \mathcal{X}} e_\alpha(x) \le \epsilon\Big) \ge 1 - \delta.$$
The proof is given in Appendix B.
The next theorem states the convergence property of the proposed IU-rLSE method. Unlike in the ordinary LSE problem, the convergence of IU-rLSE is non-trivial since one cannot evaluate the function at the desired input points. Therefore, we conduct a careful probabilistic analysis of the convergence. Theorem 4.2 gives a probabilistic evaluation of the convergence of the algorithm under the regularity conditions (A1)-(A4) (given in Appendix C): under these conditions, with probability 1, all points are classified after a finite number of trials. The proof is given in Appendix C.
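To get a feeling for the size of the constant in Theorem 4.1, the following sketch evaluates β^{1/2} = (δ/|X|)^{-1/2} and checks the union-bound arithmetic behind it: a per-point failure probability of 1/β summed over all |X| candidates gives δ. The function name `beta_sqrt` and the example values are illustrative assumptions.

```python
import math

def beta_sqrt(delta, n_candidates):
    """beta^{1/2} = (delta / |X|)^{-1/2} as required by Theorem 4.1."""
    return math.sqrt(n_candidates / delta)

# Example values (assumed for illustration): 40 candidates, delta = 0.05.
b = beta_sqrt(0.05, 40)
per_point_failure = 1.0 / b ** 2           # Chebyshev-type bound per candidate
total_failure = 40 * per_point_failure     # union bound over all candidates
```

With these example values β^{1/2} is about 28.3, far larger than the value β^{1/2} = 3 used in the experiments, which reflects how conservative a bound of this type is compared with a practical tuning choice.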

Numerical Experiment
In this section, we compare the performance of existing methods and the proposed method through numerical experiments, and confirm the effectiveness of the proposed method. For comparison, we considered the existing methods Straddle [Bryan et al., 2006] and MILE [Zanette et al., 2018], as well as random sampling (RS). For the proposed method, we used $\beta^{1/2} = 3$ for calculating $\hat{a}_t(x)$. Furthermore, the estimation of $H$ was also performed using $\beta^{1/2} = 3$. In these experiments, we set $p_t = 0$ and $\epsilon = 0$ for simplicity. Moreover, we used the F1-score as the measure of classification accuracy. In addition, for each synthetic/real function, we calculated the true probability $p^*_x$ by using 100,000 Monte Carlo simulations and defined the true $H$ accordingly.
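The F1-score used as the accuracy measure can be computed from the estimated and true reliable sets as sketched below. Treating membership in H as the positive class and returning 0 when there are no true positives are conventions assumed for this sketch.

```python
def f1_score(estimated, true):
    """F1-score of an estimated set against the true set (H is the positive class)."""
    estimated, true = set(estimated), set(true)
    true_positives = len(estimated & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)
```

For instance, an estimate that contains two of three true points plus one false positive has precision and recall both equal to 2/3.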

1d-synthetic function
We confirmed the classification accuracy and the quality of the approximation of the AF in IU-rLSE by using a one-dimensional synthetic function $f(x)$. We defined $\mathcal{X}$ as the set of grid points obtained by dividing $[-0.5, 5.5]$ into 40. Furthermore, we used the Gaussian kernel $k(x, x') = \sigma_f^2 \exp(-\|x - x'\|^2 / L)$ and set $\sigma_f^2 = 100$ and $L = 0.5$. Moreover, we used $\sigma^2 = 10^{-4}$ as the error variance and $h = 8$ as the threshold for $f(x)$. In this experiment, we considered the following two distributions as the input distribution:
Case1 $S(x) = x + \mathrm{Gamma}(5, 0.03)$.
Case2 $S(x) = x + N(0, 0.07^2)$.
Here, $\mathrm{Gamma}(a, b)$ is the gamma distribution with parameters $a$ and $b$.
The experimental results are given in Figure 4. From Figure 4, we can confirm that the proposed method performs better than the existing methods. Note that the existing methods Straddle, MILE and RS focus on the classification of $f$, whereas our target is $p^*_x$, not $f$. Since the classification target of the existing methods is different, it is natural that their accuracy is low. However, the classification procedure of the proposed method can also be combined with the existing methods. Specifically, in each iteration of IU-rLSE, classification is performed using (3) and (4), and the existing methods are used only for selecting the next evaluation point. In other words, only the acquisition function of the existing method is used, and the proposed method is used for classification. Hereinafter, the term existing method refers to this combined procedure.
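The two input distributions of this experiment can be simulated as sketched below. One assumption is made loudly here: Gamma(5, 0.03) is read as shape 5 and scale 0.03 (mean shift 0.15), which is the parameterization used by Python's `random.gammavariate`; the text does not state whether the second parameter is a scale or a rate. The function names are hypothetical.

```python
import random

def sample_case1(x, rng):
    """Case1: one-sided perturbation, S(x) = x + Gamma(5, 0.03).

    Assumption: shape = 5, scale = 0.03 (not confirmed by the text).
    """
    return x + rng.gammavariate(5, 0.03)

def sample_case2(x, rng):
    """Case2: symmetric perturbation, S(x) = x + N(0, 0.07^2)."""
    return x + rng.gauss(0.0, 0.07)

rng = random.Random(0)
shifts_case1 = [sample_case1(1.0, rng) - 1.0 for _ in range(20000)]
shifts_case2 = [sample_case2(1.0, rng) - 1.0 for _ in range(20000)]
```

Under this reading Case1 always pushes the actual input to the right of the requested one, while Case2 perturbs it symmetrically, so the two cases probe skewed and unskewed input uncertainty respectively.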

Sinusoidal function
In this subsection, we used $f(x_1, x_2) = -\sin(10 x_1) - \cos(4 x_2) + \cos(3 x_1 x_2)$ as the true function. Note that in the numerical experiments of [Zanette et al., 2018], $-f(x_1, x_2)$ was used as the true function. We defined $\mathcal{X}$ as the set of grid points obtained by dividing $[0, 1] \times [0, 2]$ into $30 \times 60$. Furthermore, we used the Gaussian kernel with $\sigma_f^2 = e^2$ and $L = 2e^{-3}$. In addition, we used $\sigma^2 = 10^{-4}$ and $h = -0.5$. In this experiment, we assumed that the input was a two-dimensional random vector whose elements are mutually independent and identically distributed. As the distribution of each element, the same settings as in the previous subsection were used. Figure 5 shows the experimental results based on 20 Monte Carlo simulations. From Figure 5, we can confirm that the F1-score of the proposed method is larger than those of the existing methods.
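The test function and candidate grid of this experiment can be written down directly, as sketched below. The grid is assumed to consist of 30 by 60 equally spaced points including the endpoints, which is one possible reading of the description; the noise-free level set computed at the end is included only for orientation and is not the reliable region studied in the paper.

```python
import math

def f(x1, x2):
    """Sinusoidal test function used as the true function."""
    return -math.sin(10 * x1) - math.cos(4 * x2) + math.cos(3 * x1 * x2)

# Assumed grid: 30 x 60 equally spaced points on [0, 1] x [0, 2], endpoints included.
grid = [(i / 29, 2 * j / 59) for i in range(30) for j in range(60)]

h = -0.5
plain_level_set = [p for p in grid if f(*p) <= h]   # ignores input uncertainty
```

Comparing this plain level set with a Monte Carlo estimate of the reliable region under the chosen input distribution shows which boundary points are lost once inputs are perturbed.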
In this experiment, we assumed the following two cases for the input distribution of each element:

1d-synthetic function with unknown input distribution
In this subsection, we considered the situation where the input distributions are unknown. We used the same setting as in Subsection 5.1.1 except for the input distributions, which were parameterized by $(\bar{\mu}, \bar{\sigma}^2)$. Under this setting, we considered two cases:
Case1 The true parameter is $(\bar{\mu}, \bar{\sigma}^2) = (0, 0.4^2)$, and we assume that $\bar{\mu}$ is known and $\bar{\sigma}^2$ is unknown.
In Case1, we used $\pi(\bar{\sigma}^{-2}) = \mathrm{Gamma}(3, 0.48)$ as the prior distribution of $\bar{\sigma}^{-2}$. Similarly, in Case2, we used $\pi(\bar{\mu}) = N(0, 0.8^2)$ as the prior distribution of $\bar{\mu}$. Note that the estimated input distributions $g_t(s \mid \theta_x)$ in Case1 and Case2 are given by a t-distribution and a normal distribution, respectively (see, e.g., [Bishop, 2006]). The experimental results are shown in Figure 7. From Figure 7, we can confirm that the proposed method performs better than the existing methods even in this setting.
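The conjugate updates behind these two cases can be sketched as follows, assuming that the input shift s − x is Gaussian with mean μ̄ and variance σ̄² (the exact input distribution is not recoverable from the text). For Case1 the Gamma prior on the precision is read as shape 3 and rate 0.48, which gives a prior mean of 6.25 = 1/0.4²; this parameterization is an assumption. The function names are hypothetical.

```python
def update_precision_posterior(a, b, shifts, mean=0.0):
    """Known mean, unknown precision with a Gamma(a, b) prior (b is a rate).

    Returns the parameters of the Gamma posterior; the corresponding
    predictive distribution of a new shift is a Student-t with 2 * a_n
    degrees of freedom, location `mean` and squared scale b_n / a_n.
    """
    a_n = a + len(shifts) / 2.0
    b_n = b + 0.5 * sum((d - mean) ** 2 for d in shifts)
    return a_n, b_n

def update_mean_posterior(m0, v0, shifts, shift_var):
    """Known variance, unknown mean with a N(m0, v0) prior.

    Returns the posterior mean and variance of the unknown mean; the
    predictive distribution of a new shift is N(m_n, v_n + shift_var).
    """
    v_n = 1.0 / (1.0 / v0 + len(shifts) / shift_var)
    m_n = v_n * (m0 / v0 + sum(shifts) / shift_var)
    return m_n, v_n
```

Each observed pair of requested input x and realized input s contributes one shift s − x, so the estimate of the input distribution sharpens as the trials proceed.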

Real-Data Experiment
In this subsection, we confirmed the classification accuracy by using the Combined Cycle Power Plant (CCPP) dataset [Dua and Graff, 2017, Tufekci, 2014, Kaya et al., 2012]. CCPP contains 9568 instances and consists of four parameters (Temperature, Ambient Pressure, Relative Humidity, Exhaust Vacuum) representing the state of the CCPP as inputs, and the time-averaged amount of power generation as the output. Accurate control of the CCPP state parameters is difficult due to environmental factors and control errors, and hence there is input uncertainty. We first centered the output of each instance to mean 0, and normalized each input feature to mean 0 and variance 1. In this experiment, we first extracted 7568 instances at random, calculated the posterior mean of a GP using them, and regarded it as the true function. The remaining 2000 instances were used as the set of candidate points $\mathcal{X}$. We used the Gaussian kernel with $\sigma_f^2 = 300$ and $L = 2$, and set $\sigma^2 = 0.5$ and $h = -15$. As the input distribution, we used $S(x) = x + N(0, 0.125^2)$. The experimental results based on 20 Monte Carlo simulations are shown in Figure 8. From the left panel of Figure 8, we can see that the F1-score of the proposed method is larger than those of the existing methods. Furthermore, we performed an experiment similar to that in Subsection 5.1.1. From the right panel of Figure 8, we can see that the precision of the proposed method tends to 1. On the other hand, the precision of the existing methods (which focus on the classification of $f$) does not tend to 1.

Conclusion
We considered the problem of identifying input points at which the probability that the black-box function $f$ falls below the threshold $h$ is greater than $\alpha$, in the situation where the inputs are uncertain. We proposed a level set estimation method and acquisition functions by assuming a GP as the prior distribution of $f$ and constructing credible intervals for the probability that $f$ falls below the threshold $h$ under input uncertainty. Through theoretical analysis and numerical experiments, we confirmed that the proposed method performs better than the other methods.
Proof. From the definition of $p_{t,x}$, it is sufficient to show that the integral is finite (see [Papoulis and Pillai, 2002]).
Next, the following lemma holds:
Lemma A.2. Let $\delta \in (0, 1)$. Then, with probability at least $1 - \delta$, it holds that
$$|p_{t,x} - \mu^{(p)}_t(x)| < \delta^{-1/2}\gamma_t(x),$$
where $\mu^{(p)}_t(x)$ and $\gamma^2_t(x)$ are given by (1) and (2), respectively.

A.2 Details of Acquisition Function
In this subsection, we derive several lemmas on the acquisition function. First, a lemma (Lemma A.3) characterizes the solution of the inequality appearing in the indicator function of the approximate AF. In its proof, (A.2) is rewritten as the quadratic inequality (17) in $\Phi_s$, and by using the quadratic formula the solution of (17) is expressed through its two roots $\Phi^-_s$ and $\Phi^+_s$. Next, we assume $\Phi^+_s > 1$. Then, (17) does not have any solutions on $[\Phi^-_s, 1]$. However, (17) holds when $\Phi_s = 1$. This is a contradiction. Hence, we get $\Phi^+_s \le 1$. Finally, the claim follows from (16), (18) and (19).
Next, we derive a lemma on the exact form of the integral in the acquisition function.
Lemma A.4. Let $p(y^* \mid s^*)$ be the probability density function of the normal distribution with mean $\mu_t(s^*)$ and variance $\sigma^2_t(s^*) + \sigma^2$. Then, the integral (20) admits a closed-form expression in terms of the cdf of the standard normal distribution.
Proof. By substituting (21) into the indicator function in (20), the condition inside the indicator becomes a linear inequality in $y^*$. Therefore, noting that $p(y^* \mid s^*)$ is a normal density function, the result follows from the symmetry of the normal distribution.

B Proof of Theorem 4.1
Proof. From Lemma A.2, putting $\beta^{1/2} = (\delta/|\mathcal{X}|)^{-1/2}$, for any $x \in \mathcal{X}$ it holds that $p^*_x \in Q_T(x)$ with probability at least $1 - \delta/|\mathcal{X}|$, where $T$ denotes the value of $t$ at the end of the algorithm. Hence, with probability at least $1 - \delta$, it holds that $p^*_x \in Q_T(x)$ for all $x \in \mathcal{X}$ simultaneously. Therefore, by combining this result with the classification rule and the definition of $e_\alpha(x)$, we get Theorem 4.1.

C Proof of Theorem 4.2
In this section, we derive a theorem on the convergence properties of the algorithm. First, we define several notations. For each $x \in \mathcal{X}$, let $D_x$ denote the set of points that can be observed when $x$ is observed. Then, define $\tilde{D}$ as $\tilde{D} = \bigcup_{x \in \mathcal{X}} D_x$. Furthermore, let $A_t$ be the input random variable at the $t$th trial, and let $Y_{A_t}$ be the output random variable corresponding to $A_t$. Then, define $\tilde{\mu}_t$ as the posterior mean function based on the data $\{(A_i, Y_{A_i})\}_{i=1}^t$. Next, we assume four conditions (A1)-(A4), of which (A3) and (A4) read as follows:
(A3) For any $x \in \tilde{D}$, the kernel function $k$ is continuous at $(x, x)$.
(A4) For any $\xi > 0$ and $x \in \tilde{D}$, there exists $\delta_{\xi,x} > 0$ such that $|\sigma^2_t(x) - \sigma^2_t(x')| < \xi$ for any $t \ge 1$, $x_1, \ldots, x_t \in \tilde{D}$ and $x' \in N(x; \delta_{\xi,x})$.
Condition (A1) is satisfied when each $\eta_t$ is greater than a positive constant $c$. Similarly, when $\eta_t = o(t^{-1})$, (A1) also holds. Condition (A2) requires that the probability that an input point falls in a region where the posterior mean approaches the threshold $h$ can be made sufficiently small when $\delta_\xi$ becomes small. Condition (A3) requires that the kernel function $k$ is continuous on $\tilde{D} \times \tilde{D}$, and (A4) requires the equicontinuity of the sequence of posterior variances. Under these conditions, Theorem 4.2 holds. The proof is given in Subsections C.1-C.2.

C.1 Preparation of the proof
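In this subsection, we provide two lemmas for proving Theorem 4.2. First, for any finite subset $\Omega$ of $\tilde{D}$, the following lemma holds:
Lemma C.1. Assume that conditions (A1)-(A4) hold. Then, with probability 1, the conclusion of Theorem 4.2 in [Inatsu et al., 2019] holds for any $x \in \Omega$.
Since the proof is the same as that of Theorem 4.2 in [Inatsu et al., 2019], we omit the details.
Next, the following lemma on the compactness of $\tilde{D}$ holds:
Lemma C.2. The set $\tilde{D}$ is compact.
Proof. From the definition of $\tilde{D}$, the set $\tilde{D}$ satisfies $\tilde{D} \subset D$. In addition, noting that $D$ is bounded, we have that $\tilde{D}$ is also bounded. Hence, it is sufficient to show that $\tilde{D}$ is a closed set. Let $\mathrm{cl}(\tilde{D})$ be the closure of $\tilde{D}$. We prove $\tilde{D} = \mathrm{cl}(\tilde{D})$. From the definition of the closure, we get $\tilde{D} \subset \mathrm{cl}(\tilde{D})$. Next, we show $\mathrm{cl}(\tilde{D}) \subset \tilde{D}$. Let $x'$ be an arbitrary point of $\mathrm{cl}(\tilde{D})$. Then, since the number of elements in $\mathcal{X}$ is finite, the following formula holds:
$$\mathrm{cl}(\tilde{D}) = \mathrm{cl}\Big(\bigcup_{a \in \mathcal{X}} D_a\Big) = \bigcup_{a \in \mathcal{X}} \mathrm{cl}(D_a).$$
Hence, we get $x' \in D_x \subset \tilde{D}$ for some $x \in \mathcal{X}$, because $\xi$ is an arbitrary positive number. Thus, it holds that $\mathrm{cl}(\tilde{D}) \subset \tilde{D}$. Therefore, we have $\tilde{D} = \mathrm{cl}(\tilde{D})$. Finally, by using the fact that the closure is a closed set, $\tilde{D}$ is also closed.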
Note that $\mathcal{U}$ is an open cover of $\tilde{D}$. In addition, from Lemma C.2, $\tilde{D}$ is compact. Hence, $\mathcal{U}$ has a finite subcover $\tilde{\mathcal{U}} \equiv \{N(x_i; \delta_{a/2, x_i}) \mid i = 1, \ldots, U,\ x_i \in \tilde{D}\} \subset \mathcal{U}$.
Hence, for any $x \in \Omega$ and $x' \in N(x; \delta_{a/2,x})$, it holds that $\sigma^2_T(x') < a$. Thus, using this inequality and (26), we can show that $\sigma^2_T(s') < a$ for any $s' \in \tilde{D}$. Recall that the positive number $a$ satisfies $\Phi(-\delta_\xi / a^{1/2}) < \xi$. Consequently, with probability 1, there exists a number $N$ such that $\max_{x \in \mathcal{X}} \gamma^2_N(x) < 2\xi$.
Finally, from the definition of the classification rule, each point $x \in \mathcal{X}$ is classified into $H_t$ or $L_t$ if $\beta^{1/2}\gamma_t(x) < \epsilon$. Hence, if $\max_{x \in \mathcal{X}} \gamma^2_t(x) < \epsilon^2\beta^{-1}$, all points are classified. Therefore, since $\xi$ is any positive number, putting $\xi = 2^{-1}\epsilon^2\beta^{-1}$ we have Theorem 4.2.