Scalable Hyper-Ellipsoidal Function With Projection Ratio for Local Distributed Streaming Data Classification

Learning streaming data with a limited amount of memory has become an interesting problem. Although several learning methods based on the interesting concept of discard-after-learn have recently been proposed, their performance can be further improved in terms of faster learning speed, fewer redundant neurons, and higher classification accuracy. The following new concepts and approaches are proposed in this paper: (1) a more generic hyper-ellipsoidal structure called the Scalable Hyper-Ellipsoidal Function (SHEF), capable of handling the curse of dimensionality by introducing a regularization parameter into the covariance matrix of the SHEF; (2) a new recursive function to update the covariance matrix of a SHEF based on only the incoming data chunk; (3) fast and easy conditions to test whether two SHEFs overlap, lie inside one another, or touch; (4) a new distance measure, namely the Projection Ratio, for determining the class of a queried datum based on the distance projected onto only one discriminant vector. The experimental results show significant improvement over VLLDA, ILDA, LOL, VEBF, and CIL in terms of classification accuracy, number of generated neurons, and computational time.


I. INTRODUCTION
Classifying or learning streaming data has been an interesting topic in many fields such as business (financial data [1], credit card fraud detection [2]), academia, and medical information (health care sensors [3], EEG signals [4], [5]). Streaming data refers to data that are continuously generated without any time bound. The amount of data generated at any point in time may vary according to the environment. The data flow into a learning process either one datum at a time or one chunk of unfixed size at a time. One obvious example of streaming data is the data temporally generated throughout the internet in every unit of time. Remarkably, the speed of data generation exceeds the speed of developing the technology for memory capacity per unit area. Due to this technology lag, the streaming data cannot be entirely stored in memory, causing the problem of how to accurately learn these data in real-time and online applications. Generally, this type of data has many names such as streaming data, data streams, online data, and real-time data. This situation leads to a very challenging development of new neural learning algorithms to cope with the problems of learning streaming data under the constraints of memory overflow and fast learning speed, as well as limited computing resources such as low energy consumption [6]-[8]. Moreover, the neural learning speed must be faster than the speed of incoming data in order to avoid any data loss and to achieve classification accuracy with respect to the occasional testing data.

The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Cusano.
Streaming data can be in various forms, but this study concerns only numeric data in the form of continuous chunks of vectors representing sets of features extracted from learning objects such as images and characters. The process of extracting important features is not considered in this study. In addition, some specific characteristics of data and targets, such as concept drift [9]-[12] and time series, are not involved because these characteristics occur only in some applications.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Traditional classification methods such as k-Nearest Neighbors (k-NN), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA) were built on the condition that the entire data set must be kept in memory during the training process. These traditional methods can also be applied to streaming data, but the results are not accurate because the entire training data set is not known during the training period; only the incoming data are involved in the training process. Besides, the new incoming data must be combined with the previously trained data in order to retrain the network, which creates the consequence of memory overflow. Some new incoming data are lost after a memory overflow crisis, and retraining on the augmented training data set makes the number of epochs increase uncontrollably. However, some attempts, such as incremental learning, were introduced to mitigate the memory overflow constraint by controlling the number of deployed neurons.
An incremental learning method gradually learns the data from small portions of a whole data set by adding one or more neurons at a time. It is suitable for learning big data or streaming data when memory overflow may occur. This approach has been applied and combined with various classification methods such as LDA [6], [13]-[15], Neural Network [16], ILDA [17], and SVM [18], [19]. Although incremental learning has achieved some success, the memory overflow constraint still exists and the learning time is still uncontrollable. There are several mathematical structures behind incremental learning, based on data distributions that can be either linear or non-linear.
Handling a nonlinear classification problem may be divided into three main approaches: (1) nonlinear kernel classifiers, such as the Radial Basis Function kernel [20]; (2) a local hyperplane [21]; and (3) a mathematical structure based on the local data distribution, such as the hyper-ellipsoidal function family of the Versatile Elliptic Basis Function (VEBF) [22], [23], k-Nearest Neighbors [24], and k-means clustering [25]. Some approaches [24], [25] require parameters such as the center and the variance of the data distribution of each class to be computed before the training process. This implies that the entire training data must be known in advance, which is impossible in the case of streaming data.
Although the VEBF-based structure is rather convenient and efficient for the new learning environment and constraints, the learning accuracy depends upon the initial values of the VEBF parameters. The problem is that the incoming chunk of data is not the actual population of the whole training data. Thus, the initial values computed from only the incoming data chunk may be inaccurate. Furthermore, updating these parameters must be done continuously during the learning process, which makes the time complexity unavoidably increase. Besides the problem of initializing parameters, the procedure to identify the class of a queried testing datum is mainly based on the distance between the datum and the center of the VEBF. This distance works well for some applications, but it gives inaccurate classification results in most cases because the data density and distribution inside the VEBF are not involved. The data density and distribution can be viewed as the information stored inside the VEBF, and identifying the class of a queried datum must minimally interfere with the data information of the corresponding VEBF. Another problem not studied in those VEBF-based structures is the situation when the number of feature dimensions is larger than the number of training data. This implies that the eigenvectors and eigenvalues of the covariance matrix of a VEBF cannot be uniquely computed. Many applications, especially those involving medical data, face this crisis.
A new classifier method named the Scalable Hyper-Ellipsoidal Function (SHEF) method is proposed to solve the above-mentioned problems in the VEBF-based structure learning process. Our approach concerns the direction of the data distribution inside any local cluster. The local characteristics imply the local importance of each dimension, which can be used to compute the nearest distance between a VEBF and a queried datum in order to identify its correct class. Several new concepts are introduced to solve these problems, such as the projection ratio and checking the overlap, insideness, and touching of two hyper-ellipsoids.
The rest of the paper is organized as follows. Section II addresses the constraints and studied problems. Section III summarizes the relevant concepts and background used in this study. Section IV discusses the new structure and the equations for updating the parameters of SHEF. Section V explains the proposed new learning method using the scalable hyper-ellipsoidal function and illustrates our algorithm step-by-step. Section VI proposes our new distance measure and its concept in order to determine the class of a queried datum. Experimental data and hyperparameter set-up are described in Section VII. Section VIII reports the experimental results and the performance comparison with the other methods, and shows the effects of our hyperparameters on classification accuracy and the number of SHEFs in a grid-search diagram. Section IX discusses the rationale behind the results. Section X concludes the paper.

II. CONSTRAINTS AND STUDIED PROBLEMS
The proposed method in this study is based on a hyper-elliptical structure similar to the methods in Versatile Elliptic Basis Function (VEBF) [22], Class-wise Incremental Learning (CIL) [23], Versatile Hyper-Elliptic Clustering (VHEC) [26], and Dynamic multi-Stratum (Dstratum) [27]. These methods solve the memory overflow problem in a streaming data environment by deploying the concept of discard-after-learn, where either one datum or one data chunk of several classes is learned at a time and completely discarded after that. To improve some inferior capabilities of these methods, a few new concepts are introduced to speed up the computation of the structure's parameter updates, to relieve the curse of dimensionality, and to measure the class distance for testing data.
The structure of the hyper-ellipsoid is simple and versatile enough for classification, but the accuracy of classification with this structure depends upon the initial width of each created hyper-ellipsoid. To elaborate on how the initial width affects classification accuracy, Fig. 1 illustrates, in the second and third rows, how the VEBF and CIL methods capture four synthetic data sets, i.e. Sparse-dense data, Faraway4 data, Spiral data, and Gaussian data, in terms of the number of hyper-ellipsoids and their directions. Originally, VEBF [22] was designed to learn one vector at a time in order to reach the lower bound of learning time complexity. It explores the entire training set to compute the average pairwise distance and uses this distance as the initial width of each hyper-ellipsoid, without touching any vectors in the testing set. CIL [23] extended the concept of VEBF by deploying only the first 20% of the training data. Note that if the initial width is not properly initialized, then hyper-ellipsoids from different classes may overlap. Moreover, if the initial width is too small, then too many redundant hyper-ellipsoids are created during the learning period.

A. CONSTRAINTS
This study focuses only on the streaming data environment. Each datum consists of a feature vector and a target. The following constraints are considered in this study.
1) New incoming data gradually flow into the learning process as a chunk of data. The memory size is large enough to hold the incoming data chunk.
2) The probability distribution of the data in each class is unknown in advance.
3) Each chunk may contain one or more classes. The number of incoming classes is unknown in advance.
4) The size of each class in each incoming chunk at any time is arbitrary.
5) Learned data are assumed to have no class drift or class change characteristic.
6) Incoming data are completely discarded from the learning process after being learned (the discard-after-learn concept).
7) The size of the working memory is assumed to be fixed throughout the learning and testing processes.
8) The learning process is executed by one single processing unit.

B. STUDIED PROBLEMS
The following problems are studied in order to increase the classification accuracy, reduce the number of neurons, and solve the curse-of-dimensionality condition due to an insufficient amount of data, which remains present in the VEBF-based approaches.
1) How can the singular covariance matrix problem, due to the curse of dimensionality, be handled?
2) How can an appropriate initial width be computed without using the whole training data in advance?
3) Is there a new distance measure to select the nearest hyper-ellipsoid with respect to a queried datum?
Some relevant backgrounds used in this study will be given in the next section.

III. RELEVANT BACKGROUND
Our proposed method is related to the structure of the hyper-elliptic function introduced in [22], [23] and the concept of linear discriminant analysis. A summary of each related issue is given in the following subsections.

A. BASIC CONCEPT OF STANDARD HYPER-ELLIPSOID FUNCTION
Let x_i ∈ R^d, 1 ≤ i ≤ N, be the i-th d-dimensional data vector written in the form of a column vector. Suppose a set of data vectors X = {x_1, x_2, ..., x_N} belongs to class A. The distribution directions of all vectors in set X and the variance of the data in each direction can be captured by using the covariance matrix of set X, denoted by S:

S = E[(x − c)(x − c)^T],

where E[·] represents the expected value. To realize the concept of discard-after-learn, it is better to compute matrix S in the form of a summation as defined in (3):

S = (1/N) Σ_{i=1}^{N} (x_i − c)(x_i − c)^T,   (3)

where c ∈ R^d is the mean or centroid of the data vectors in X.

The distribution directions of all data vectors in set X are the set of eigenvectors of S, denoted by U = {u_1, u_2, ..., u_d}, such that each ||u_i|| = 1. The data variances along all eigenvectors are the set of corresponding eigenvalues, denoted by Λ = {λ_1, λ_2, ..., λ_d}. The covariance matrix S can be diagonalized in terms of its eigenvalues and eigenvectors as S = UΛU^T. The generic equation of a hyper-ellipsoid can be written in terms of the covariance matrix S, center c, and a constant ξ as

(x − c)^T S^{-1} (x − c) = ξ,

where the constant ξ adjusts the size of the hyper-ellipsoid and is usually set to 1.
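The computations above can be sketched in a short Python snippet (the data set is an illustrative assumption, not from the paper): it builds S in summation form, verifies the diagonalization S = UΛU^T, and tests hyper-ellipsoid membership.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy class-A data: N = 200 vectors in d = 2 dimensions (illustrative only).
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.6, 0.5]])
c = X.mean(axis=0)                      # centroid c
S = (X - c).T @ (X - c) / len(X)        # covariance matrix S, summation form (3)

# Distribution directions / variances: eigenvectors U and eigenvalues lambda_i.
lam, U = np.linalg.eigh(S)              # for symmetric S: S = U diag(lam) U^T

# Check the diagonalization S = U Lambda U^T.
assert np.allclose(U @ np.diag(lam) @ U.T, S)

def inside_ellipsoid(x, c, S, xi=1.0):
    """Test (x - c)^T S^{-1} (x - c) <= xi for the hyper-ellipsoid."""
    diff = x - c
    return float(diff @ np.linalg.solve(S, diff)) <= xi

print(inside_ellipsoid(c, c, S))        # the centroid is always inside
```

Note that `np.linalg.eigh` exploits the symmetry of a covariance matrix and returns the eigenvalues in ascending order.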

B. CONCEPT OF LDA WITH MULTIPLE CLASSES AND DICHOTOMOUS CLASSES
Suppose X = {x_1, x_2, ..., x_N} is a set of N data vectors with K classes. Let C_k denote class k and n_k = |C_k|, for 1 ≤ k ≤ K. The covariance matrix of each class C_k is denoted by S_k. The centroid of X is c and the centroid of each class k is c_k. Traditional LDA aims to find (K − 1) discriminant vectors forming a d-by-(K − 1) projection matrix W = [w_1 · · · w_{K−1}] that maximizes the following Fisher criterion [28]:

J(W) = |W^T S_B W| / |W^T S_W W|.

The between-class scatter matrix S_B and within-class scatter matrix S_W are defined as

S_B = Σ_{k=1}^{K} n_k (c_k − c)(c_k − c)^T,   S_W = Σ_{k=1}^{K} Σ_{x ∈ C_k} (x − c_k)(x − c_k)^T.

For the special case when LDA is used for a dichotomous class problem (two classes, K = 2), the projection matrix W consists of only one d-dimensional discriminant vector w, which can be computed by

w = S_W^{-1} (c_1 − c_2).
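The dichotomous-class computation can be sketched as follows. The two synthetic classes (their locations and scales) are assumptions for illustration; the snippet forms S_W, solves for w, and checks that the class means separate along w.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))   # class 1 (toy)
X2 = rng.normal(loc=[3.0, 1.0], scale=0.5, size=(100, 2))   # class 2 (toy)

c1, c2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - c1).T @ (X1 - c1)            # class scatter of class 1
S2 = (X2 - c2).T @ (X2 - c2)            # class scatter of class 2
S_W = S1 + S2                           # within-class scatter matrix

# Discriminant vector w = S_W^{-1} (c1 - c2), normalized to unit length.
w = np.linalg.solve(S_W, c1 - c2)
w /= np.linalg.norm(w)

# Projections of the two classes onto w should have well-separated means.
p1, p2 = X1 @ w, X2 @ w
print(abs(p1.mean() - p2.mean()))
```

Using `np.linalg.solve` instead of explicitly inverting S_W is both cheaper and numerically safer.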

C. CHECKING OVERLAP OF TWO HYPER-ELLIPSOIDS
Checking the overlap of two hyper-ellipsoids is essential in order to merge two hyper-elliptic structures of the same class into a larger one. This paper modifies the method of checking whether two ellipsoids touch at a single point proposed by Alfano and Greer [29]. Their concept is as follows. Let A and B be the matrices representing the first and second ellipsoids, respectively. Suppose X and Y are data vectors on the first and second ellipsoids, respectively. The equations of both ellipsoids can be written as

X^T A X = 0,   Y^T B Y = 0.

Assume that X lies on both ellipsoids when they touch each other; thus X^T A X = 0 and X^T B X = 0. Testing the touch of both ellipsoids can be transformed into an eigenvalue formulation as follows. A constant λ is first multiplied with matrix A, which gives X^T (λA − B) X = 0. Hence, the relation |λI − A^{-1}B| = 0 is the condition for testing the touch of two ellipsoids, where | · | represents the determinant of a matrix.

IV. PROPOSED SCALABLE HYPER-ELLIPSOIDAL FUNCTION AND PARAMETER UPDATING
The previously introduced hyper-ellipsoidal functions [22], [23], [26], [27] were not designed to solve the curse-of-dimensionality problem, which occurs in various applications. To improve this inferior capability, a new hyper-ellipsoid function called the scalable hyper-ellipsoidal function is proposed as follows.

A. STRUCTURE OF SCALABLE HYPER-ELLIPSOIDAL FUNCTION
Expanding or shrinking the size of a hyper-elliptic function requires the computation of eigenvectors and eigenvalues first. To reduce this prior computation, the following generic form of the standard hyper-elliptic function is used in our approach. Instead of setting the right-hand side of the standard hyper-elliptic function to a constant of one, this constant is replaced by a positive scalable constant r so that the width of the hyper-ellipsoid in each dimension can be easily scaled by using only r. The equation of this new scalable hyper-ellipsoidal function (SHEF) is defined as

(x − c)^T S^{-1} (x − c) = r^2,   (16)

where x ∈ X is a data vector, c is the center of X, and S is the covariance matrix of X. Let λ_i be the eigenvalue of the i-th dimension of the SHEF when r^2 = 1. If r^2 ≠ 1, then the new eigenvalue becomes r^2 λ_i, which implies that λ_i is scaled by r^2. Fig. 2 illustrates the geometrical structure and its eigenvalues when the scalable constant r > 0 as defined in (16), with radius r√λ_i along the i-th dimension. However, the directions of the eigenvectors are not scaled or changed by r.
In several applications, such as medical data classification, the covariance matrix S can be singular because the amount of data is smaller than the number of dimensions. To avoid this condition, the concept of regularization [30], [31] was adapted to SHEF by adding a small positive constant ε to the covariance matrix S:

S* = S + εI,

where I is a d-by-d identity matrix. In this paper, the regularization parameter ε is set to 0.0001.
Lemma 1: Let S and S* be two covariance matrices such that S* = S + εI. Covariance matrix S* has the same set of eigenvectors as S, and each eigenvalue λ*_i = λ_i + ε.
Proof: The covariance matrix S can be factorized in terms of U and Λ as S = UΛU^T (18). Substituting (18) into S* = S + εI, we obtain S* = UΛU^T + εI. Since U is an orthogonal matrix, UU^T = I. Hence, S* = UΛU^T + εUU^T = U(Λ + εI)U^T.
In the case of a zero covariance matrix (i.e., there is only one datum in the SHEF), S is singular. Hence, the initial width of the SHEF in each dimension is set to √ε instead.
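Lemma 1 can be checked numerically; the sketch below (random data with assumed dimensions) confirms that regularization shifts every eigenvalue by ε while leaving the eigenvectors unchanged up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 5))            # toy data: 30 samples, 5 dimensions
S = np.cov(A, rowvar=False)             # sample covariance matrix
eps = 1e-4
S_star = S + eps * np.eye(5)            # regularized covariance (Lemma 1)

lam, U = np.linalg.eigh(S)
lam_star, U_star = np.linalg.eigh(S_star)

# Eigenvalues are shifted by exactly eps ...
assert np.allclose(lam_star, lam + eps)
# ... and eigenvectors are unchanged up to sign (|u_i . u*_i| = 1).
assert np.allclose(np.abs(np.diagonal(U.T @ U_star)), 1.0)
print("Lemma 1 verified numerically")
```

Because the shift ε is uniform, the ordering of the eigenvalues is preserved, so the i-th columns of U and U_star correspond to the same direction.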

B. UPDATING PARAMETERS OF SHEF
Each SHEF contains four parameters: the number of captured data (n), the centroid of captured data (c), the covariance matrix of captured data (S), and the class of captured data (z).
Since the training process is based on the concept of discard-after-learn, an incoming datum is discarded after being captured by a SHEF, so the first three parameters of the SHEF must be updated according to the most recent incoming datum. Assume that an incoming datum x_new ∈ R^d is captured by the j-th SHEF of the same class. Let n_old_j, c_old_j, c_new_j, S_old_j, and S_new_j be the current number of data vectors, the current centroid, the updated centroid, the current covariance matrix, and the updated covariance matrix, respectively. To cope with the possibility of data overflow and to preserve the time and space complexities when employing the concept of discard-after-learn, [22], [23], [26] suggested a set of recursive functions, (22) and (23), for computing and updating the new centroid and new covariance matrix.
Although these recursive functions efficiently support the concept of discard-after-learn, it is possible to speed up the updating of the covariance matrix by rewriting (23) as stated in the following theorem.
Theorem 1: A new covariance matrix S_new_j can be computed by the recursive function in (24).
The proof of (24) is given in APPENDIX A. Note that (24) reduces the time spent in (23) on the terms involving the old centroid c_old_j. Although Theorem 1 addresses only one incoming datum, (24) can be adapted to an incoming data chunk by updating the covariance matrix with one datum at a time.
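The discard-after-learn bookkeeping can be illustrated with a standard rank-1 recursive update (a Welford-style formula). This is a sketch in the spirit of (22)-(24), not the paper's exact equations, whose form may differ.

```python
import numpy as np

def update(n_old, c_old, M_old, x_new):
    """One discard-after-learn step: fold x_new into the running statistics.

    c is the centroid and M the running sum of outer products (S = M / n).
    Welford-style rank-1 update; a sketch, not the paper's exact (24).
    """
    n_new = n_old + 1
    c_new = (n_old * c_old + x_new) / n_new                 # centroid, as (22)
    M_new = M_old + np.outer(x_new - c_old, x_new - c_new)  # rank-1 update
    return n_new, c_new, M_new

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))            # toy stream of 50 three-dim. data

n, c, M = 1, X[0].copy(), np.zeros((3, 3))
for x in X[1:]:
    n, c, M = update(n, c, M, x)        # each datum is discarded after this

S_stream = M / n                        # streaming covariance
c_batch = X.mean(axis=0)
S_batch = (X - c_batch).T @ (X - c_batch) / len(X)
assert np.allclose(c, c_batch)
assert np.allclose(S_stream, S_batch)   # matches the batch result exactly
```

The loop only ever holds one datum plus the (c, M) summary, which is exactly the memory profile that discard-after-learn requires.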

V. PROPOSED NEW LEARNING METHOD USING SCALABLE HYPER-ELLIPSOIDAL FUNCTION
Streaming data flow into the learning process one multi-class chunk at a time. Let the sequence of streaming data chunks be (X^(1), X^(2), ..., X^(t), ...), indexed by time t. Each chunk X^(t) contains data vectors x_i^(t), each with its target class y_i^(t). The capturing process focuses on one datum at a time, with the following main steps.
1) Capturing an incoming data chunk by introducing a new SHEF or by expanding some existing SHEF of the same class k. The criteria for performing each operation depend upon: (1) the minimum distance and median distance among data within each class and (2) an adaptive threshold distance based on the number of SHEFs of class k and the amount of data in each SHEF of class k.
2) Merging two SHEFs of the same class k into one larger SHEF to reduce the number of SHEFs of class k. The merging criterion is based on the degree of overlap between the two nearest SHEFs of class k, measured by Euclidean distance.
Before presenting the learning algorithm based on these two main steps, the computational detail of each step is discussed in the following sections.

A. INITIALIZING SHEF WIDTHS AND THRESHOLD DISTANCE FOR INTRODUCING NEW SHEF
At the start of the learning process, the initial size of the first SHEF for capturing the first data chunk of a class, say class k, must be defined. If the class has only one datum, then the constant ε introduced in Lemma 1 is deployed as the initial width of the SHEF in all dimensions. Otherwise, the width of the SHEF in each dimension is computed as follows. Suppose x_j^(1) is in class k and the amount of data in this class is n_k. Let dist(x_j^(1)) be the Euclidean distance from x_j^(1) to its nearest neighbour of the same class in the first incoming chunk, where j = 1, 2, ..., n_k. The initial width of a SHEF in class k, denoted dist_init_k, is set to the median of these nearest-neighbour distances as defined in (25):

dist_init_k = median_{1 ≤ j ≤ n_k} dist(x_j^(1)).   (25)

Note that this initial width is used for every dimension i of a SHEF in class k. The initial width for any new class appearing after the first chunk is set to √ε. The threshold distance is used to determine whether a new SHEF of the same class should be introduced to capture a new incoming datum. This threshold distance controls the number of SHEFs generated during the learning process. If there are too many SHEFs, then the over-fitting problem occurs and the computational time obviously increases; but if there are too few SHEFs, then queried data may be misclassified. The threshold concerns two factors: the first is the amount of data in each SHEF of the same class, and the second is the number of existing SHEFs of the same class. A merging threshold distance is defined based on these two factors in the following paragraph.
Let M be the predefined minimum amount of data allowed within each SHEF of class k. Suppose there are m_k SHEFs whose amount of data is less than M. The threshold distance of class k, denoted dist_ths_k, is defined by (26). For the first chunk, dist_ths_k is set to dist_init_k; for the other incoming chunks, the threshold distance is set by (26).
The concept behind (26) is that if many inefficiently generated SHEFs exist (each capturing fewer than M data), the threshold distance should be scaled up.
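The initial-width computation described above can be sketched as follows, using a small hypothetical chunk. This follows the nearest-neighbour/median description; the exact definition in (25) may differ.

```python
import numpy as np

def dist_init(Xk):
    """Median nearest-neighbour distance within a class's first chunk.

    Sketch of the initial-width computation described for (25):
    dist(x_j) is the Euclidean distance to x_j's nearest same-class
    neighbour, and the width is the median of those distances.
    """
    # Pairwise Euclidean distance matrix of the chunk.
    D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)         # exclude each point's self-distance
    nn = D.min(axis=1)                  # dist(x_j) for every j
    return float(np.median(nn))

# Hypothetical first chunk of one class: three close points and one outlier.
Xk = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
print(dist_init(Xk))                    # -> 1.0 (median of [1, 1, 1, 6.40...])
```

The median makes the width robust to the outlier: a mean over the same nearest-neighbour distances would be inflated to about 2.35.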

B. CONDITION OF INTERSECTION OF TWO SCALABLE HYPER-ELLIPSOIDS
The structure of the scalable hyper-ellipsoids in this paper is different from the structure studied by Alfano and Greer [29]. Their structure is based on the standard elliptic function, where the right-hand side of the elliptic equation is set to zero, whereas SHEF employs a scaling positive constant r^2 as defined in (16). However, their technique for deriving the intersection condition can be adapted to our scenario. Suppose two SHEFs of the same class, SHEF_α and SHEF_β, intersect. The following theorem states the conditions of intersection of two scalable hyper-ellipsoids.
Theorem 2: SHEF_α and SHEF_β do not overlap each other if all eigenvalues of the matrix P are distinct real numbers and some of them are negative. Otherwise, they are in one of these states: overlap, inside, or touch.
Centroids c_α and c_β are those of SHEF_α and SHEF_β, respectively. S will be derived in Section VI-C.
The proof of Theorem 2 is given in APPENDIX B. If two SHEFs of the same class satisfy the conditions in Theorem 2, then both of them are merged into a larger SHEF_γ and all relevant parameters are updated as follows.
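A merge of two same-class SHEFs can be sketched with standard pooled statistics. This is a generic pooled-mean/covariance merge; the paper's exact update equations (28), (29), and (30) may differ in form.

```python
import numpy as np

def merge_shefs(n_a, c_a, S_a, n_b, c_b, S_b):
    """Merge two same-class SHEFs (n, centroid, covariance) into one.

    Standard pooled statistics; a sketch, not the paper's exact (28)-(30).
    """
    n_g = n_a + n_b
    c_g = (n_a * c_a + n_b * c_b) / n_g             # weighted centroid
    d_a, d_b = c_a - c_g, c_b - c_g
    # Pooled covariance: within-SHEF scatter plus between-centroid scatter.
    S_g = (n_a * (S_a + np.outer(d_a, d_a))
           + n_b * (S_b + np.outer(d_b, d_b))) / n_g
    return n_g, c_g, S_g

def stats(X):
    """Count, centroid, and (population) covariance of a data block."""
    c = X.mean(axis=0)
    return len(X), c, (X - c).T @ (X - c) / len(X)

rng = np.random.default_rng(4)
Xa = rng.normal(size=(40, 2))                       # toy data of SHEF_alpha
Xb = rng.normal(loc=1.0, size=(60, 2))              # toy data of SHEF_beta

n_g, c_g, S_g = merge_shefs(*stats(Xa), *stats(Xb))
_, c_all, S_all = stats(np.vstack([Xa, Xb]))
assert np.allclose(c_g, c_all) and np.allclose(S_g, S_all)
print("merged SHEF matches batch statistics")
```

The merge uses only the two SHEFs' summaries (n, c, S), so it respects discard-after-learn: the captured data themselves are never needed.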
The learning process of SHEF consists of three main procedures. The first procedure (Steps 1-3) initializes the width of the first SHEF based on the condition stated in Section V-A and (25). The second procedure (Steps 6-12) checks the condition for introducing a new SHEF to capture new incoming data, based on the threshold distance discussed in Section V-A and defined in (26). The last procedure (Steps 14-21) merges two SHEFs of the same class according to the overlap constraints in Section V-B and Theorem 2, by using (28), (29), and (30). The details of the learning algorithm are given in Algorithm 1.
An example of how the SHEF learning algorithm (Algorithm 1) works is illustrated in Fig. 3. There are two incoming chunks. The first chunk is shown in Fig. 3a: 11 streaming data belonging to two classes, five circles in class 1 and six squares in class 2. For the first chunk, one datum at a time in each class is captured by a SHEF, as shown in Fig. 3b-Fig. 3i. A dotted-line SHEF denotes the state of a SHEF after capturing the data, while a solid-line SHEF denotes the state of a SHEF whose shape is expanded and rotated to capture new data. An opaque circle and an opaque square denote the data before being discarded from a SHEF. Four SHEFs are used to capture the first chunk. The second chunk is shown in Fig. 3j; it contains only one datum of class 2. Fig. 3j-Fig. 3l illustrate how this datum is captured by SHEF_2. After capturing this datum, SHEF_2 and SHEF_3 are merged and replaced by SHEF_5.

VI. IDENTIFYING CLASSES OF TESTING DATA
Determining the class of a queried datum is also a very essential step in achieving the highest classification accuracy. Generally, the class of a queried datum is decided by finding the cluster having the nearest distance, measured from either the centroid (center) of the cluster or the boundary of the cluster to the queried datum. Although this approach is very practical and rather efficient, the accuracy of classifying a datum depends strongly upon the shape of the data distribution of the cluster. A new improvement of measuring the nearest distance based on local LDA is proposed in this study. Some of the popular distance measures for class identification are briefly summarized as follows. Let x be a queried vector in d-dimensional space, S be the covariance matrix of the considered SHEF in d-dimensional space, and c be the centroid of the SHEF.

1) EUCLIDEAN DISTANCE
d_E(x, c) = ||x − c||.   (31)

2) MAHALANOBIS DISTANCE
d_M(x, c) = sqrt((x − c)^T S^{-1} (x − c)),   (32)

where || · || represents the Euclidean norm, T represents a transpose, and S^{-1} represents the inverse of S.
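The contrast between the two classical measures can be seen in a small sketch (an elongated toy covariance, not from the paper): the Euclidean distance ignores the shape of the distribution, while the Mahalanobis distance discounts displacement along the long axis.

```python
import numpy as np

c = np.array([0.0, 0.0])
S = np.array([[4.0, 0.0], [0.0, 0.25]])   # elongated along the first axis

def d_euclid(x, c):
    """Euclidean distance (31)."""
    return float(np.linalg.norm(x - c))

def d_mahalanobis(x, c, S):
    """Mahalanobis distance (32)."""
    diff = x - c
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

x1 = np.array([2.0, 0.0])   # far in Euclidean terms, but along the long axis
x2 = np.array([0.0, 1.0])   # near in Euclidean terms, but across the short axis
print(d_euclid(x1, c), d_mahalanobis(x1, c, S))  # -> 2.0 1.0
print(d_euclid(x2, c), d_mahalanobis(x2, c, S))  # -> 1.0 2.0
```

The rankings invert: x1 is Euclidean-farther yet Mahalanobis-closer, which is exactly the distribution-awareness that the Projection Ratio below also aims for.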

Algorithm 1 Learning Procedure of SHEF for Current Incoming Data Chunk
Input: (1) a set of N pairs of datum and target X = ((x_1, y_1), ..., (x_N, y_N)) for the incoming data chunk at any time;
(3) a set of SHEFs from the previous learning procedure (if the incoming chunk is not the first chunk);
(4) a constant M denoting the minimum amount of data in any SHEF.
Output: a set of SHEFs and their updated parameters.
1. If the first data chunk then
2. Initialize dist_ths_{y_j} = dist_init_{y_j} using (25) for every class y_j in chunk X.
If there exists a set of SHEFs of class y_i then
6. Let c_j, n_j, and S_j be the centroid, amount of data, and covariance matrix of the captured data of SHEF_j of class y_i, respectively.
7. Let ξ = arg min_j ...
Introduce a new SHEF of class y_i to capture x_i and update dist_ths_{y_i} using (26).
10. else
11. Put x_i in SHEF_ξ and update its parameters by using (22) and (24). Update n_ξ = n_ξ + 1.
Discard pair (x_i, y_i) from the training set.
14. Let SHEF_α be the SHEF capturing x_i.
Deploy the conditions in Theorem 2 to test the overlap of SHEF_α and the nearest SHEF_β of class y_i.
Update parameters of SHEF_γ using (28), (29), and (30).

3) THE VERSATILE ELLIPTIC BASIS FUNCTION VALUE
VEBF [22] and CIL [23] use their shape-function value as a decision function to measure the closeness between a sample point x and a SHEF. The versatile elliptic basis function of the k-th neuron is defined as

ψ_k(x) = Σ_{i=1}^{d} ( (u_i^T (x − c))^2 / a_i^2 ) − 1,   (33)

where {u_1, u_2, ..., u_d} are the eigenvectors of the covariance matrix of the covered data and a_i is the width of each axis of the VEBF.

4) APPROXIMATE BOUNDARY DISTANCE
Measuring the distance with respect to the boundary of a hyper-ellipsoid is rather complex; hence, approximate distances have been proposed. Zimmermann and Svoboda [32] proposed an approximate distance from a sample point x to the nearest boundary of an ellipse. It can also be applied to the high-dimensional case of the ellipse, a hyper-ellipsoid. This distance is measured on the line connecting the sample point and the centroid. The line intersects the boundary of the ellipse at a specific point, and the actual distance is measured from the intersection point to the sample point. Instead of using the ellipse directly, they transform the ellipse into a unit circle by a matrix transformation, which is much simpler, as shown in Fig. 4.

The interesting concept of this strategy is described as follows. First, the original ellipse and the sample point are transformed by the inverse of the matrix L^T into a unit circle. This matrix is obtained by factorizing the covariance matrix representing the ellipse with a Cholesky factorization (S = LL^T). The distance from the sample point to the unit circle can then be easily computed using the Euclidean distance. After that, the distance value is re-transformed to the original shape using the matrix L^T. The approximate boundary distance according to this concept is defined in (34).
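One plausible reading of this transform-and-retransform procedure can be sketched as follows. The paper's exact equation (34) may differ; in particular, the choice of Cholesky factor used for the retransformation here is an assumption.

```python
import numpy as np

def approx_boundary_dist(x, c, S, r2=1.0):
    """Approximate distance from x to the boundary of
    (x - c)^T S^{-1} (x - c) = r2.

    Sketch of the Cholesky/unit-circle idea described above (not the
    paper's exact (34)): map the ellipsoid to a unit sphere, step to
    the sphere's surface radially, then map the gap back.
    """
    L = np.linalg.cholesky(S / r2)          # S / r2 = L L^T (lower-triangular L)
    y = np.linalg.solve(L, x - c)           # sample point in unit-sphere space
    y_boundary = y / np.linalg.norm(y)      # radial intersection with the sphere
    return float(np.linalg.norm(L @ (y - y_boundary)))  # gap mapped back

c = np.zeros(2)
S = np.diag([4.0, 1.0])                     # axis-aligned ellipse, radii 2 and 1
print(approx_boundary_dist(np.array([4.0, 0.0]), c, S))   # -> 2.0 (exact here)
```

Along a principal axis the approximation is exact (here the boundary sits at radius 2, so a point at 4 is distance 2 away); off-axis it only approximates the true nearest-boundary distance, which is the point of the method.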
where (·)^{-1} represents the inverse of a matrix and || · || represents the Euclidean norm. Wattanakitrungroj et al. [26] also proposed a method to compute the distance between the boundary of a full micro-cluster and a data point by solving equations, following the concept in Fig. 5. Although the two methods [26], [32] deploy different definitions of the ellipsoid, they end up with the same distance approximation; however, [26] takes less computation time and fewer calculation steps than [32]. The approximate boundary distance according to [26] is defined in (35).

B. LIMITATION OF EACH DISTANCE METHOD
Usually, the closeness between a point x and the SHEFs can be easily determined in terms of the Euclidean distance, either from x to the centroid or from x to the boundary of the SHEFs. However, this simplicity may lead to a wrong interpretation. For example, suppose there are two SHEFs, SHEF_1 and SHEF_2. Fig. 6 shows an example of the Euclidean distance (31) from a testing datum x to the centroids of SHEF_1 and SHEF_2. The distance from x to the centroid of SHEF_1 is longer than the distance to the centroid of SHEF_2; thus, x would be assigned to SHEF_2 instead of SHEF_1. But in fact, the correct assignment of the testing datum x in this example is SHEF_1, because x is closer to SHEF_1 than to SHEF_2. To achieve the correct identification, the closeness should be measured with respect to the boundaries of SHEF_1 and SHEF_2, as shown in Fig. 7. However, measuring the closest distance from the point x to the boundary of a SHEF in a high-dimensional space is rather complex. Although there exist approximate methods, such as those of Zimmermann and Svoboda [32], see (34), and Wattanakitrungroj et al. [26], see (35), they do not take the distribution of the data into account.
Since the Mahalanobis distance, see (32), measures the distance between x and the considered SHEF under the same distribution, it needs to update a centroid and a covariance matrix with every new point. If data are generated at an extreme rate, calculating all these updates is expensive.
Using the VEBF value to measure the closeness following (33) may also lead to a wrong interpretation, because the size of each hyper-ellipsoid (a_i) is updated without depending on its data distribution.
In this paper, a new similarity measure (distance measure) was designed based on the distribution of data inside each SHEF. The details of our new distance measure are described below.

C. NEW DISTANCE MEASURE OF SHEF PROJECTION WIDTH
A new distance measure, namely the Projection Ratio, is proposed. It relaxes the limitations of each distance mentioned above. The Projection Ratio considers the tradeoff between the distance from the data point to the boundary of a SHEF and the local data distribution of the SHEF. Instead of using the direct distance between a testing datum and the boundary of a SHEF, our closeness distance between any SHEF and the testing datum is defined as the distance between the projected SHEF and the projected testing datum on the linear discriminant analysis (LDA) vector. In this section, the projections of the SHEF boundary and its center onto the LDA vector (discriminant vector) are described as follows.
Let w be the discriminant vector and suppose that w is already known. A considered SHEF is projected onto w as depicted in Fig. 8. Suppose that points A and B are the projected boundary points of the SHEF on w. The centroid c is projected onto w at position c′ using (36).

c′ = w^T c. (36)
Note that ||A − c′|| and ||B − c′|| are equal, and they can be defined as the projected width of the SHEF on the discriminant vector w. Actually, without knowing the locations of points A and B, the projected width can be computed from the square root of the projected covariance matrix of the SHEF. The geometric width along each eigenvector of the SHEF directly correlates with the covariance matrix of the captured data (S) and the positive scaling constant (r): it is equal to r√λ_i, where λ_i is an eigenvalue of S. This width value is the square root of the eigenvalue along the corresponding eigenvector of the covariance matrix of the SHEF. A SHEF is just a hyper-ellipsoidal shape without any data; all captured data of the SHEF are entirely discarded. Thus, the covariance matrix of the SHEF for projection onto w must be derived from S.
Let S̃ be the covariance matrix of the SHEF with eigenvalues λ̃_i = r²λ_i, and let S be the covariance matrix of the captured data vectors with eigenvalues λ_i. Since both S̃ and S are computed from the same data set, their eigenvectors U are the same. From (18),

||A − c′|| = ||B − c′|| = √(w^T S̃ w) = r √(w^T S w) (38)

where w^T S w is a non-negative scalar and ||·|| represents the Euclidean norm. Note that our projection method is similar to Pope's idea [33] but uses a different definition of the hyper-ellipsoid function. Deriving the discriminant vector w in our case also differs from LDA: LDA computes a discriminant vector from two classes of data, but in our case there are only one datum and a SHEF, which is a representation of data. Therefore, the LDA method cannot be directly applied to compute the discriminant vector in our case.
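The relation √(w^T S̃ w) = r √(w^T S w) with S̃ = r²S can be checked numerically with a short sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
S = A.T @ A               # covariance of captured data (3 x 3, positive semi-definite)
r = 1.5
S_tilde = r**2 * S        # SHEF covariance: same eigenvectors, eigenvalues r^2 * lambda_i
w = rng.standard_normal(3)

# projected half-width of the SHEF onto w, computed two ways as in (38)
width_a = np.sqrt(w @ S_tilde @ w)
width_b = r * np.sqrt(w @ S @ w)
assert np.isclose(width_a, width_b)
```

The check works for any direction w, which is why the projected width can be computed from the discarded-data covariance S alone, without storing the boundary points A and B.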
Let S and c be the covariance matrix and the centroid of the captured data of a SHEF, respectively, and suppose x is a single queried datum. In order to find the discriminant vector for both the SHEF and x, the datum x is treated as the centroid of another SHEF. Using (8) under this circumstance, the discriminant vector can be derived as follows. Fig. 9 shows an example of the discriminant vector w_p and the projected width of the SHEF as well as the projected location of the data point x on w_p. Two significant distances are used. The first, |w_p^T(x − c)|, is the projected distance from the projected centroid to the projected location of x. The second, r√(w_p^T S w_p), is the projected width of the SHEF. The Projection Ratio distance from the data point x to the SHEF is defined as follows.

D(x, c, S) = |w_p^T(x − c)| / (r √(w_p^T S w_p))
Based on this Projection Ratio, determining whether a data point x is inside or outside a SHEF can be easily done. The following theorem states the conditions indicating the location of x with respect to a SHEF. Thus, x should be assigned to the class of SHEF 1. b) Otherwise, the Projection Ratio is applied to both SHEFs on a new discriminant vector w defined in (8) instead, as shown in Fig. 10. Denoted by
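Putting the pieces together, the Projection Ratio test can be sketched in a few lines, under the assumption that the discriminant vector is w_p = S^{-1}(x − c) as in (39); the function name is ours:

```python
import numpy as np

def projection_ratio(x, c, S, r=1.5):
    """Projection Ratio D(x, c, S): ratio of the projected offset
    |w_p^T (x - c)| to the projected SHEF width r * sqrt(w_p^T S w_p),
    both measured on the discriminant vector w_p = S^{-1} (x - c)."""
    w = np.linalg.solve(S, x - c)    # discriminant vector w_p
    offset = abs(w @ (x - c))        # projected distance from centroid to x
    width = r * np.sqrt(w @ S @ w)   # projected half-width of the SHEF
    return offset / width

# D < 1: x inside the SHEF scaled by r; D = 1: on the boundary; D > 1: outside
```

Note that only one matrix solve per SHEF is needed per query, which is the source of the speed advantage over exact boundary-distance computations.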

VII. EXPERIMENTAL DATA AND SET-UP
All experiments were tested with two groups of data. The first group has four 2-dimensional synthetic data sets (see the illustrations in the first row of Fig. 1): Sparse-dense data, Faraway4 data, Spiral data, and Gaussian distributed data. The second group contains seven real-world data sets from the University of California at Irvine (UCI) Machine Learning Repository [34], selected by varying the number of features, the number of data, and the number of classes. The Letter, Shuttle_trn, and Kddcup99 data sets were selected to imitate streaming data. For Kddcup99, four symbolic features with more than two categories were removed, so there are 38 features in this data set. The descriptions of all data sets are given in Table 1.
A 5-fold cross validation was used in all experiments for each method. Training data in each data set were divided into several chunks of randomly different sizes in order to simulate the environment of streaming data. These chunks were sequentially fed into the training process. The amount of data in each chunk was uniformly and randomly distributed within a pre-defined range called the chunk-size range. Table 2 summarizes the chunk-size range of each data set and the number of training chunks in each fold. In order to study the effect of the order of the incoming data in each data set, 10 groups of differently shuffled training chunks were created for each fold. In total, 50 experiments were run per data set.
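The chunking procedure described above can be sketched as follows; the function name and the example size range are illustrative, and the actual chunk-size range per data set is the one listed in Table 2:

```python
import numpy as np

def make_chunks(X, y, size_range=(50, 200), seed=0):
    """Split a training set into consecutive chunks whose sizes are drawn
    uniformly at random from size_range, simulating a streaming environment."""
    rng = np.random.default_rng(seed)
    chunks, i, n = [], 0, len(X)
    while i < n:
        k = int(rng.integers(size_range[0], size_range[1] + 1))
        chunks.append((X[i:i + k], y[i:i + k]))   # last chunk may be smaller
        i += k
    return chunks
```

Shuffling the resulting chunk list with different seeds then reproduces the "10 groups of differently shuffled training chunks" set-up.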
Since the experiments are in a streaming environment in the form of sequential data chunks, the testing process in [6], [8], [35] was adopted. After learning one incoming chunk, the classification accuracy is evaluated on a testing set, and then the next chunk is learned. The accuracy after each chunk is called the Chunk-wise Accuracy (CA).
For streaming data evaluation, assume that the training data were divided into T chunks. Let CA_t, for 1 ≤ t ≤ T, be the evaluation accuracy of the (t + 1)-th chunk as a test set after training on the t-th chunk. The following Average Cumulative Accuracy of chunk t (ACA_t) is measured after training the t-th chunk.
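The formula for ACA_t was lost in extraction; the usual reading of "average cumulative accuracy" is assumed here, i.e. the running mean ACA_t = (1/t) Σ_{i=1}^{t} CA_i. A minimal sketch under that assumption:

```python
def average_cumulative_accuracy(cas):
    """Running mean of chunk-wise accuracies: ACA_t = (1/t) * sum_{i<=t} CA_i.
    Assumed reconstruction of (41); the paper's exact formula was lost."""
    acas, total = [], 0.0
    for t, ca in enumerate(cas, start=1):
        total += ca
        acas.append(total / t)
    return acas
```

For example, chunk-wise accuracies [1.0, 0.5, 0.0] yield ACA values [1.0, 0.75, 0.5], smoothing out per-chunk swings so the overall trend is visible.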
The proposed method is also efficiently capable of learning non-streaming data, where there is only one training set and one test set. A 5-fold cross validation was used to evaluate the method. To distinguish the accuracy of non-streaming data classification from the chunk-wise accuracy, the accuracy on non-streaming data is called the Population Accuracy (PA). The experimental results were compared with the results produced by methods designed for learning streaming data with the concept of one-pass learning, namely VEBF [22], LOL [21], and CIL [23], and with the concept of retained learning, namely ILDA [17] and VLLDA [24]. These compared methods have similar and different characteristics, as summarized in Table 3. The meaning of each characteristic is the following. 1) Sequential means learning one incoming datum at a time. 2) Chunk means learning one incoming chunk, which may contain several classes of data, at a time. 3) One-pass means discarding all current training data after they are learned, whether a datum or a chunk of data. 4) Stream means processing without knowing any prior statistical information of the data in advance and without using the data in advance to set any initial parameters. 5) FRM means using a feature reduction method for the classification problem. 6) Local means using only the information from the incoming data distribution to solve nonlinearly separable problems.
The experimental results of the proposed method were compared to those produced by VLLDA [24], ILDA [17], LOL [21], VEBF [22], and CIL [23]. All methods have different parameter settings; those used in this experiment are summarized in Table 4.
For VEBF and CIL, the constant δ was set to scale the initial width of the VEBF shape function calculated from the average distance. Parameter settings for the Segment, Spambase, Waveform, and Letter data sets were taken from [23]. VEBF used the whole training set to calculate its initial average distance, whereas CIL used the first 20% of the total training data. Unfortunately, the number of training data in Shuttle_trn and Kddcup99 is too huge to be processed in a short time. Hence, only the first 10,000 training data of Shuttle_trn and Kddcup99 were used for VEBF, and only the first 10,000 training data of Kddcup99 were used for CIL. For LOL, the authors claimed that LOL is not very sensitive to the parameters and suggested the settings of three parameters: the number of prototypes k was set to 60, the balancing parameter λ was set to 1.0, and the aggressive parameter C was set to 1.0 for all experiments. For VLLDA, the parameter k must be set consistently with the k-nearest neighbour method. For ILDA and VLLDA, which use LDA for classification, the number of selected discriminant vectors was set according to the number of non-zero eigenvalues of S_W^{-1} S_B. For our proposed method, the following hyper-parameters were set without tuning for all experiments: the constant r was set to 1.5, the regularization parameter in Lemma 1 was set to 0.0001, and the minimum number of data M in a SHEF was set to 3.
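The role of the regularization parameter (set to 0.0001 in the experiments) can be illustrated with the common diagonal-loading form; whether Lemma 1 uses exactly this form is an assumption here, and the function name is ours:

```python
import numpy as np

def regularized_inverse(S, eps=1e-4):
    """Invert the SHEF covariance after adding eps to the diagonal
    (diagonal loading). This keeps S well-conditioned when the dimension
    is large or a chunk is small, which is the stated purpose of the
    regularization parameter; the exact form in Lemma 1 may differ."""
    return np.linalg.inv(S + eps * np.eye(S.shape[0]))
```

Without the loading term, a rank-deficient chunk covariance (fewer data than dimensions) would make S singular and the discriminant vector w_p = S^{-1}(x − c) undefined.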

VIII. EXPERIMENTAL RESULTS AND PERFORMANCE COMPARISON
All experiments were conducted on a desktop PC with 8 GB RAM and an Intel Core i7-4770 at 3.4 GHz, using licensed Matlab code. The dimensions of the data sets vary from two to higher dimensions. For two dimensions, Fig. 11 shows the results obtained from the proposed SHEF. Note that the VEBF, CIL, and SHEF approaches are based on the same concept of discard-after-learn. For the other methods using different concepts and for higher-dimensional data sets, the results are compared in Table 5 instead. The independent t-test was adopted to measure the statistically significant difference between the average value of the proposed method and those of the other methods. A value with an asterisk (*) means that there is no statistically significant difference at the 5% significance level (p > 0.05). The experimental results concern the following issues. The average population accuracy and the standard deviation of all methods on each data set are reported in Table 5. The compared methods use different approaches to learning data. The first approach retains all training data throughout the training and testing processes, as implemented in VLLDA and ILDA. The second approach uses the concept of discard-after-learn: each datum is learned in one pass and discarded afterwards, with no need to retain any incoming datum throughout the training and testing processes, as implemented in LOL, VEBF, CIL, and our proposed method. To obtain a clear comparison, the testing was conducted in two categories according to these two approaches.
Without tuning any hyper-parameters, our method SHEF achieved better accuracy with statistical significance in 10 out of 11 data sets compared to the methods using the second approach, and in 4 out of 9 data sets compared to the methods using the first approach. The better accuracy is shown in bold and underlined numbers. Notice that there is a statistically significant difference between the average population accuracy of SHEF and the other methods of the second approach (LOL, VEBF, and CIL) on every data set. LOL has the lowest accuracy on Spiral data, Segment, Spambase, and Waveform. VEBF has the lowest accuracy, with a large standard deviation, on Shuttle_trn. CIL has the lowest accuracy on Satimage, Letter, and Kddcup99. For the Kddcup99 data set, VLLDA and ILDA could not finish the learning process within one hour. For the first approach, VLLDA and ILDA have quite good accuracy on all data sets, except the ILDA method on Spiral data. Fig. 12 shows the chunk-wise accuracy of the six methods after training on each chunk of the seven real-world data sets. One of the 50 experiments was picked to show the accuracy at each chunk of streaming data. Since VLLDA and ILDA used and retained all incoming training chunks (from the first chunk to the current chunk of the stream) to classify the testing data, they were recomputed with the cumulative training data from all previous training chunks. It is remarkable that LOL has a widely swinging range of accuracy on Waveform, Satimage, and Kddcup99. VEBF has a widely swinging range of accuracy on Spambase and Shuttle_trn; there is a sudden change of the accuracy value for VEBF on the Shuttle_trn data set, to be discussed in the next section. CIL has a widely swinging range of accuracy on Waveform and Letter. For Kddcup99, CIL has low accuracy on every chunk of data. The average cumulative accuracy of the results in Fig. 12, calculated using (41), is shown in Fig. 13.
This type of accuracy indicates the trend of accuracy as the result of incrementally learning chunk after chunk. For the proposed method SHEF, the average cumulative accuracy tended to increase on all data sets as the SHEFs were incrementally trained. Table 6 shows the average number of generated neurons (or prototypes). The compared methods can be grouped by the information used to generate the neurons during the learning period. The first group consists of VLLDA and ILDA; both generated neurons based on the number of training data in each data set. The second group consists of LOL, VEBF, CIL, and SHEF. LOL generated neurons based on the number of prototypes (set to 60). VEBF, CIL, and SHEF generated neurons based on a hyper-ellipsoidal shape function capturing the incoming data. At the end of training all incoming chunks, the average number of generated SHEFs is less than or equal to those of the others in 10 out of 11 data sets. There are statistically significant differences between the average number of neurons of our proposed method (SHEF) and those of the other methods on every data set, except for VEBF on Waveform. There are six data sets, i.e. Gaussian data, Segment, Spambase, Waveform, Satimage, and Letter, where our method generated an average number of neurons close to the number of classes. However, there is one data set, Kddcup99, where the number of generated neurons is less than the number of classes (23 classes); the detail will be discussed in the next section. Fig. 14 shows the number of generated neurons for three methods (VEBF, CIL, and the proposed SHEF) during training on each chunk of the seven real-world data sets. VEBF and SHEF merge two overlapped hyper-ellipsoids; therefore, their numbers of neurons during the training process might increase or decrease according to their approaches.
Since CIL does not have any merging strategy, its number of neurons always increases.

B. AVERAGE NUMBER OF GENERATED NEURONS
C. AVERAGE COMPUTATIONAL TIME
Table 7 reports the computational time of the training and testing processes. Training time is the total time spent after training all streaming data chunks on each data set. Testing time is the time spent on the test set of the 5-fold cross validation. There is a statistically significant difference between the average computational time (both training and testing) of the proposed method (SHEF) and the other five methods, except for the testing time of LOL on Sparse-dense data. VLLDA has no training process, so only its testing time is reported. The training time of VEBF, CIL, and SHEF includes the time to calculate the initial distance of hyper-ellipsoids. When the number of data is not huge, the computational times of all methods may not differ much. However, the last three data sets (Letter, Shuttle_trn, and Kddcup99) are quite big, and SHEF is obviously faster than the other methods for both training and testing. For example, SHEF spent about 14 seconds on 395,216 training data and about 94 seconds on 98,804 testing data in the Kddcup99 data set, whereas the other methods spent much longer, especially VLLDA and ILDA, which could not finish the process within one hour; in fact, VLLDA and ILDA spent about four hours and three days on the Kddcup99 data set, respectively. Fig. 15 shows the computational time of four methods (LOL, VEBF, CIL, and SHEF) for learning each incoming chunk of the Kddcup99 data set (263 chunks). Fig. 15a shows the training time of the first ten chunks; for VEBF, CIL, and SHEF, the time to initialize the width of the hyper-ellipsoidal function is included in the training time of the first chunk. Fig. 15b shows the training time of each chunk, and Fig. 15c shows the corresponding cumulative training time.
Cumulative training time is the summation of the training time of each consecutive incoming chunk, starting from the first chunk to the current incoming chunk.

D. EFFECTS OF SHEF HYPER-PARAMETERS ON CLASSIFICATION ACCURACY AND NUMBER OF SHEFs
The values of the two hyper-parameters, r and the regularization parameter in (17), were varied in this section to analyze how they affect the classification accuracy and the number of generated SHEF neurons. The regularization parameter was set to 10^−6, 10^−4, 10^−2, 10^0, and 10^2, and the positive hyper-parameter r was varied from 0.5 to 2.5 with a step size of 0.5. The results are shown in Fig. 16.
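The sensitivity study above amounts to a 5 × 5 grid of 25 settings per data set. As a sketch (the regularization parameter is rendered `eps` here because its symbol was lost in extraction):

```python
# Sensitivity grid: 5 regularization values x 5 scaling constants r,
# exactly the values stated in the text.
eps_values = [1e-6, 1e-4, 1e-2, 1e0, 1e2]
r_values = [0.5, 1.0, 1.5, 2.0, 2.5]
grid = [(eps, r) for eps in eps_values for r in r_values]
```

Each pair in `grid` corresponds to one cell of the accuracy and neuron-count surfaces plotted in Fig. 16.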

IX. DISCUSSION
Definitely, VLLDA and ILDA are not suitable for a streaming scenario because they need to store the entire incoming data for classification. Besides, when the data are big, such as Shuttle_trn and Kddcup99, both VLLDA and ILDA spent too much training time because the time obviously depends on the size of the training data. However, VLLDA and ILDA still hold an advantage in classification accuracy because they employ the concepts of k-NN and LDA. Both VLLDA and ILDA may be chosen as traditional learning methods in a streaming environment. Therefore, comparing the performance of VLLDA and ILDA to that of the LOL, VEBF, CIL, and SHEF approaches is not appropriate due to the different concepts of handling the constraint of space complexity.
ILDA, which is an incremental version of LDA, also has the same limitation as LDA: when data are non-linearly separable, it does not work properly, as seen in the results on Spiral data (accuracy 67.85%) in Table 5. The time complexity of the updating equations for both S_B and S_W affects the training time when the training data are big; hence, the more data are trained, the more time is obviously spent. VLLDA, based on LDA with k-NN, can improve LDA to handle complex distributions of data, as seen in the results on Spiral data in Table 5, but choosing k may affect the accuracy. Furthermore, VLLDA needs to compute Euclidean distances among all training data no matter how k is set. Hence, the more data are tested, the more time is obviously spent.
For LOL, the number of prototypes k set to 60 in all experiments may not be suitable for all data sets in terms of accuracy, as on Spiral data, Spambase, and Letter. That means using only 60 hyperplanes may not be enough to classify those data sets. The results show that LOL is sensitive to the parameter settings of each data set, which is inconsistent with the conclusions in [21]. Moreover, the sequence of the incoming data affects the updated positions of the prototypes and the accuracy, as seen in the results on Sparse-dense data and Faraway4 in Table 5 with a large standard deviation. Its computational time is directly proportional to the number of training data and the number of classes, especially the latter: since LOL adopts the one-vs-all strategy for the multi-class problem, it must iteratively learn one class at a time.
For VEBF, there is a large standard deviation of accuracy on Sparse-dense data, Faraway4, Gaussian data, Spambase, Waveform, Satimage, Letter, and Shuttle_trn. Consequently, the sequence of incoming data affects the accuracy of this method, which is consistent with the conclusion in [23]. On the Shuttle_trn data set, VEBF has an unusually low accuracy of 38.343% with an unusually large standard deviation of 30.677, as shown in Table 5. To explain this situation, the accuracy curve displayed in Fig. 12f should be thoroughly analyzed. After considering all 50 experiments of VEBF on this data set, there are 31 out of 50 whose accuracy immediately decreases at some incoming chunk and never increases again. This problem occurs in the neuron-merging step of the VEBF algorithm when the initial width is set too large. After merging, the width of the new neuron is set by assuming a Gaussian distribution of the data in both neurons, which may be smaller than the widths of the two previous individual neurons. As a result, updating the parameters of a neuron based on the next training data may be dominated by a larger neuron. The training time is directly proportional to the number of neurons generated during the training process, whereas the testing time is directly proportional to the number of neurons in the final state. For example, on the Kddcup99 data set there are about 117 generated neurons in the final state, see Table 6; VEBF took about 515 seconds for 395,216 training data and about 804 seconds for 98,804 testing data, see Table 7. Furthermore, every unnecessarily generated neuron must be stored as a d-by-d covariance matrix.
For CIL, the number of training data in each chunk is inversely proportional to the number of generated neurons. If the number of training data in each chunk is quite small, then neurons may be redundantly generated (there is no merging strategy in this method). The number of generated neurons is directly proportional to the computational time, and a large number of neurons may reduce the classification accuracy. Our findings show that CIL appears quite sensitive to the number of training data (chunk size) in each chunk on some data sets (Satimage and Letter in Table 5). However, it is not actually sensitive to the chunk size; the low accuracy comes from the definition of its decision function for determining the class label of the testing data. This conclusion was confirmed by taking the neurons generated by CIL in the first experiment of fold 1 and applying our proposed testing approach instead: the accuracy increased from 19.61% to 99.712% with 2,292 neurons. For our proposed SHEF, the performance depends on how the SHEFs and the network of SHEFs are constructed to capture the sequential streaming data. According to the 2-dimensional illustration in Fig. 11, it is noticeable that no matter how complicated the patterns of data distribution are, SHEFs can capture them very well. The proposed method provided accuracy higher than 80% on several data sets during the incremental learning process at any time stamp of the streaming data (see Fig. 12). Although there are more steps in our testing process than in those of the other methods, our testing process is rather fast because fewer SHEFs were generated on all experimental data sets. Moreover, our method also spent less training time than the others because the merging step helped reduce the number of generated SHEFs during the training process.
For the Kddcup99 data set, about 16 SHEFs were generated because only 15 out of the 23 classes have more than 20 data. Before testing, any SHEF containing fewer than M (M = 3) members was eliminated; therefore, some classes might be ignored as noisy data. The number of training data in each chunk did not massively affect our accuracy: even though the input data were fed as chunks, our algorithm still sequentially learns each datum. The sequence of incoming data has only a slight effect on the accuracy of SHEF compared with the other methods, as seen from our quite small standard deviations on all data sets in Table 5. Fig. 16 describes the effect of the two hyper-parameters on the accuracy and the number of generated SHEF neurons. The preliminary experiments (see Fig. 16) found that when the regularization parameter and r increased, the average number of generated neurons tended to decrease. These two constants in the learning algorithm are directly proportional to the size of a SHEF, which indirectly affects the merging strategy. With small values of the regularization parameter and r, the size of a SHEF is too small; as a result, two SHEFs of the same class may not be merged, which may lead to many redundant SHEFs being generated. With large values, many SHEFs overlap; as a result, many SHEFs may be unnecessarily merged. Therefore, if only high accuracy is considered, the regularization parameter and r should be chosen in the ranges of 10^−6 to 10^−2 and 0.5 to 1.5, respectively, as in the experiments on the 11 data sets.
Three of the synthetic data sets (all except Gaussian) were generated with a complex distribution. The number of generated neurons affects the classification accuracy; hence, choosing the regularization parameter and r is an important factor. In contrast, the Gaussian data set was generated with a normal distribution (a simple distribution), so the regularization parameter and r may have only a slight effect on the accuracy there, although they still have a major effect on the number of generated neurons and the computational time on Gaussian data. Considering the seven real-world data sets, it was found that when r was fixed, varying the regularization parameter has a slight effect on the accuracy but a clear effect on the number of generated neurons; on the Kddcup99 data set in particular, it has an obvious effect on the number of generated neurons. On the other hand, when the regularization parameter was fixed, varying r affects the accuracy.
The upper bound of the time complexity of the SHEF learning algorithm can be analyzed as follows. Assume that there is only one class in the first chunk of streaming data with N_1 vectors in d dimensions. Initializing the width takes O(dN_1^2). In the learning steps, the time complexity was analyzed for one datum at a time. Assume a datum is in class k and there are h

X. CONCLUSION
This paper contributed the following to cope with learning streaming chunk data and determining the class of queried data, in order to achieve higher classification accuracy with fewer neurons: 1) a generic structure of the Scalable Hyper-Ellipsoidal Function (SHEF) with a regularization parameter and a scaling constant r to solve the problem of the curse of dimensionality; 2) a new recursive function to update the covariance matrix of a SHEF with less computational time than those of the other methods; 3) fast and easy conditions to test the interacting states, including overlap, inside, and touch, of two SHEFs; 4) a new distance measure, named the Projection Ratio, based on the projected distance on an LDA discriminant vector. The proposed concepts achieve higher accuracy, less computational time, and fewer generated neurons. However, this approach has not been tested with more complex data characteristics such as streaming data with dynamic class drift.

Let d be the dimension of the data; S_i^{-1} of size d × d be the scaled covariance matrix S^{-1}/r^2 of the i-th SHEF; R_i of size (n + 1) × (n + 1) be the translation matrix of the centroid of the i-th SHEF; x be any data point; and c_i be the center of data in the i-th SHEF. Then the representation of a SHEF in [29] would be as follows, where S_i^{-1} is the inverse matrix of S_i, 0_{d×1} is a zero vector, 0_{1×d} is the transpose of 0_{d×1}, and I_{d×d} is an identity matrix. From (48), two candidate SHEFs should be checked for overlap if they satisfy the following conditions.
x^T(λA − B)x = 0 (51)

where I is an identity matrix. The term λI − A^{-1}B can be used to determine the states of touch, overlap, and inside. In [29], the intersection of SHEF A and SHEF B can be described with the eigenvalues of the matrix A^{-1}B or B^{-1}A for SHEF A and SHEF B, respectively.
Since this holds for any SHEF i, the matrix A^{-1}B can be simplified to our block matrix form as follows.

APPENDIX C PROOF OF THEOREM 3
Proof: Suppose the projected x is inside the projected SHEF (D(x, c, S) < 1) on the discriminant vector w_p defined in (39). The projection must satisfy the following inequality.
Substituting w_p = S^{-1}(x − c), it is obvious that (59) implies x is inside the SHEF scaled by r in the original space, following (16). The conditions of being outside and on the boundary can be proved by similar arguments with the inequalities |w_p^T(x − c)| > r√(w_p^T S w_p) and |w_p^T(x − c)| = r√(w_p^T S w_p), respectively.
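The substitution step can be spelled out. With $w_p = S^{-1}(x-c)$, $S$ symmetric, and $q = (x-c)^\top S^{-1}(x-c) \ge 0$, both sides of the inequality reduce to the same scalar:

```latex
w_p^\top (x-c) = (x-c)^\top S^{-1}(x-c) = q, \qquad
w_p^\top S\, w_p = (x-c)^\top S^{-1} S\, S^{-1}(x-c) = q .
```

Hence $|w_p^\top(x-c)| < r\sqrt{w_p^\top S w_p}$ becomes $q < r\sqrt{q}$, i.e. $\sqrt{q} < r$, and therefore $(x-c)^\top S^{-1}(x-c) < r^2$, which is exactly the condition for x to lie inside the SHEF scaled by r.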