Ellipsoidal Subspace Support Vector Data Description

In this paper, we propose a novel method for transforming data into a low-dimensional space optimized for one-class classification. The proposed method iteratively transforms data into a new subspace optimized for ellipsoidal encapsulation of target class data. We provide both linear and non-linear formulations for the proposed method. The method takes into account the covariance of the data in the subspace; hence, it yields a more general solution than a data description that encapsulates the target class data in the subspace with a hypersphere. We propose different regularization terms expressing the class variance in the projected space. We compare the results with classic and recently proposed one-class classification methods; the proposed method achieves competitive results and shows clear improvement over the other support vector based methods. The proposed method is also observed to converge much faster than the recently proposed Subspace Support Vector Data Description.


Introduction
The ability of machines to make a concise description of information requires learning from previous experience. Researchers have been trying to develop techniques for accurately modeling data using supervised and unsupervised learning techniques for many decades. In unsupervised learning techniques, patterns are found without any knowledge of class labels [1]. In supervised learning, labeled training data is used to train models for classifying future instances into different categories [2]. A typical multi-class classification task can be decomposed into several binary classification tasks, where the aim is to decide to which of the two considered classes a sample belongs [3]. In binary classification, the data from both classes are used to train a model. One-class classification is conceptually close to binary classification, but the models for classifying future instances are trained using data from the objects of one particular target class. The aim in one-class classification is to build a model for predicting future instances by using data from a single class only [4], [5]. In practice, one-class classification is used when data from one of the classes is scarce.
In one-class classification, the class of interest to be modeled is called the target or positive class, while samples from the other unknown class(es) are referred to as outliers or negative samples. Numerous attempts have been made to solve one-class classification tasks [6]. The three main approaches for solving one-class classification tasks are density based, reconstruction based, and border based methods [7]. In the density based approach, the description of the target class is based on its density [8], which is usually estimated by using popular density estimation methods such as Parzen density, Gaussian model, or mixture of Gaussians [9]. In the reconstruction based approach, some assumptions about the data generating process are made. The underlying function which represents the target class is obtained by fitting a curve over the data by using prior information, such as data clustering characteristics. Self-organizing maps (SOM) [10] and least-squares quantization [11] are classic examples of reconstruction methods. In border based approaches, a model is created by defining a closed boundary around the target class without estimating its density. One-class Support Vector Machine (OC-SVM) [12] and Support Vector Data Description (SVDD) [13] are among the popular boundary techniques for one-class classification. In OC-SVM, a hyperplane separating the target class is constructed so that the distance of the hyperplane from the origin is maximized. In SVDD, a hypersphere is formed around the target class data by minimizing the volume of the hypersphere in a given feature space. SVDD has proven over time to be a powerful data description method and it has been used in many different application domains for solving one-class classification problems. For example, in [14], SVDD is found to be an excellent choice for identifying the freshness of eggs using near infrared spectroscopy (NIR) with an imbalanced number of training samples. In [15], a terrain classification method based on SVDD is proposed for ensuring navigation safety of mobile service robots.
To enhance the performance of SVDD, numerous extensions and hybridization techniques have been proposed [8], [16], [17], [18], [19], [20]. The main extensions of SVDD can be divided into four categories. In the first category, the techniques focus on manipulating the structure of data, such as associating a confidence coefficient with each training instance to deal with the uncertainty of data [21]. In the second category, the performance is enhanced by proposing new non-linear methods and reducing the complexity of algorithms [22], [23]. Techniques for handling non-stationary data in the context of one-class classification fall in the third category [24]. In the fourth category, different changes are proposed in the shape of the boundary encapsulating the target data [25].
A popular alternative to the spherical SVDD is Ellipsoidal SVDD (E-SVDD) [25], [26]. E-SVDD forms a unique hyperellipsoid with a minimum volume covering most of the target data. An ellipsoid, unlike a hypersphere, takes into account the difference in variance for each dimension as well as the covariance between them. A hypersphere, characterized only by a radius and a center, will result in superfluous regions which do not contain any target objects in the input space [27]. Ellipsoids with a minimum volume containing the target data have applications spanning many different fields. For example, in [28], such an ellipsoid is used to detect intrusion in computer networks and, in [29], it is used to estimate the distance between a robot and its surrounding environment for obstacle collision avoidance. An ellipsoid is preferred for heterogeneous data in the input space because its shape is less conservative than a sphere. However, there are some difficulties in kernelizing the algorithms. The kernel trick cannot be applied directly to E-SVDD because its formulation includes outer products rather than inner products [30].
In this paper, we propose a novel subspace learning algorithm for ellipsoidal one-class classification. The proposed method takes into account the covariance of data in the subspace so that the boundary created around the target class is a better fit. The proposed method finds a projection along with the data description iteratively by minimizing the volume of the hyperellipsoid. We also propose a non-linear version of the algorithm by exploiting the non-linear projection trick (NPT) [31]. The proposed method is called Ellipsoidal Subspace Support Vector Data Description (ES-SVDD), since it is analogous to Subspace Support Vector Data Description (S-SVDD) proposed in [32] but offers more flexibility by using hyperellipsoids instead of hyperspheres.
The rest of the paper is organized as follows. In Section 2, we present an overview of related works. In Section 3, a detailed derivation of the newly proposed method is presented. In Section 4, we provide and discuss the experimental protocol along with the obtained results and, finally, conclusions are drawn in Section 5.

Background and related work
One-class classification has been studied extensively in recent years and the approaches predominantly focus on data description in a given feature space [7], [13], [21]. On the other hand, feature selection and subspace learning have been an active research area in machine learning, primarily for challenges with data available for all categories [33], [34]. The aim is to avoid the curse of dimensionality in the original feature space by modeling the given data in a lower dimensional space.
In feature selection methods, a subset of representative features is selected by following some criterion [35], [36], [37]. The two main approaches for feature selection are the filter approach and the wrapper approach. Filter approaches focus on the intrinsic characteristics of the data and do not take into account any classification algorithm. Wrapper approaches, on the other hand, depend on a specific classification algorithm [38].
In subspace learning, the features are transformed from the original feature space to a lower-dimensional subspace [39]. Most of the existing subspace learning methods, particularly for anomaly detection, follow three general steps [40], [41]: First, the features are selected randomly by applying random projections to the attributes. Second, classical algorithms are applied locally in each subspace and scores (e.g., voting) are computed. Finally, all the scores are aggregated to compute a global score for classification.
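The three steps above can be sketched as follows. This is a minimal illustration and not a specific method from [40], [41]: random feature subsets stand in for the random projections, the local score is simply the squared distance to the training mean within each subspace, and the scores are averaged. The function name and its parameters are hypothetical.

```python
import random

def subspace_ensemble_scores(train, test, n_subspaces=10, subspace_dim=2, seed=0):
    """Average outlier score of each test point over random feature subspaces."""
    rng = random.Random(seed)
    n_features = len(train[0])
    totals = [0.0] * len(test)
    for _ in range(n_subspaces):
        # Step 1: pick a random subset of attributes
        # (assumed stand-in for a random projection).
        dims = rng.sample(range(n_features), subspace_dim)
        # Step 2: local score = squared distance to the training mean
        # inside this subspace (assumed stand-in for a classical detector).
        mean = [sum(x[d] for x in train) / len(train) for d in dims]
        for k, x in enumerate(test):
            totals[k] += sum((x[d] - m) ** 2 for d, m in zip(dims, mean))
    # Step 3: aggregate the local scores into one global score per test point.
    return [t / n_subspaces for t in totals]

train = [[0.0, 0.1, -0.1, 0.0], [0.1, 0.0, 0.0, -0.1], [-0.1, -0.1, 0.1, 0.1]]
scores = subspace_ensemble_scores(train, [[0.0, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]])
```

A larger global score indicates a sample that lies far from the training data in most of the sampled subspaces.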
The focus of our paper is to find an optimized subspace for one-class classification. We review the classical one-class classification method, SVDD, in Section 2.1 and also provide an overview of S-SVDD and graph embedded one-class classifiers in Sections 2.2 and 2.3, respectively.

Support Vector Data Description
Let us denote the data points to be enclosed inside a closed boundary by a matrix $X = [x_1, x_2, \dots, x_N]$, $x_i \in \mathbb{R}^D$, where $N$ is the total number of instances and $D$ is the dimensionality of the data in the original feature space. All the data samples represented by $X$ belong to the same class. SVDD finds a spherical boundary around the data by minimizing the volume of a hypersphere enclosing all the target class data:

$$\min_{R,\,a} \; R^2 \quad \text{s.t.} \quad \|x_i - a\|_2^2 \le R^2, \quad i = 1, \dots, N, \tag{1}$$

where $R$ is the radius of the hypersphere and $a \in \mathbb{R}^D$ is the center of the hypersphere in the given feature space. Slack variables $\xi_i$, $i = 1, \dots, N$, are introduced to allow for the possibility of data points being outliers; hence, the optimization problem changes to

$$\min_{R,\,a} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \|x_i - a\|_2^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N, \tag{2}$$

where $C > 0$ is a hyperparameter which controls the trade-off between the volume of the sphere and the amount of data outside the sphere. The Lagrangian dual of Eq. (2) reduces to

$$L = \sum_{i=1}^{N} \alpha_i x_i^\top x_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j x_i^\top x_j, \tag{3}$$

subject to $0 \le \alpha_i \le C$. Maximizing Eq. (3) gives a set of $\alpha_i$ for the corresponding data points.
The samples with $\alpha_i > 0$ are the support vectors defining the data description [13]. The samples corresponding to $0 < \alpha_i < C$ lie on the boundary of the hypersphere and those with $\alpha_i = C$ are outliers.
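As an illustration of how the dual can be maximized, the sketch below uses plain Frank-Wolfe iterations with a linear kernel instead of the quadratic programming solver that a toolbox such as LIBSVM would use. It assumes the standard SVDD dual constraints $\sum_i \alpha_i = 1$ and $0 \le \alpha_i \le C$ (so $N \cdot C \ge 1$ is required) and recovers the center as $a = \sum_i \alpha_i x_i$. The function names are hypothetical, and the result is only approximate after a finite number of iterations.

```python
def svdd_dual_frank_wolfe(X, C=1.0, iters=5000):
    """Approximately maximize sum_i a_i <x_i,x_i> - sum_ij a_i a_j <x_i,x_j>
    subject to sum_i a_i = 1 and 0 <= a_i <= C (requires len(X) * C >= 1)."""
    n = len(X)
    K = [[sum(p * q for p, q in zip(xi, xj)) for xj in X] for xi in X]
    alpha = [1.0 / n] * n  # feasible starting point
    for t in range(iters):
        grad = [K[i][i] - 2.0 * sum(K[i][j] * alpha[j] for j in range(n))
                for i in range(n)]
        # Linear maximization over the capped simplex: give mass C to the
        # largest gradient entries until the total mass reaches 1.
        vertex, budget = [0.0] * n, 1.0
        for i in sorted(range(n), key=lambda i: -grad[i]):
            vertex[i] = min(C, budget)
            budget -= vertex[i]
            if budget <= 0.0:
                break
        step = 2.0 / (t + 2.0)  # standard Frank-Wolfe step size
        alpha = [(1.0 - step) * a + step * v for a, v in zip(alpha, vertex)]
    return alpha

def svdd_center(X, alpha):
    """Center of the hypersphere, a = sum_i alpha_i x_i."""
    return [sum(a * x[d] for a, x in zip(alpha, X)) for d in range(len(X[0]))]

# Four corners of a square plus one interior point.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]]
alpha = svdd_dual_frank_wolfe(X, C=1.0)
center = svdd_center(X, alpha)
```

In this toy example the interior point receives no weight, so only the corner points act as support vectors and the center lies close to the origin.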

Subspace Support Vector Data Description
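In S-SVDD [32], a projection matrix $Q \in \mathbb{R}^{d \times D}$ is determined to map data from the original space $\mathbb{R}^D$ to a new optimized lower dimensional space $\mathbb{R}^d$, $d < D$, so that the data is more suitable for one-class classification:

$$\min_{R,\,a,\,Q} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \|Q x_i - a\|_2^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N, \tag{4}$$

where $a \in \mathbb{R}^d$ is the center of the hypersphere in the lower $d$-dimensional space. The method iteratively solves the SVDD in the current subspace to obtain the data description parameters $\alpha_i$, $i = 1, \dots, N$, and then updates the subspace projection by optimizing an augmented version of the Lagrangian:

$$L = \sum_{i=1}^{N} \alpha_i x_i^\top Q^\top Q x_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j x_i^\top Q^\top Q x_j + \beta \psi, \tag{5}$$

where $\psi$ is a regularization term expressing the class variance in the low dimensional space and $\beta$ is a regularization parameter controlling the importance of $\psi$, with

$$\psi = \mathrm{Tr}(Q X \lambda \lambda^\top X^\top Q^\top), \tag{6}$$

where $\mathrm{Tr}(\cdot)$ is the trace operator and $\lambda \in \mathbb{R}^N$ is a vector controlling the contribution of each training sample. $Q$ is updated by using the gradient of Eq. (5), i.e.,

$$Q \leftarrow Q - \eta \Delta L, \tag{7}$$

where $\eta$ is the learning rate. A non-linear version of S-SVDD employing the kernel trick is also proposed in [32].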

Graph embedded one-class classifiers
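Graph embedded one-class classifiers extend OC-SVM and SVDD by incorporating generic graph structures in their optimization process. The generic graph structures express geometric data relationships of the target class in the data. For example, the Graph Embedded SVDD (GE-SVDD) [16] optimization problem is formulated as

$$\min_{R,\,a} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad (\phi(x_i) - a)^\top S^{-1} (\phi(x_i) - a) \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N, \tag{8}$$

where $\phi(\cdot)$ is any non-linear function used for mapping the training samples from the input feature space to the kernel space. The matrix $S$ contains the geometric data relationships. For example, in PCA, the scatter of training data can be expressed as

$$S_T = \Phi \left( I - \tfrac{1}{N} \mathbf{1} \mathbf{1}^\top \right) \Phi^\top, \tag{9}$$

where $\mathbf{1} \in \mathbb{R}^N$ is a vector containing all values as ones, $I \in \mathbb{R}^{N \times N}$ is an identity matrix, and $\Phi$ is a matrix that contains the training data representations in kernel space. The Lagrangian of GE-SVDD is

$$L = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i - \sum_{i=1}^{N} \alpha_i \left( R^2 + \xi_i - (\phi(x_i) - a)^\top S^{-1} (\phi(x_i) - a) \right). \tag{10}$$

It has been shown in [16] that the optimization problem in Eq. (10) is equivalent to the problem of SVDD in a transformed feature space.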

Ellipsoid Subspace Support Vector Data Description
Our aim is to find a projection matrix $Q \in \mathbb{R}^{d \times D}$ to be used for transforming the data to an optimized subspace suitable for one-class classification. In the following analysis, we assume that the data has been centered by setting $X \leftarrow X - \mu$, where $\mu$ is the mean of the given training data. The data is mapped from the original feature space with dimensionality $D$ to a subspace with dimensionality $d \le D$ so that it is more suitable to be encapsulated inside an ellipsoid with a minimum volume.
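The centering step can be written in a few lines. In this minimal sketch, samples are stored as rows (one list per sample) rather than as columns of $X$, and the function name is hypothetical. Returning $\mu$ so that the same shift can be reused for new samples is an assumption made here, not something stated in the text.

```python
def center_data(X):
    """Subtract the training mean from every sample; also return the mean."""
    n, dim = len(X), len(X[0])
    mu = [sum(x[d] for x in X) / n for d in range(dim)]
    centered = [[x[d] - mu[d] for d in range(dim)] for x in X]
    return centered, mu

Xc, mu = center_data([[1.0, 2.0], [3.0, 6.0]])
```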
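The optimization problem is formulated as

$$\min_{R,\,a,\,Q} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad (Q x_i - a)^\top E^{-1} (Q x_i - a) \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N, \tag{11}$$

where $a$ is the center of the hyperellipsoid and $E = Q X X^\top Q^\top$ is the covariance matrix of the data in the $d$-dimensional space. The inverse of the covariance matrix, $E^{-1} \in \mathbb{R}^{d \times d}$, also known as the concentration or precision matrix, is symmetric and positive definite. By defining a new vector $u = E^{-\frac{1}{2}} a$, Eq. (11) can be written as

$$\min_{R,\,u,\,Q} \; R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \|E^{-\frac{1}{2}} Q x_i - u\|_2^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N. \tag{12}$$

The data in the subspace is represented by

$$y_i = E^{-\frac{1}{2}} Q x_i, \quad i = 1, \dots, N. \tag{13}$$

The constraints in Eq. (12) can be incorporated into its corresponding objective function by using Lagrange multipliers:

$$L = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \gamma_i \xi_i - \sum_{i=1}^{N} \alpha_i \left( R^2 + \xi_i - \|E^{-\frac{1}{2}} Q x_i - u\|_2^2 \right), \tag{14}$$

with Lagrange multipliers $\alpha_i \ge 0$ and $\gamma_i \ge 0$.
By setting the partial derivatives with respect to $R$, $u$ and $\xi_i$ to zero, we get

$$\sum_{i=1}^{N} \alpha_i = 1, \tag{15}$$

$$u = \sum_{i=1}^{N} \alpha_i E^{-\frac{1}{2}} Q x_i, \tag{16}$$

$$C - \alpha_i - \gamma_i = 0. \tag{17}$$

By substituting Eqs. (15)-(17) into Eq. (14) we get

$$L = \sum_{i=1}^{N} \alpha_i (E^{-\frac{1}{2}} Q x_i)^\top E^{-\frac{1}{2}} Q x_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j (E^{-\frac{1}{2}} Q x_i)^\top E^{-\frac{1}{2}} Q x_j. \tag{18}$$

We can use SVDD to solve Eq. (18) for the $\alpha_i$ values. The concentration matrix $E^{-1}$ is equivalent to

$$E^{-1} = (Q X X^\top Q^\top)^{-1}. \tag{19}$$

By putting Eq. (19) in Eq. (18) we get

$$L = \sum_{i=1}^{N} \alpha_i x_i^\top Q^\top (Q X X^\top Q^\top)^{-1} Q x_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j x_i^\top Q^\top (Q X X^\top Q^\top)^{-1} Q x_j. \tag{20}$$

We add an extra term $\Upsilon$ to Eq. (20) as a regularization term expressing the class variance in the projected space, also taking into account the concentration matrix. Hence, Eq. (20) now becomes

$$L = \sum_{i=1}^{N} \alpha_i x_i^\top Q^\top (Q X X^\top Q^\top)^{-1} Q x_i - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j x_i^\top Q^\top (Q X X^\top Q^\top)^{-1} Q x_j + \beta \Upsilon, \tag{21}$$

where $\beta$ controls the importance of the regularization term and is used as a hyperparameter. $\Upsilon$ is defined as follows:

$$\Upsilon = \mathrm{Tr}(E^{-\frac{1}{2}} Q X \lambda \lambda^\top X^\top Q^\top E^{-\frac{1}{2}}), \tag{22}$$

where $\lambda$ can take three different forms. In the first form, all elements in $\lambda$ take the value of 1 and, hence, all the samples are used to describe the covariance of the class. In the second form, $\lambda$ is replaced by $\alpha$, which means that the samples belonging to the boundary and outside the boundary are used to describe the covariance of the class. In the third form, the $\lambda_i$ values are replaced by the $\alpha_i$ values of the samples belonging to the boundary and by zero for the other instances. The first, second and third forms of the regularization term are denoted by $\Upsilon_1$, $\Upsilon_2$, and $\Upsilon_3$ hereinafter.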
In our experiments, we also consider the regularization term expressing the class variance in the projected space without taking into account the concentration matrix. This is achieved by replacing the covariance matrix $E$ with the identity matrix $I$ in Eq. (22). By doing so, the regularization term $\Upsilon$ becomes equivalent to $\psi$ as described in Eq. (6). Analogous to the regularization term $\Upsilon$, $\psi$ can also take different forms by changing $\lambda$, and hereinafter we refer to those cases as $\psi_1$, $\psi_2$ and $\psi_3$. The methods used with $\psi$ and $\Upsilon$ are denoted by ES-SVDD$\psi_m$ and ES-SVDD$\Upsilon_m$ ($m = 1, 2, 3$), respectively. We refer to the case where no regularization term is used in ES-SVDD as ES-SVDD$\psi_0\Upsilon_0$.
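The three forms of $\lambda$ can be made concrete with a small helper. This is a minimal sketch: the function name and the numerical tolerance `tol` used to decide whether a sample lies on the boundary ($0 < \alpha_i < C$) are assumptions made for illustration.

```python
def regularization_weights(alpha, C, form, tol=1e-8):
    """Build the vector lambda for the regularization term (forms 1, 2, 3)."""
    if form == 1:
        # All samples describe the class covariance.
        return [1.0] * len(alpha)
    if form == 2:
        # lambda = alpha: boundary samples and outliers (alpha_i > 0) contribute.
        return list(alpha)
    if form == 3:
        # Only boundary samples (0 < alpha_i < C) keep their alpha_i value.
        return [a if tol < a < C - tol else 0.0 for a in alpha]
    raise ValueError("form must be 1, 2, or 3")

alpha = [0.0, 0.25, 0.5, 0.25]  # toy dual solution with C = 0.5
lam1 = regularization_weights(alpha, 0.5, 1)
lam2 = regularization_weights(alpha, 0.5, 2)
lam3 = regularization_weights(alpha, 0.5, 3)
```

Here the third sample has $\alpha_i = C$ and is therefore treated as an outlier: it contributes in the second form but not in the third.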
Eq. (21) can be further simplified and written as

$$L = \mathrm{Tr}\left( (Q X X^\top Q^\top)^{-1} Q X A X^\top Q^\top \right) - \alpha^\top X^\top Q^\top (Q X X^\top Q^\top)^{-1} Q X \alpha + \beta \Upsilon, \tag{23}$$

where $A$ is a diagonal matrix having the $\alpha_i$ values on its diagonal and $\alpha$ is a vector of the $\alpha_i$'s. We use the gradient of Eq. (23) to update the projection matrix. The gradient, given in Eq. (24), can be obtained using identity 126 in [42] and includes the gradient of the regularization term, $\Delta\Upsilon$. When $\psi$ is used as the regularization term, $\Delta\psi$ is used instead of $\Delta\Upsilon$ in Eq. (24), which gives Eq. (25). We obtain an optimized data projection matrix along with an optimized data description in a two-step iterative process. In the first step, the $\alpha_i$ values are computed by maximizing Eq. (18). In the second step, $Q$ is updated through gradient descent after computing the gradient of Eq. (23). In order to obtain an orthogonal projection, we impose the orthogonality constraint $Q Q^\top = I$. We orthogonalize and normalize $Q$ during the two-step iterative process. Algorithm 1 presents the whole algorithm.
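The orthogonalization and normalization of $Q$ can be illustrated with a Gram-Schmidt pass over its rows. The text does not state which orthogonalization procedure is used, so Gram-Schmidt is an assumption here (a QR decomposition would serve the same purpose), and the function name is hypothetical. The rows of $Q$ are assumed to be linearly independent.

```python
import math

def orthonormalize_rows(Q):
    """Gram-Schmidt on the rows of Q so that Q Q^T = I afterwards."""
    basis = []
    for row in Q:
        v = list(row)
        for b in basis:
            # Remove the component of v along the already accepted row b.
            proj = sum(p * q for p, q in zip(v, b))
            v = [p - proj * q for p, q in zip(v, b)]
        norm = math.sqrt(sum(p * p for p in v))
        basis.append([p / norm for p in v])
    return basis

Q = orthonormalize_rows([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
# Gram matrix Q Q^T, expected to be close to the 2 x 2 identity.
gram = [[sum(p * q for p, q in zip(r1, r2)) for r2 in Q] for r1 in Q]
```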

Non-linear Ellipsoidal Subspace Support Vector Data Description
The non-linear ellipsoidal subspace SVDD is not trivial, because the kernel trick cannot be applied directly due to the outer products involved in its derivation. To avoid this problem, we follow the NPT based solution [31] described below. We first compute a non-centered kernel matrix $K = \Phi^\top \Phi$ using the radial basis function kernel as

$$K_{ij} = \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right),$$

where $\sigma$ is a hyperparameter scaling the distance between $x_i$ and $x_j$. The kernel matrix is centered as

$$\hat{K} = (I - J) K (I - J),$$

where $J \in \mathbb{R}^{N \times N}$ is a matrix defined as

$$J = \tfrac{1}{N} \mathbf{1} \mathbf{1}^\top.$$

The centered kernel matrix is decomposed by using eigendecomposition:

$$\hat{K} = U A U^\top,$$

where $A$ contains the non-negative eigenvalues of the centered kernel matrix on its diagonal and the columns of $U$ contain the corresponding eigenvectors. Finally, the data in the reduced dimensional kernel space is obtained as

$$\Phi = (A^{\frac{1}{2}})^{+} U^{+} \hat{K},$$

where the $+$ sign in the superscript denotes the pseudo-inverse.
After applying NPT, we continue by considering $\Phi$ as our input data. This allows us to use the linear ES-SVDD formulation to obtain a non-linear transformation.
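The first two NPT steps, computing the kernel matrix and centering it, can be sketched without any numerical library. The sketch assumes the common RBF form $\exp(-\|x_i - x_j\|_2^2 / (2\sigma^2))$ and the centering matrix $J = \frac{1}{N}\mathbf{1}\mathbf{1}^\top$. The eigendecomposition step is omitted because it would normally be delegated to a linear algebra package. The function names are hypothetical.

```python
import math

def rbf_kernel_matrix(X, sigma):
    """K[i][j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) (assumed RBF form)."""
    return [[math.exp(-sum((p - q) ** 2 for p, q in zip(xi, xj)) / (2.0 * sigma ** 2))
             for xj in X] for xi in X]

def center_kernel(K):
    """Return (I - J) K (I - J) with J = (1/N) 1 1^T, written element-wise."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    # (I - J) K (I - J) = K - K J - J K + J K J
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
            for i in range(n)]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = rbf_kernel_matrix(X, sigma=1.0)
K_centered = center_kernel(K)
```

After centering, every row and column of the kernel matrix sums to zero, which corresponds to zero-mean data in the kernel space.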

Test phase
During the test phase of the linear case, a test instance $x_*$ is first mapped to the optimized lower $d$-dimensional space as

$$y_* = E^{-\frac{1}{2}} Q x_*.$$

The decision to classify the instance as target or outlier is taken on the basis of its distance from the center of the data description in the $d$-dimensional space. The distance is calculated as follows:

$$\|y_* - u\|_2^2 = (E^{-\frac{1}{2}} Q x_* - u)^\top (E^{-\frac{1}{2}} Q x_* - u),$$

where $u$ is obtained from Eq. (16). If $\|y_* - u\|_2^2 \le R^2$, the test instance is classified as positive, as it falls inside the boundary of the data description. The test instance is classified as negative if $\|y_* - u\|_2^2 > R^2$. The threshold $R^2$ for taking the decision is calculated as follows:

$$R^2 = \|E^{-\frac{1}{2}} Q s - u\|_2^2,$$

where $s$ is any support vector with $0 < \alpha_i < C$.
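The decision rule can be sketched as follows, assuming the test instance and one boundary support vector have already been mapped to the $d$-dimensional subspace. The function names and the toy values are hypothetical.

```python
def squared_distance(y, u):
    """Squared Euclidean distance between a mapped sample y and the center u."""
    return sum((p - q) ** 2 for p, q in zip(y, u))

def radius_squared(y_support, u):
    """R^2 taken from any boundary support vector mapped to the subspace."""
    return squared_distance(y_support, u)

def classify(y_star, u, r2):
    """Return 'target' if y_star lies inside or on the boundary, else 'outlier'."""
    return "target" if squared_distance(y_star, u) <= r2 else "outlier"

u = [0.0, 0.0]                      # center of the data description
r2 = radius_squared([3.0, 4.0], u)  # boundary support vector in the subspace
inside = classify([1.0, 1.0], u, r2)
outside = classify([5.0, 5.0], u, r2)
on_boundary = classify([3.0, 4.0], u, r2)
```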
During the test phase for non-linear ES-SVDD, we use NPT by first computing the kernel vector between the test instance $x_*$ and the training samples. The kernel vector is then centered, and the centered kernel vector is mapped to its representation $\phi_*$ in the reduced dimensional kernel space. We now consider $\phi_*$ as the test input $x_*$ and follow all the steps described for the linear test.

Datasets and experimental setup
We evaluated the proposed and competing methods on different datasets downloaded from the UCI machine learning repository [43]. The datasets were converted to one-class datasets by considering a single class in the dataset at a time as the target class and all other classes as outliers. The total number of samples, number of target class samples, and number of dimensions in the original feature space are given in Table 1. In each dataset, 70% of the data was used for training and the remaining 30% for testing. The train and test sets were selected randomly by keeping the proportions of classes similar to the full dataset. Each experiment was repeated five times using different random train/test splits, while the same five splits were used for all the considered methods. We report the average test performance over the five splits. During training, 5-fold cross-validation was used to select the hyperparameters with the best evaluation score. We used only the training sets for selecting the hyperparameters. We used the geometric mean (Gmean) as the evaluation metric for all the methods. Gmean is defined as

$$\mathrm{Gmean} = \sqrt{tpr \times tnr},$$

where $tpr$ is the true positive rate (also known as sensitivity) and $tnr$ is the true negative rate (also known as specificity). For the proposed ES-SVDD method, the hyperparameters were chosen from a predefined set of candidate values. For all the competing methods, the hyperparameters corresponding to the ES-SVDD hyperparameters were selected from the same values. For other hyperparameters, the same ranges were used as provided in the corresponding work. We used the full training set with the optimal hyperparameters for the final training and then tested with the test set.
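The conversion of a multi-class dataset into a one-class task and the Gmean metric can be sketched as follows. The encoding of target samples as `1` and outliers as `-1`, as well as the function names, are assumptions made for illustration.

```python
import math

def to_one_class(labels, target):
    """Relabel a multi-class dataset: target class -> 1, all other classes -> -1."""
    return [1 if y == target else -1 for y in labels]

def gmean(y_true, y_pred):
    """Geometric mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)

y_true = to_one_class(["a", "a", "b", "c"], target="a")
score = gmean(y_true, [1, -1, -1, -1])  # tpr = 0.5, tnr = 1.0
```

Because Gmean multiplies the two rates, a classifier that labels everything as target (or everything as outlier) scores zero, which makes the metric suitable for imbalanced one-class test sets.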
We compared the proposed ES-SVDD with S-SVDD [32], OC-SVM [12], SVDD [13], and E-SVDD. For non-linear methods, we employed NPT for ES-SVDD and S-SVDD, and the kernel trick for the other methods. For S-SVDD, different regularization terms were proposed and, hence, we compared with all the previously proposed variants of S-SVDD. We used the LIBSVM [44] toolbox implementation for OC-SVM and SVDD. The proposed ES-SVDD along with S-SVDD and E-SVDD were implemented by the authors in Matlab by leveraging LIBSVM.

Experimental results and discussion
In Tables 2 and 3, we report the average test results for each dataset for the linear and non-linear cases, respectively. In each experiment, a single class was selected as the target class and the rest of the data as outliers (see Table 1). We also report the average performance of the proposed and competing methods in the average (Av.) column by averaging the results for a given dataset. For example, the performance over S-K, S-R, and S-C is averaged and provided in the Av. column as the overall performance for the Seeds dataset. In this way, we can get an idea of the overall performance of each algorithm over the full dataset. For ES-SVDD and S-SVDD, we report the test results after 10 training iterations.
In the linear case, our proposed methods achieved the best average results on all but the Iris dataset and, in the non-linear case, the proposed methods achieved the best averages on half of the datasets. We note that the average results for the non-linear methods are generally better than those of the linear ones in the majority of the datasets. Overall, the proposed (linear and non-linear) methods [...] For non-linear ES-SVDD, the regularization terms $\psi_1$ and $\Upsilon_3$ gave the best results for most of the datasets. However, $\psi_1$ was also observed to perform worse than the others in a few datasets. $\psi_1$ uses all target training samples in describing the covariance of the class without taking into account the concentration matrix. In $\Upsilon_3$, the $\lambda_i$ values take the $\alpha_i$ values of the support vectors and zero for non-support vectors. In the non-linear case for high dimensional datasets, we notice that using all the training data for describing the covariance of the data in a projected space, with or without using the concentration matrix (i.e., $\psi_1$ or $\Upsilon_1$), yielded the best results for ES-SVDD. [...] more robust than the other regularization terms. Hence, we recommend using samples belonging to the boundary and outside the boundary to describe the covariance of the class while taking into account the concentration matrix.
We also show the performance of the proposed ES-SVDD and the recently proposed S-SVDD on the test set after every training iteration for the linear and non-linear cases. We compare the performances of these methods with different regularization terms $\Upsilon$ and $\psi$. The average Gmean value is calculated for each iteration over the 5 test splits for the different datasets; see Figures 1-2 and the Appendix. It can clearly be seen from the figures that, for both the linear and non-linear methods, ES-SVDD achieves its best performance much earlier than its recently proposed counterpart S-SVDD. This is not surprising, because the ellipsoidal description can fit a larger variety of data distributions, while the optimal spherical description becomes successful only after the data variance for different dimensions has been equalized. Using the ellipsoidal data description in the proposed method makes it converge faster to an optimal solution. We also notice that for high dimensional datasets ES-SVDD$\psi_1$ and ES-SVDD$\Upsilon_1$ are more stable than the other proposed linear and non-linear methods. Overall, the trend of faster convergence and higher stability, in terms of producing consistent results over different ranges of iterations, can be observed for ES-SVDD in both the linear and non-linear methods for all regularization terms in the majority of the cases.

Conclusion
In this paper, a novel method, ES-SVDD, for one-class classification is proposed. The proposed method projects the data from an input feature space to a new optimized subspace suitable for one-class classification. The proposed method generalizes the hypersphere-based S-SVDD by using an ellipsoidal data description. We proposed different regularization terms along with linear and non-linear formulations of the method. In most cases, the proposed ES-SVDD variants outperform the competing methods and converge faster than data description without ellipsoidal encapsulation.
In the future, we intend to use other kernel types in the non-linear case of ES-SVDD. We also plan to devise a strategy for early exit in the training process to reduce the training time. We will also experiment with fine-tuning hyperparameters according to different criteria, such as the area under the receiver operating characteristic curve.

Figure 1: Comparison of different regularization terms for ES-SVDD and S-SVDD on dataset S-K

Figure 2: Comparison of different regularization terms for ES-SVDD and S-SVDD on dataset QB-B

This work was supported by the NSF-Business Finland CVDI project Amalia and the Business Finland projects VIRPA D and INDEX (DIMECC Industrial Data program).

Figure 5: Comparison of different regularization terms for ES-SVDD and S-SVDD on dataset IS-B

Table 1: Datasets used in the experiments