Introduction
Outlier detection is a classic problem in data mining with many applications, such as network intrusion detection, environmental monitoring, and fraud detection. Most studies on outlier detection adopt unsupervised methods [1], primarily because the insufficient sampling of the outlier class makes determining the classification hyperplane between normal data and outliers infeasible. Hence, supervised methods are set aside in favor of unsupervised ones, which are generally based on specific assumptions, such as the low-density or cluster assumptions. The low-density assumption holds that outliers are always located in low-density areas of the input space, and the cluster assumption typically holds that normal data form clusters and non-cluster data tend to be outliers. However, these preset assumptions do not always agree with the semantic notion of an outlier. When the assumptions do not match the data, a high false positive rate results. Another disadvantage of unsupervised outlier detection methods is that they cannot make use of labeled data. In real-world applications of outlier detection, a few labeled data can usually still be obtained. For example, a human analyst can label part of the data when examining the results of an unsupervised outlier detection method. Thus, a mechanism that incorporates label information is necessary when applying unsupervised methods.
When a few labeled data are provided, the problem becomes a semi-supervised classification problem. However, traditional semi-supervised classification methods suffer from the same problem as supervised methods when applied to outlier detection. Thus, they cannot be used directly; moreover, it has been demonstrated that semi-supervised outlier detection should be based on an unsupervised outlier detection method [2]. Several attempts have been made to address this problem, and most of them can be regarded as extensions of unsupervised outlier detection methods. In [2], the authors added constraints from labeled data to the one-class support vector machine (OC-SVM) [3], which is a classic unsupervised outlier detection method. Their approach has limited performance and high computational complexity and memory requirements. In [4], the authors proposed optimizing the weights of an ensemble-based outlier detection method to fit labeled data. They restricted the final result to be a convex combination of base outlier detectors. Although this constraint preserves the ability of unsupervised methods to detect novel outliers, it also considerably limits the capacity of the model. In [5], the authors replaced the convex combination with a general linear combination to increase model capacity. However, a general linear combination allows negative weights, which can deteriorate the performance of the combined detector. In [6], the authors alleviated this problem by using much more fine-grained base detectors. Although fine-grained base detectors can reduce the negative impact of negative weights, the problem still exists in some cases.
Motivated by the use of fine-grained base detectors in [6], we propose our semi-supervised outlier detection method. We push the granularity to the extreme: every point has a free parameter (i.e., a label value). However, this setting has too much freedom. To obtain a meaningful result, we add two regularizers: the graph structure and the unsupervised outlier score. The graph is a directed nearest neighbor graph. For the unsupervised method, we choose Isolation Forest (iForest) [7] because of its excellent performance and low overhead; other unsupervised methods can also be used. The algorithmic flow of our method is very simple owing to the quadratic form of the loss function. Our method first obtains the outlier score from iForest and creates a nearest neighbor graph. Then, it fuses the labels and the outlier score. Subsequently, the outlier score and labels are spread over the graph using an iterative method based on the label spreading algorithm [8].
Active learning is another learning task with a small proportion of labeled data and is closely related to semi-supervised learning. Labeled data are passively provided in semi-supervised learning, whereas they are actively acquired through iterative queries in active learning. Hence, the data query strategy is the key to active learning. General active learning research focuses on developing advanced data query strategies [9]. Unlike general active learning, the data query strategy in active outlier detection usually uses top selection, i.e., always choosing the top-ranked samples to obtain labels from a human analyst/expert [5], [6], [10]. In [10], the authors demonstrated that top selection is an approximation of the uncertainty sampling strategy that is widely used in general active learning. Hence, the key to active outlier detection also lies in how to perform semi-supervised outlier detection. By incorporating the top selection query strategy, our semi-supervised method can be easily extended to active outlier detection.
We conduct comprehensive experiments for semi-supervised and active outlier detection on 12 public and real-world datasets. The results show that our semi-supervised outlier detection method is comparable with the best of state-of-the-art approaches, and our active outlier detection method outperforms state-of-the-art methods.
We summarize our major contributions as follows.
We propose a method for semi-supervised outlier detection based on an unsupervised outlier detection method and a graph-based semi-supervised learning method. Our major contribution is customizing the graph-based semi-supervised learning method for outlier detection.
We propose an active outlier detection method based on our semi-supervised outlier detection method.
We conduct extensive experiments that demonstrate the effectiveness of our semi-supervised and active outlier detection methods.
The remainder of the paper is organized as follows. We review related work in Section II. We introduce the problem settings and the preliminaries of this study in Section III. We describe the proposed methods in Section IV. We empirically evaluate the methods in Section V. Finally, we conclude this study in Section VI.
Related Work
A. Unsupervised Outlier Detection
A considerable amount of literature on unsupervised outlier detection has been published [1], [11]. Unsupervised outlier detection methods are typically based on certain assumptions. The most widely used assumption is the low-density assumption, which holds that outliers reside in low-density areas of the input space. Methods based on this assumption frequently involve explicit or implicit density estimation. Although explicit density estimation methods, such as statistical model-based methods and kernel density estimation, can be used directly, they are not optimized for outlier detection, which is concerned only with the low-density area. Hence, many outlier detection methods estimate density implicitly. Nearest neighbor-based methods [12] can be interpreted as methods that use nearest neighbors to approximate density. To deal with the case in which normal data instances have different densities, the local outlier factor (LOF) [13] was proposed. LOF alleviates this problem by comparing the density of a point with the densities of its neighbors. iForest [7] uses the tree depth of a random space partition as the measure of outlier degree. The tree depth is monotonic with respect to density. OC-SVM [3] constructs a hyperplane to separate most data instances from a few potential outliers; however, it is essentially an asymptotically consistent density level set estimator when using Gaussian kernels [14]. Another commonly used assumption is the cluster assumption, which holds that normal data form clusters and that data points that are far from clusters or that belong to small clusters can be considered outliers [15]. As outliers are only by-products of clustering methods and clustering methods are not optimized for outlier detection, the efficiency and effectiveness of clustering-based methods are relatively low.
Recent studies [16]–[18] have attempted to combine deep learning and traditional outlier detection. However, these methods are also unsupervised, and thus they cannot be used directly in semi-supervised outlier detection.
B. Semi-Supervised and Active Outlier Detection
Compared with those on unsupervised outlier detection, studies on semi-supervised outlier detection are relatively few. In [19], the authors proposed a mixture model-based method. However, this method is limited because it requires assuming the distribution of normal data and outliers, which is difficult to determine. In [3], the authors provided an extension of OC-SVM to consider labeled data. However, this method can only use labeled outliers, while labeled normal data are neglected. In [2], the authors also proposed a method based on OC-SVM that can consider both labeled normal data and outliers. Similar to OC-SVM, the two previous OC-SVM-based methods suffer from high computational and memory complexity. In [20], the authors proposed a cluster-based semi-supervised outlier detection method for water analytics.
Another category of semi-supervised outlier detection methods is based on ensemble outlier detection [4]–[6]. These methods first construct a number of base outlier detectors and then adjust the combination weights of the base detectors to fit the available labels. In [4], the authors constrained the weights to be a convex combination, which has a limited model capacity. In [5], the authors used the lightweight on-line detector of anomalies (LODA) [21] as the base detector and relaxed the convex combination to a general linear combination. Although such relaxation can increase model capacity, it may deteriorate the performance of the combined detector if there are too many negative weights. Hence, this method requires careful parameter settings to avoid this issue. In [6], the authors alleviated this problem by using fine-grained base detectors; the number of detectors was also increased considerably. A fine-grained detector can reduce the negative influence caused by negative weights. In [22], the authors adopted the same approach as that in [6]; however, they also proposed an online optimization algorithm to accelerate computation. Our method is also fine-grained: every point can be considered a base detector. Moreover, our method can spread the label value of each point to its nearest neighbors.
For data query strategies in active outlier detection, recent studies have shown that the top selection strategy is effective [5], [6], [10]. This strategy greedily selects the most abnormal data, i.e., data ranked at top according to an outlier score. In [2], the authors proposed another strategy called margin and cluster, which has been demonstrated to be inferior to the top selection strategy [5]. In our active outlier detection method, we also adopt the top selection strategy.
C. Semi-Supervised Learning
Semi-supervised learning is a well-studied problem [23], [24]. The graph-based method is a classic approach to semi-supervised learning. Graph-based methods first define a graph wherein the nodes are data points and edges (generally weighted) represent similarities between data points. Then, labels are assigned to unlabeled data points using the graph and available labeled data points. Graph-based methods can be broadly categorized as random walk-based methods [25]–[27] and iterative methods [8], [28], [29]. These methods can also be interpreted as the minimization of a quadratic cost function defined by a graph and labeled data points [24].
Besides graph-based methods, other semi-supervised learning methods include generative models, low-density separation methods, and other heuristic approaches. Recently, the development of deep generative models has facilitated their use in semi-supervised learning. For example, in [30] and [31], the authors use variational autoencoders, and in [32] and [33], the authors use generative adversarial nets. However, general semi-supervised methods typically assume that labeled data are i.i.d. with unlabeled data and that the data are class balanced. Thus, these methods cannot be used directly in semi-supervised outlier detection, which involves novel types of outliers and is inherently class imbalanced. Few-shot learning [34], [35] is also a learning task with few labeled data. However, in the problem settings of few-shot learning, only the target classes are provided with a few labeled samples, while the other classes are provided with a large number of labeled data. Hence, it differs from our problem settings, and these methods also cannot be used directly in semi-supervised outlier detection.
Problem Settings and Preliminaries
In this section, we first introduce the problem settings for this study. Then, we provide a brief introduction to the preliminaries of our methods.
A. Problem Settings
Let
B. Preliminaries
1) Label Spreading
Label spreading [8] is a graph-based semi-supervised learning method for multi-class classification. The basic idea of this algorithm is to iteratively spread each point’s label information to its neighbors until a global stable state is achieved [8]. The iteration algorithm of label spreading is illustrated in Algorithm 1.
Algorithm 1 Label Spreading [8]
Compute the affinity matrix
Compute the diagonal degree matrix
Compute the spreading matrix
Iterate
For the undirected graph, i.e., \begin{align*} \mathcal {L}(\boldsymbol {f})=&\frac {1}{2} \sum _{i, j} w_{ij} \left({\frac {f_{i}}{\sqrt {d_{i}}} - \frac {f_{j}}{\sqrt {d_{j}}} }\right)^{2} + \mu || \boldsymbol {f} - \boldsymbol {f}^{(0)} ||^{2} \tag{1}\\=&\boldsymbol {f} ^{T} \mathbf {L} \boldsymbol {f} + \mu || \boldsymbol {f} - \boldsymbol {f}^{(0)}||^{2},\tag{2}\end{align*}
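The iteration and its cost function can be made concrete with a small sketch. The code below is an illustrative scalar version of label spreading, not the authors' implementation: it assumes a fully connected Gaussian affinity graph with a zero diagonal, a fixed `sigma`, and the standard update f ← αSf + (1 − α)f⁽⁰⁾ from [8], where α = 1/(1 + μ) follows from setting the gradient of Equation (2) to zero. The function name `label_spreading` and the toy points are hypothetical.

```python
import math


def label_spreading(points, f0, sigma=1.0, alpha=0.9, iters=300):
    """Scalar label spreading on a fully connected Gaussian graph (sketch)."""
    n = len(points)
    # Affinity matrix W with a zero diagonal.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-dist2 / (2.0 * sigma ** 2))
    # Degrees d_i and spreading matrix S = D^{-1/2} W D^{-1/2}.
    d = [sum(row) for row in w]
    s = [[w[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Fixed-point iteration; it converges because alpha < 1.
    f = list(f0)
    for _ in range(iters):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1.0 - alpha) * f0[i]
             for i in range(n)]
    return f, s


# Two well-separated groups; one point in each group carries an initial label.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
initial = [1.0, 0.0, 0.0, -1.0, 0.0]
values, _ = label_spreading(points, initial)
print([round(v, 3) for v in values])
```

In this toy setting the unlabeled points take the sign of the labeled point in their own group, which is the smoothness behavior that the first term of Equation (1) rewards.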
2) iForest
iForest [7] is a state-of-the-art outlier detection method that exhibits high performance in outlier detection and is parameter-free. iForest consists of a number of trees. Each tree is constructed by repeatedly selecting a feature and a threshold uniformly at random to partition the data until only one data point is left (isolation). The path length from the root to the leaf, $h(\mathbf{x})$, determines the outlier score: \begin{equation*} f_{s}(\mathbf {x}) = 2^{-E[h(\mathbf {x})]/c(n)}, \tag{3}\end{equation*}
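As a worked illustration of Equation (3), the sketch below computes the score from a list of per-tree path lengths. It assumes the definitions from the iForest paper [7]: E[h(x)] is the mean path length over the trees, and c(n) = 2H(n − 1) − 2(n − 1)/n with H(i) ≈ ln(i) + 0.5772156649 is the average path length of an unsuccessful binary search tree lookup over n points. The function names are hypothetical, and the path lengths are taken as given rather than produced by real trees.

```python
import math

EULER_GAMMA = 0.5772156649


def average_path_length(n):
    """Normalizer c(n) from the iForest paper; this sketch assumes n > 2."""
    if n <= 2:
        raise ValueError("this sketch assumes a subsample size n > 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def iforest_score(path_lengths, n):
    """Equation (3): 2 ** (-E[h(x)] / c(n)), with E taken over the trees."""
    mean_h = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_h / average_path_length(n))


# A point isolated after few splits scores higher than one buried deep in the trees.
print(iforest_score([2, 3, 2], n=256), iforest_score([11, 12, 10], n=256))
```

A mean path length equal to c(n) gives a score of exactly 0.5, which is why subtracting 0.5 is a natural way to center the scores.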
Proposed Method
In this section, we first describe our semi-supervised outlier detection method. Then, we introduce our active outlier detection method based on our semi-supervised outlier detection method.
A. Semi-Supervised Outlier Detection Method
We design our semi-supervised outlier detection method based on label spreading and iForest. The main idea of our method is to use a graph to smoothly fit the unsupervised outlier score and the available labels. The fitting process pursues the following goals: 1) labeled data keep their labels, 2) densely connected points obtain similar values, and 3) the result is consistent with the unsupervised outlier scores of most unlabeled data. From the perspective of supervised learning, our method can be regarded as a regression with partial labels and unsupervised outlier scores as the target and a graph structure as the regularizer. From the perspective of semi-supervised learning, our method can be regarded as semi-supervised regression using a graph-based method with two regularizers, a graph structure and an unsupervised outlier score list, which correspond to the two terms of the quadratic loss (Equation 1).
Compared with the original label spreading algorithm, we make the following extensions to adapt it to the problem of outlier detection:
Instead of a fully connected graph, we use a $k$-nearest neighbor ($k$NN) graph, which allows our method to scale to large datasets. We also provide an empirical method to set the parameter of the Gaussian kernel.
We first transform the outlier score into soft pseudo-labels and then apply a regularization that pulls the values of unlabeled data toward the pseudo-labels. In the original label spreading method, by contrast, the label values of unlabeled data are pulled toward 0. We also provide a method to fuse the pseudo-labels and true labels.
We assign a large $\mu$ (Equation 1) to labeled data so that they keep their initial label values. This is important because labeled data are few and generally have a low node degree in our problem setting, which would cause them to deviate significantly from their initial values if the same $\mu$ were used for all points.
The algorithmic flow of our method is shown in Algorithm 2. In this algorithm, \begin{equation*} w_{ij} = e^{- \frac {|| \mathbf {x}_{i} - \mathbf {x}_{j}||^{2}}{2 \sigma ^{2} }}, \tag{4}\end{equation*}
\begin{equation*} f_{i}^{(t+1)} = \alpha _{i} \sum _{j} \frac {w_{ij}}{\sqrt {d_{i}}\sqrt {d_{j}}} f_{j}^{(t)} + (1-\alpha _{i}) f_{i}^{(0)}. \tag{5}\end{equation*}
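Equations (4) and (5) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the authors' code: the graph is a directed kNN graph built by brute force, `sigma` is passed in by hand (the empirical rule for choosing it is not reproduced here), the degree d_i is taken to be the row sum of the weights, and α_i = 1/(1 + μ_i), so a large μ_i for a labeled point corresponds to a small α_i. All names and the toy data are hypothetical.

```python
import math


def knn_gaussian_graph(points, k, sigma):
    """Directed kNN graph: w[i][j] > 0 only if j is among the k nearest neighbors of i."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n) if j != i
        )
        for dist2, j in dists[:k]:
            w[i][j] = math.exp(-dist2 / (2.0 * sigma ** 2))  # Equation (4)
    return w


def spread(w, f0, alpha, iters=300):
    """Iterate Equation (5) with a per-point alpha[i] (assumes every degree is positive)."""
    n = len(w)
    d = [sum(row) for row in w]  # assumption: degree is the row sum
    f = list(f0)
    for _ in range(iters):
        f = [
            alpha[i] * sum(w[i][j] / math.sqrt(d[i] * d[j]) * f[j]
                           for j in range(n) if w[i][j] > 0.0)
            + (1.0 - alpha[i]) * f0[i]
            for i in range(n)
        ]
    return f


# Point 0 is a labeled outlier: a small alpha keeps it close to its target value.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (3.0, 3.0), (3.2, 3.0), (3.0, 3.2)]
targets = [0.3, -0.1, -0.1, -0.2, -0.2, -0.2]
alphas = [0.01, 0.9, 0.9, 0.9, 0.9, 0.9]
graph = knn_gaussian_graph(points, k=2, sigma=0.5)
print([round(v, 3) for v in spread(graph, targets, alphas)])
```

Because each row keeps only k nonzero weights, storing the graph as an adjacency list instead of the dense matrix used here would keep memory proportional to the number of edges.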
Algorithm 2 Graph-Based Semi-Supervised Outlier Detection (GSSOD)
Compute the outlier score of iForest
Compute the diagonal degree matrix
Compute the spreading matrix
Iterate
The space complexity of GSSOD is
B. Active Outlier Detection Method
Active and semi-supervised learning can both learn with partially labeled data. However, labeled data are passively provided in semi-supervised learning but actively selected in active learning. Hence, the data selection strategy is crucial in active learning [9]. One commonly used data selection strategy is uncertainty sampling, which selects the samples for which the current model is least certain about the correct output [36]. These samples are typically near the classification hyperplane.
For outlier detection, however, we generally cannot obtain a reliable classification hyperplane because of the insufficient sampling of the outlier class. Hence, the uncertainty sampling strategy is also unreliable in active outlier detection. Nevertheless, an effective query strategy is available for active outlier detection, namely, top selection, which always selects the top-ranked data to query [5]. Because the potential classification hyperplane in outlier detection should lie near the abnormal data, top selection is an approximation of uncertainty sampling [10]. Moreover, this strategy also suits the application scenario of outlier detection, in which experts generally analyze the top-ranked data according to the outlier score.
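Top selection itself is simple to state in code. The sketch below is a hypothetical helper, not taken from any of the cited implementations: it returns the indices of the highest-scoring points that have not been labeled yet, with `batch_size` as an assumed parameter for the number of queries per iteration.

```python
def top_selection(scores, labeled, batch_size=1):
    """Return the indices of the top-ranked points that are still unlabeled."""
    candidates = [i for i in range(len(scores)) if i not in labeled]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:batch_size]


# Point 1 has already been labeled, so the next two queries go to points 3 and 2.
print(top_selection([0.1, 0.9, 0.5, 0.7], labeled={1}, batch_size=2))
```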
By incorporating the top selection query strategy, our semi-supervised outlier detection method can be easily extended to an active outlier detection method. The flowchart of our proposed graph-based active outlier detection (GAOD) method is shown in Figure 1, and its details are given in Algorithm 3. Lines 1-3 are the same as those in GSSOD (Algorithm 2). We first compute the outlier score of iForest and then transform it into soft pseudo-labels by subtracting 0.5. The targets of labeled data are set to the maximum and minimum values of the soft pseudo-labels. Lines 4-5 are the same as those in label spreading. We describe the query iteration in lines 9-14. Compared with GSSOD, the major modification is that we use the convergent
Algorithm 3 Graph-Based Active Outlier Detection (GAOD)
Compute the outlier score of iForest
Compute the diagonal degree matrix
Compute the spreading matrix
while
Select top
Obtain labels of
Iterate
end while
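The query loop can be sketched as below. This is a simplified illustration rather than Algorithm 3 itself: it assumes that labeled outliers receive the maximum soft pseudo-label and labeled normal points the minimum, that `oracle` is a hypothetical callback returning 1 for an outlier and 0 for a normal point, and that `refit` is a hypothetical callback standing in for the graph-based spreading step (the example simply passes the targets through unchanged).

```python
def fuse_targets(scores, labels):
    """Soft pseudo-labels (score - 0.5); labeled points are set to the extreme values."""
    pseudo = [s - 0.5 for s in scores]
    highest, lowest = max(pseudo), min(pseudo)
    fused = []
    for i, value in enumerate(pseudo):
        if labels.get(i) == 1:      # labeled outlier (assumed to map to the maximum)
            fused.append(highest)
        elif labels.get(i) == 0:    # labeled normal point (assumed to map to the minimum)
            fused.append(lowest)
        else:
            fused.append(value)
    return fused


def active_loop(scores, oracle, budget, refit, batch_size=1):
    """Query top-ranked unlabeled points until the labeling budget is spent."""
    labels = {}
    f = refit(fuse_targets(scores, labels))
    while len(labels) < budget:
        ranked = sorted((i for i in range(len(f)) if i not in labels),
                        key=lambda i: f[i], reverse=True)
        if not ranked:
            break  # nothing left to query
        for i in ranked[:batch_size]:
            labels[i] = oracle(i)
        f = refit(fuse_targets(scores, labels))
    return labels, f


truth = {0: 1, 1: 0, 2: 0, 3: 1}
labels, f = active_loop([0.8, 0.6, 0.4, 0.7], oracle=lambda i: truth[i],
                        budget=2, refit=lambda targets: targets)
print(labels)
```

In a full pipeline, `refit` would run the spreading iteration on the fused targets so that each new label also changes the ranking of neighboring points.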
Experiments and Results
In this section, we briefly introduce the datasets and the compared methods used in the experiments. The experimental results are then evaluated and analyzed.
A. Experimental Settings
Datasets. We use 12 real-world datasets in our experiments. A summary of these datasets is shown in Table 1. Abalone, Human Activity Recognition (HAR), Satellite, and Seismic are from the UCI Machine Learning Repository [37]. MNIST [38] is the well-known handwritten digit dataset. Cardio, Covertype, Mammography, Optdigits, Pendigits, and Shuttle are taken from the Outlier Detection Data Sets (ODDS) [39]. These datasets were originally used for classification. Following the paradigm in most outlier detection studies [1], the following transformation is applied to generate data for outlier detection. For datasets with unbalanced classes, the majority classes are used as normal data, while the minority classes are used as outliers. For datasets with balanced classes, several classes are uniformly downsampled to create minority classes, and the sampled data are used as outliers. The other classes, which are not downsampled, are used as normal data.
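The transformation for class-balanced datasets can be sketched as follows. The helper below is hypothetical and only illustrates the idea: the classes named in `outlier_classes` are uniformly downsampled to `n_outliers` points and marked as outliers, and every other class is kept in full as normal data. The class labels, the number of outliers, and the seed are placeholder values, not the settings used in the experiments.

```python
import random


def make_outlier_task(class_labels, outlier_classes, n_outliers, seed=0):
    """Keep all normal points and a uniform subsample of the outlier classes."""
    rng = random.Random(seed)
    normal = [i for i, c in enumerate(class_labels) if c not in outlier_classes]
    pool = [i for i, c in enumerate(class_labels) if c in outlier_classes]
    sampled = rng.sample(pool, n_outliers)  # uniform downsampling without replacement
    keep = sorted(normal + sampled)
    is_outlier = [1 if class_labels[i] in outlier_classes else 0 for i in keep]
    return keep, is_outlier


# Toy balanced dataset: 50 points of class 0 and 50 of class 1; class 1 becomes the outlier class.
toy_labels = [0] * 50 + [1] * 50
kept, flags = make_outlier_task(toy_labels, outlier_classes={1}, n_outliers=5)
print(len(kept), sum(flags))
```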
The datasets with an asterisk in Table 1 are the original dataset. We uniformly downsample them because of the high time and space complexity of the baseline methods. The time and space complexity of SSAD [2] are both
Baselines. We compare our method with the following baselines.
SSC is a semi-supervised classification method based on the original label spreading algorithm [8]. For outlier detection, SSC is configured as a two-class classifier. Given that the value of $F$ indicates affinity to a specific class and hard labels are derived using the $\arg\max$ function, we set the outlier score of SSC as $F[:, 1] - F[:, 0]$. Parameter $\alpha$ is set to 0.99, the value also used in [8].
SSAD [2] is a state-of-the-art semi-supervised outlier detection method based on OC-SVM [3]. SSAD uses the radial basis function (RBF) kernel with parameter $2m\sigma_{X}^{2}$, which is in accord with the multidimensional Gaussian distribution. The other parameters use their default settings: $C_{p}=1.0$, $C_{n}=1.0$, $C_{u}=1.0$, and $\kappa = 1.0$. For active outlier detection, we only use the top selection strategy, because this strategy has been demonstrated to be more effective than the combination strategy called margin and cluster [5]. We use the procedure provided by the authors.
EAAD-L [5] is an ensemble-based active outlier detection approach that uses LODA [21] to create the base detectors. We adopt the public implementation provided by the authors. The parameters are set to the recommended values in [5], i.e., $\tau = 0.03$, $C_{A} = 100$, and $C_{\xi} = 1000$. The original EAAD-L is designed for active outlier detection and uses an alternating optimization method to compute the ensemble weights. The alternation is performed only once in each active query iteration. For the semi-supervised experiments, we increase the number of alternations to 100, with early stopping when the ensemble weights converge. We find that the alternating optimization method used in EAAD-L does not converge on several datasets.
EAAD-T [6] is an ensemble-based active outlier detection approach that uses iForest tree nodes as base detectors. EAAD-T is a follow-up work by the same group of researchers. Compared with EAAD-L, EAAD-T is more fine-grained. In this regard, EAAD-T is the most similar to our method. The implementation procedure and parameter settings are the same as those in EAAD-L.
FBiForest [22] also uses the nodes of iForest as base detectors, in the same way as EAAD-T. However, this method adopts an online convex optimization method to optimize the ensemble weights. Hence, we only compare with FBiForest in active outlier detection. We adopt the public implementation provided by the authors. The parameters are set to the recommended values in [22].
Evaluation Measurement. We use the area under the receiver operating characteristic (ROC) curve (AUC) as the metric for evaluating the performance of outlier detection. ROC curves plot the true positive rate against the false positive rate. Intuitively, AUC measures the rank accuracy of placing outliers ahead of normal data; this practice is extensively adopted in outlier detection research [1]. For active outlier detection, we also compare the number of true outliers queried by each method.
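The ranking interpretation of AUC can be written down directly: it is the fraction of (outlier, normal) pairs in which the outlier receives the higher score, with ties counted as one half. The sketch below uses this pairwise definition, which is quadratic in the number of points and intended only for illustration; the function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Fraction of (outlier, normal) pairs ranked correctly; ties count as 0.5."""
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    normals = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in outliers:
        for q in normals:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(outliers) * len(normals))


# Three of the four (outlier, normal) pairs are ordered correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # prints 0.75
```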
Parameter Settings. Our method has three parameters: number of nearest neighbors
B. Semi-Supervised Outlier Detection
For the labeled data of semi-supervised outlier detection, we assume that they come from an expert’s feedback on the top-ranked points detected by an unsupervised outlier detection method. We choose iForest for this experiment. To ensure comparability, every semi-supervised method is provided with the same labeled data. The number of labeled data is selected in accordance with the number of true outliers in each dataset. Considering the space limit, we only report the results of given
As shown in Tables 2 and 3, the performance of GSSOD is better than or comparable with the best of the baselines. The semi-supervised classification method SSC is unreliable: although it obtains the highest AUCs on several datasets, its results on other datasets are even worse than random guessing. Among all the semi-supervised outlier detection methods, SSAD is the worst. EAAD-T is better than EAAD-L because EAAD-T is more fine-grained. Our method is also fine-grained. Although EAAD-T achieves better performance than GSSOD on 4 out of 12 datasets, the one-sided Wilcoxon signed-rank test between EAAD-T and GSSOD still rejects the null hypothesis at a significance level of 0.05. The results of the statistical tests are presented in Table 4. They also show that our method is better than EAAD-T with more confidence when fewer labeled data are available.
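For readers unfamiliar with the test, the sketch below computes an exact one-sided Wilcoxon signed-rank p-value by enumerating all sign assignments, which is feasible for a dozen paired results. It is an illustrative implementation under common conventions (zero differences are dropped and tied absolute differences share the average rank), not the procedure used to produce Table 4, and the paired numbers in the example are toy values rather than AUCs from the experiments.

```python
from itertools import product


def signed_ranks(x, y):
    """Nonzero paired differences and the average ranks of their absolute values."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average = (start + end) / 2.0 + 1.0  # ranks are 1-based
        for t in range(start, end + 1):
            ranks[order[t]] = average
        start = end + 1
    return diffs, ranks


def wilcoxon_one_sided(x, y):
    """Exact p-value for the alternative that x tends to be larger than y."""
    diffs, ranks = signed_ranks(x, y)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    hits = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        total += 1
        if sum(r for s, r in zip(signs, ranks) if s) >= w_plus - 1e-12:
            hits += 1
    return w_plus, hits / total


# Toy paired values: x is larger on every pair, so the p-value is 1 / 2**5.
statistic, p_value = wilcoxon_one_sided([5, 7, 9, 11, 13], [4, 5, 6, 7, 8])
print(statistic, p_value, p_value < 0.05)
```

Because the test uses only the ranks of the paired differences, a method can lose on a few datasets and still be judged significantly better overall when its losses are small and its wins are large.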
Given that
C. Active Outlier Detection
In the experiments for active outlier detection, we set the budget
Learning curves of the first 6 datasets: AUC vs. query iterations and number of true outliers vs. query iterations. For stochastic methods, we run experiments 10 times and only report the mean for clarity.
Learning curves of the last 6 datasets: AUC vs. query iterations and number of true outliers vs. query iterations. For stochastic methods, we run experiments 10 times and only report the mean for clarity.
As indicated in the figures, our method achieves the best performance in both measurements compared with the baselines on most datasets. When measured by the number of queried true outliers, FBiForest is the closest to our method. However, FBiForest suffers a large drop in AUC on many datasets as more labeled data are obtained, and the other baselines have the same problem. This result shows that our method can better balance supervised information and unsupervised information. Hence, our method can induce a better detection model when given the same query budget. For the Mammography and Satellite datasets, the baselines are better than our method when measured by AUC. We believe that this result can be attributed to these methods being discriminative and our method being data-based and hence more fine-grained. For a discriminative model, a slight change can considerably affect the final result. In some cases, the change may bring considerable performance improvement; however, the impact may also be negative. The same reason explains why our method is more stable. Compared with an occasional performance improvement, we believe that avoiding negative cases is more important for outlier detection.
Conclusion
In this paper, we first propose a graph-based semi-supervised outlier detection method (GSSOD) and then propose an active outlier detection method based on GSSOD. GSSOD adds a mechanism to incorporate labeled data into an unsupervised outlier detection method. Although the unsupervised outlier detection method we use is iForest, other methods can also be adopted. Extensive experiments show that our semi-supervised outlier detection method is comparable with the best of the state-of-the-art approaches, and our active outlier detection method outperforms state-of-the-art methods in terms of AUC and the number of true outliers queried. In future work, we will consider using recently proposed graph convolutional networks [40]–[42] in semi-supervised outlier detection. We will also develop an advanced data query strategy for active outlier detection. Given that the active outlier detection used in this study is pool-based, we will also consider extending our method to the stream setting.