
A Graph-Based Method for Active Outlier Detection With Limited Expert Feedback



Abstract:

Labeled data, particularly for the outlier class, are difficult to obtain. Thus, outlier detection is typically regarded as an unsupervised learning problem. However, there are still opportunities to obtain a few labeled data; for example, a human analyst can give feedback on a few points when he/she examines the results of an unsupervised outlier detection method. Moreover, widely used unsupervised outlier detection methods can neither take such labeled data into consideration nor use them properly. In this study, we first propose a graph-based method that endows an unsupervised method with the ability to consider a few labeled data. We then extend our semi-supervised method to active outlier detection by incorporating a query strategy that selects the top-ranked outliers. Comprehensive experiments on 12 real-world datasets demonstrate that our semi-supervised outlier detection method is comparable with the best of state-of-the-art approaches, and our active outlier detection method outperforms state-of-the-art methods.
Flowchart of the proposed graph-based active outlier detection method.
Published in: IEEE Access ( Volume: 7)
Page(s): 152267 - 152277
Date of Publication: 16 October 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Outlier detection is a classic problem in data mining with many applications, such as network intrusion detection, environmental monitoring, and fraud detection. Most studies on outlier detection adopt unsupervised methods [1], primarily because the insufficient sampling of the outlier class makes determining a classification hyperplane between normal data and outliers infeasible. Hence, supervised methods are disregarded and unsupervised methods are adopted. The latter are generally based on specific assumptions, such as the low-density or cluster assumption. The low-density assumption holds that outliers are located in low-density areas of the input space, and the cluster assumption typically holds that normal data form clusters and non-cluster data tend to be outliers. However, these preset assumptions do not always accord with the semantic outliers; when they do not hold, a high false positive rate results. Another disadvantage of unsupervised outlier detection methods is that they cannot consider labeled data. In real-world applications of outlier detection, a few labeled data can still be obtained. For example, a human analyst can assign labels to some data when he/she examines the results of an unsupervised outlier detection method. Thus, a mechanism that considers label information is necessary when applying unsupervised methods.

When a few labeled data are provided, the problem is transformed into a semi-supervised classification problem. However, traditional semi-supervised classification methods suffer from the same problem as supervised methods when applied to outlier detection. Thus, they cannot be used directly; moreover, it has been demonstrated that semi-supervised outlier detection should be based on an unsupervised outlier detection method [2]. Several attempts have been made to address this problem, most of which build on unsupervised outlier detection methods. In [2], the authors added constraints for labeled data to the one-class support vector machine (OC-SVM) [3], a classic unsupervised outlier detection method. Their approach has limited performance and incurs high computational complexity and memory requirements. In [4], the authors proposed optimizing the weights of an ensemble-based outlier detection method to fit the labeled data. They restricted the final result to be a convex combination of base outlier detectors. Although this constraint preserves the unsupervised method's ability to detect novel outliers, it also considerably limits the capacity of the model. In [5], the authors replaced the convex combination with a general linear combination to increase model capacity. However, the general linear combination allows negative weights, which can deteriorate the performance of the combined detector. In [6], the authors alleviated this problem by using much more fine-grained base detectors. Although fine-grained base detectors can reduce the negative impact of negative weights, the problem still exists in some cases.

Motivated by the use of fine-grained base detectors in [6], we propose our semi-supervised outlier detection method. We push the granularity to the extreme: every point has a free parameter (i.e., its label value). However, this setting has too much freedom. To obtain a meaningful result, we add two regularizers: the graph structure and the unsupervised outlier score. The graph is a directed nearest neighbor graph. For the unsupervised method, we choose Isolation Forest (iForest) [7] because of its excellent performance and low overhead; other unsupervised methods can also be used. The algorithmic flow of our method is very simple owing to the quadratic form of the loss function. Our method first obtains the outlier score from iForest and creates a nearest neighbor graph. It then fuses the labels and the outlier score. Subsequently, the outlier score and labels are spread over the graph using an iterative method based on the label spreading algorithm [8].

Active learning is another form of learning with a small proportion of labeled data and is closely related to semi-supervised learning. Labeled data are passively provided in semi-supervised learning, whereas they are actively acquired in an iterative manner in active learning. Hence, the data query strategy is the key to active learning, and general active learning research focuses on developing advanced query strategies [9]. Unlike general active learning, active outlier detection usually uses top selection as its query strategy, i.e., it always chooses the top-ranked samples to be labeled by a human analyst/expert [5], [6], [10]. In [10], the authors demonstrated that top selection approximates the uncertainty sampling strategy that is widely used in general active learning. Hence, the key to active outlier detection also lies in how semi-supervised outlier detection is performed. By incorporating the top selection query strategy, our semi-supervised method can be easily extended to active outlier detection.

We conduct comprehensive experiments for semi-supervised and active outlier detection on 12 public and real-world datasets. The results show that our semi-supervised outlier detection method is comparable with the best of state-of-the-art approaches, and our active outlier detection method outperforms state-of-the-art methods.

We summarize our major contributions as follows.

  • We propose a method for semi-supervised outlier detection based on an unsupervised outlier detection method and a graph-based semi-supervised learning method. Our major contribution is customizing the graph-based semi-supervised learning method for outlier detection.

  • We propose an active outlier detection method based on our semi-supervised outlier detection method.

  • We conduct extensive experiments that demonstrate the effectiveness of our semi-supervised and active outlier detection methods.

The remainder of the paper is organized as follows. We review related work in Section II. We introduce the problem settings and the preliminaries of this study in Section III. We describe the proposed methods in Section IV. We empirically evaluate the methods in Section V. Finally, we conclude this study in Section VI.

SECTION II.

Related Work

A. Unsupervised Outlier Detection

A considerable amount of literature on unsupervised outlier detection has been published [1], [11]. Unsupervised outlier detection methods are typically based on certain assumptions. The most widely used is the low-density assumption, which holds that outliers reside in low-density areas of the input space. Methods based on this assumption frequently involve explicit or implicit density estimation. Although explicit density estimation methods, such as statistical model-based methods and kernel density estimation, can be used directly, they are not optimized for outlier detection, which is concerned only with the low-density area. Hence, many outlier detection methods take an implicit approach. Nearest neighbor-based methods [12] can be interpreted as using nearest neighbors to approximate density. To handle the case in which normal data instances have different densities, the local outlier factor (LOF) [13] was proposed; LOF alleviates this problem by comparing each point's density with those of its neighbors. iForest [7] uses the tree depth of a random space partition as the measure of outlier degree; the tree depth is monotonic with respect to density. OC-SVM [3] constructs a hyperplane to separate most data instances from a few potential outliers; it is essentially an asymptotically consistent density level set estimator when using Gaussian kernels [14]. Another commonly used assumption is the cluster assumption, which holds that normal data form clusters and that data points far from clusters or belonging to small clusters can be considered outliers [15]. As outliers are only by-products of clustering methods and clustering methods are not optimized for outlier detection, the efficiency and effectiveness of clustering-based methods are relatively low.

Recent studies [16]–[18] have attempted to combine deep learning and traditional outlier detection. However, these methods are also unsupervised, and thus they cannot be used directly in semi-supervised outlier detection.

B. Semi-Supervised and Active Outlier Detection

Compared with those on unsupervised outlier detection, studies on semi-supervised outlier detection are relatively few. In [19], the authors proposed a mixture model-based method. However, this method is limited because it requires assuming the distributions of normal data and outliers, which are difficult to determine. In [3], the authors provided an extension of OC-SVM that considers labeled data. However, this method can only use labeled outliers, while labeled normal data are neglected. In [2], the authors also proposed a method based on OC-SVM that can consider both labeled normal data and labeled outliers. Like OC-SVM itself, these two OC-SVM-based methods suffer from high computational and memory complexity. In [20], the authors proposed a cluster-based semi-supervised outlier detection method for water analytics.

Another category of semi-supervised outlier detection methods is based on ensemble outlier detection [4]–[6]. These methods first construct a number of base outlier detectors and then adjust the combination weights of the base detectors to fit the available labels. In [4], the authors constrained the weights to form a convex combination, which limits model capacity. In [5], the authors used the base detectors of the lightweight on-line detector of anomalies (LODA) [21] and relaxed the convex combination to a general linear combination. Although such relaxation can increase model capacity, it may deteriorate the performance of the combined detector if there are too many negative weights. Hence, this method requires careful parameter settings to avoid this issue. In [6], the authors alleviated this problem by using fine-grained base detectors, and the number of detectors was also increased considerably; a fine-grained detector can reduce the negative influence caused by negative weights. In [22], the authors adopted the same approach as in [6] but proposed an online optimization algorithm to accelerate computation. Our method is also fine-grained: every point can be considered a base detector. Moreover, our method can spread label values to each point's nearest neighbors.

For data query strategies in active outlier detection, recent studies have shown that the top selection strategy is effective [5], [6], [10]. This strategy greedily selects the most abnormal data, i.e., the data ranked at the top according to the outlier score. In [2], the authors proposed another strategy, called margin and cluster, which has been demonstrated to be inferior to the top selection strategy [5]. In our active outlier detection method, we also adopt the top selection strategy.

C. Semi-Supervised Learning

Semi-supervised learning is a well-studied problem [23], [24]. The graph-based method is a classic approach to semi-supervised learning. Graph-based methods first define a graph wherein the nodes are data points and the (generally weighted) edges represent similarities between data points. Then, labels are assigned to unlabeled data points using the graph and the available labeled data points. Graph-based methods can be broadly categorized into random walk-based methods [25]–[27] and iterative methods [8], [28], [29]. These methods can also be interpreted as the minimization of a quadratic cost function defined by a graph and labeled data points [24].

Besides graph-based methods, other semi-supervised learning methods include generative models, low-density separation methods, and other heuristic approaches. Recently, developments in deep generative models have facilitated their use in semi-supervised learning. For example, in [30] and [31], the authors use variational autoencoders, and in [32] and [33], the authors use generative adversarial nets. However, general semi-supervised methods typically assume that labeled data are i.i.d. with unlabeled data and that the data are class-balanced; thus, these methods cannot be used directly in semi-supervised outlier detection, which involves novel types of outliers and is inherently class-imbalanced. Few-shot learning [34], [35] is also a learning task with few labeled data. However, in the problem settings of few-shot learning, only the target classes are provided with a few data points, while the other classes are provided with a large number of labeled data. Hence, it differs from our problem settings, and these methods also cannot be used directly in semi-supervised outlier detection.

SECTION III.

Problem Settings and Preliminaries

In this section, we first introduce the problem settings for this study. Then, we provide a brief introduction to the preliminaries of our methods.

A. Problem Settings

Let \mathcal {X} \subseteq \mathbb {R}^{m} denote the input space, where m is the input dimension. For unsupervised outlier detection, given an unlabeled point set \mathbf {D}_{u} = \{ \mathbf {x}_{1}, \ldots, \mathbf {x}_{n} \} , the aim is to learn an outlier scoring function f_{s}: \mathcal {X} \to \mathbb {R} . The value of f_{s}(\mathbf {x}) indicates the outlier degree of data point \mathbf {x} : the larger the value of f_{s}(\mathbf {x}) , the higher the possibility that \mathbf {x} is an outlier. The outlier score can thus provide a data-processing priority for a human analyst, a property that is important in outlier detection applications. For semi-supervised outlier detection, given a partially labeled dataset \mathbf {D}_{s} = \{(\mathbf {x}_{1}, y_{1}), \ldots, (\mathbf {x}_{l}, y_{l}), \mathbf {x}_{l+1}, \ldots, \mathbf {x}_{n} \} , the objective is the same as in unsupervised outlier detection: to learn an outlier scoring function. For active outlier detection, we assume that an expert/human analyst can assign labels to a total of B points, queried in batches of b at every query iteration. After the labels are obtained at each iteration, the problem is transformed into a semi-supervised outlier detection problem.

B. Preliminaries

1) Label Spreading

Label spreading [8] is a graph-based semi-supervised learning method for multi-class classification. The basic idea of this algorithm is to iteratively spread each point's label information to its neighbors until a global stable state is achieved [8]. The iterative algorithm is shown in Algorithm 1. \mathbf {F}^{(0)} , one of the inputs of the algorithm, consists of one-hot targets for labeled data and zeros for unlabeled data. The final labels are obtained by applying the \arg \max function over every row of \mathbf {F} .

Algorithm 1 Label Spreading [8]

Input: \mathbf {X}, \mathbf {F}^{(0)}, \sigma, \alpha \in (0, 1)
Output: \mathbf {F}
1: Compute the affinity matrix \mathbf {W} : w_{ij} \gets \exp \left({- \frac {|| \mathbf {x}_{i} - \mathbf {x}_{j}||^{2}}{2 \sigma ^{2} }}\right) for i \neq j (and w_{ii} \gets 0 )
2: Compute the diagonal degree matrix \mathbf {D} by d_{ii} \gets \sum _{j} w_{ij}
3: Compute the spreading matrix \mathbf {S} = \mathbf {D}^{-1/2} \mathbf {W} \mathbf {D}^{-1/2}
4: Iterate \mathbf {F}^{(t+1)} \gets \alpha \mathbf {S} \mathbf {F}^{(t)} + (1-\alpha) \mathbf {F}^{(0)} until convergence

For an undirected graph, i.e., when \mathbf {W} is symmetric, the label spreading algorithm is equivalent to minimizing the following quadratic loss function:

\begin{align*} \mathcal {L}(\boldsymbol {f}) &= \frac {1}{2} \sum _{i, j} w_{ij} \left({\frac {f_{i}}{\sqrt {d_{i}}} - \frac {f_{j}}{\sqrt {d_{j}}} }\right)^{2} + \mu || \boldsymbol {f} - \boldsymbol {f}^{(0)} ||^{2} \tag{1}\\ &= \boldsymbol {f} ^{T} \mathbf {L} \boldsymbol {f} + \mu || \boldsymbol {f} - \boldsymbol {f}^{(0)}||^{2}, \tag{2}\end{align*}

where \mathbf {L} = \mathbf {I} - \mathbf {D}^{-1/2} \mathbf {W} \mathbf {D}^{-1/2} = \mathbf {I} - \mathbf {S} and \mu > 0 . The first term of \mathcal {L} is a regularization term that makes \boldsymbol {f} consistent with the graph structure. The second term makes \boldsymbol {f} consistent with the initial labels. \mu controls the balance between the two terms, and its relationship with \alpha in Algorithm 1 is \alpha = 1/(1+\mu) . Notably, the initial labels of the unlabeled data are set to zeros, and the second term can be factored as || \boldsymbol {f} - \boldsymbol {f}^{(0)}||^{2} = || \boldsymbol {f}_{l} - \boldsymbol {f}_{l}^{(0)}||^{2} + || \boldsymbol {f}_{u}||^{2} . Thus, the minimization drives the value of f_{i} toward zero for unlabeled data.
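To make the iteration concrete, the following is a minimal NumPy sketch of Algorithm 1; the function name, the convergence tolerance, and the iteration cap are our own illustrative choices, not part of the original algorithm.

```python
import numpy as np

def label_spreading(W, F0, alpha=0.99, tol=1e-3, max_iter=1000):
    """Minimal sketch of Algorithm 1 (label spreading).

    W  : (n, n) symmetric affinity matrix with zero diagonal
    F0 : (n, c) initial label matrix (one-hot rows for labeled
         points, zero rows for unlabeled points)
    """
    d = W.sum(axis=1)                                   # degrees d_ii
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    F = F0.copy()
    for _ in range(max_iter):
        F_new = alpha * (S @ F) + (1 - alpha) * F0      # spreading step
        converged = np.abs(F_new - F).sum() < tol       # 1-norm test
        F = F_new
        if converged:
            break
    return F
```

Final hard labels then follow from the row-wise arg max, as described above.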

2) iForest

iForest [7] is a state-of-the-art outlier detection method that exhibits high performance in outlier detection and is parameter-free. iForest consists of a number of trees. Each tree is constructed by uniformly randomly selecting a feature and a threshold to partition the data until only one data point is left (isolation). The path length h(\mathbf {x}) from the root to the leaf is used to compute the outlier score. The expected path length is monotonically related to the probability density, and thus iForest is based on the low-density assumption. The outlier score of iForest is formulated as

\begin{equation*} f_{s}(\mathbf {x}) = 2^{-E[h(\mathbf {x})]/c(n)}, \tag{3}\end{equation*}

where c(n) is the average path length of an unsuccessful search in a binary search tree and n is the number of data points used in the tree. With c(n) and the exponential function, the outlier score is squashed into (0,1) . c(n) is also an estimate of the path length for uniformly distributed data. When the value of f_{s}(\mathbf {x}) is larger than 0.5, data point \mathbf {x} is a potential outlier.
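In practice, the score in Equation 3 can be obtained from an off-the-shelf implementation. The sketch below uses scikit-learn's IsolationForest; its score_samples returns the negated score of the original iForest paper, so negating it recovers f_{s} in (0,1) . This mapping is our reading of the scikit-learn documentation, not something stated in this paper, and the helper name is our own.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def iforest_scores(X, n_estimators=100, random_state=0):
    """Return outlier scores in (0, 1) as in Equation 3.

    scikit-learn's score_samples gives the *opposite* of the
    score defined in the original iForest paper, so we negate it.
    """
    clf = IsolationForest(n_estimators=n_estimators,
                          random_state=random_state).fit(X)
    return -clf.score_samples(X)   # f_s(x) = 2^{-E[h(x)]/c(n)}

# Usage sketch: points with scores above 0.5 are potential outliers.
# X = np.random.randn(1000, 8); f_s = iforest_scores(X)
```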

SECTION IV.

Proposed Method

In this section, we first describe our semi-supervised outlier detection method. Then, we introduce our active outlier detection method based on our semi-supervised outlier detection method.

A. Semi-Supervised Outlier Detection Method

We design our semi-supervised outlier detection method based on label spreading and iForest. The main idea of our method is to use a graph to smoothly fit the unsupervised outlier score and the available labels. The fitting process pursues the following goals: 1) keep the labels of labeled data; 2) assign similar values to densely connected points; and 3) stay consistent with the unsupervised outlier scores of most unlabeled data. From the perspective of supervised learning, our method can be regarded as a regression with the partial labels and unsupervised outlier scores as targets and the graph structure as regularization. From the perspective of semi-supervised learning, our method can be regarded as graph-based semi-supervised regression with two regularizers, the graph structure and the unsupervised outlier score list, which correspond to the two terms of the quadratic loss (Equation 1).

Compared with the original label spreading algorithm, we make the following extensions to adapt it to outlier detection:

  • Instead of a fully connected graph, we use a {k} -nearest neighbor ({k} NN) graph, which makes our method scale to large datasets. We also provide an empirical method to set the parameter of the Gaussian kernel.

  • We first transform the outlier score into soft pseudo-labels and then regularize the values of unlabeled data toward these pseudo-labels, whereas the original label spreading method pulls the label values of unlabeled data toward 0. We also provide a method to fuse the pseudo-labels and the true labels.

  • We use a large \mu (Equation 1) for labeled data to make them keep their initial label values. This is important because labeled data are few and generally have low node degrees in our problem setting, which would cause them to deviate significantly from their initial values if the same \mu were used.

The algorithmic flow of our method is shown in Algorithm 2. In this algorithm, \mathbf {W} is the adjacency matrix of the {k} NN graph. Notably, the graph we use is directed. If \mathbf {x}_{j} (j \neq i ) is a neighbor of \mathbf {x}_{i} , then an edge exists from \mathbf {x}_{i} to \mathbf {x}_{j} , and the weight of the edge is computed by

\begin{equation*} w_{ij} = e^{- \frac {|| \mathbf {x}_{i} - \mathbf {x}_{j}||^{2}}{2 \sigma ^{2} }}, \tag{4}\end{equation*}

where \sigma is the standard deviation of the Gaussian function, which controls the decay of the weight as the distance increases. We suggest setting \sigma to half of the 95th percentile of the k-th nearest neighbor distances; the purpose of this setting is to keep the distribution of values in \mathbf {W} balanced. The algorithm first computes the outlier score of iForest, \boldsymbol {f}_{s} , and then transforms it into pseudo-labels \boldsymbol {f}^{(0)} by subtracting 0.5. To fuse the pseudo-labels and the true labels, we set the values of labeled normal data to the minimum of \boldsymbol {f}^{(0)} and the values of labeled outliers to the maximum (Lines 3-4). Lines 5-6 are the same as those in label spreading. Lines 7-8 differ from label spreading by changing the \alpha of labeled data to 1-\alpha . \alpha is generally set close to 1; the authors of [8] used 0.99 in their experiments. As illustrated in the previous subsection, \alpha = 1/(1+\mu) ; thus, a smaller value of \alpha corresponds to a larger penalty on deviation from the initial value. The update equation for every point at each iteration is

\begin{equation*} f_{i}^{(t+1)} = \alpha _{i} \sum _{j} \frac {w_{ij}}{\sqrt {d_{i}}\sqrt {d_{j}}} f_{j}^{(t)} + (1-\alpha _{i}) f_{i}^{(0)}. \tag{5}\end{equation*}

As shown in this equation, the new value is a linear combination of the neighbors' values and the initial value. When \alpha _{i} is smaller, the initial value contributes more; we therefore assign labeled data a small \alpha to make them keep their initial values. Notably, the coefficients of the linear combination are normalized by \sqrt {d_{i}}\sqrt {d_{j}} instead of d_{i} . Compared with the latter, the former normalization reduces the neighbors' impact on potential outliers, because in nearest neighbor relationships an outlier can have many normal neighbors.
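As an illustration of the graph construction and the \sigma heuristic above, here is a sketch using scikit-learn's NearestNeighbors; the sparse row-wise (directed) weight matrix and the helper name are our own choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

def knn_graph(X, k=15):
    """Directed kNN graph with Gaussian weights (Equation 4).

    sigma is set to half of the 95th percentile of the k-th
    nearest neighbor distances, as suggested in the text.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: self is nearest
    dist, idx = nn.kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]              # drop self-edges
    sigma = 0.5 * np.percentile(dist[:, -1], 95)     # k-th NN distances
    w = np.exp(-dist**2 / (2 * sigma**2))            # Gaussian weights
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    return csr_matrix((w.ravel(), (rows, idx.ravel())), shape=(n, n))
```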

Algorithm 2 Graph-Based Semi-Supervised Outlier Detection (GSSOD)

Input: \mathbf {X}, \mathbf {y}=[y_{1}, \ldots, y_{l}], \mathbf {W}, \alpha \in (0, 1)
Output: \boldsymbol {f}
1: Compute the outlier score of iForest \boldsymbol {f}_{s} using Equation 3
2: \boldsymbol {f}^{(0)} \gets \boldsymbol {f}_{s} - 0.5
3: t_{o} \gets \max (\boldsymbol {f}^{(0)}) ; t_{n} \gets \min (\boldsymbol {f}^{(0)})
4: \boldsymbol {f}^{(0)} [\mathbf {y} = 1] \gets t_{o} ; \boldsymbol {f}^{(0)} [\mathbf {y} = 0] \gets t_{n}
5: Compute the diagonal degree matrix \mathbf {D} by d_{ii} \gets \sum _{j} w_{ij}
6: Compute the spreading matrix \mathbf {S} = \mathbf {D}^{-1/2} \mathbf {W} \mathbf {D}^{-1/2}
7: \boldsymbol {\alpha } \gets [1-\alpha, \ldots, 1-\alpha, \alpha, \ldots, \alpha]^{T}
8: Iterate \boldsymbol {f}^{(t+1)} \gets \boldsymbol {\alpha } \odot (\mathbf {S} \boldsymbol {f}^{(t)}) + (1- \boldsymbol {\alpha }) \odot \boldsymbol {f}^{(0)} until convergence

The space complexity of GSSOD is O(kn) . The {k} NN search can be accelerated by a kd-tree, for which the time complexity is O(kn\log (n)) . The time complexity of one iteration of GSSOD is O(kn) , and the number of required iterations is generally small (<500), particularly for a small \alpha .
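Putting the pieces together, the following is a minimal sketch of Algorithm 2, built on the iforest_scores and knn_graph helpers sketched earlier (both our own illustrative names); it follows the listed steps but is not the authors' released implementation.

```python
import numpy as np
from scipy.sparse import diags

def gssod(X, labeled_idx, y_labeled, alpha=0.95, tol=1e-3, max_iter=1000):
    """Sketch of GSSOD (Algorithm 2). y_labeled: 1 = outlier, 0 = normal."""
    f0 = iforest_scores(X) - 0.5                 # Lines 1-2: soft pseudo-labels
    t_o, t_n = f0.max(), f0.min()                # Line 3
    f0[labeled_idx] = np.where(y_labeled == 1, t_o, t_n)  # Line 4: fuse labels
    W = knn_graph(X, k=15)
    d = np.asarray(W.sum(axis=1)).ravel()        # Line 5: degrees
    D_inv_sqrt = diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ W @ D_inv_sqrt              # Line 6: spreading matrix
    a = np.full(X.shape[0], alpha)               # Line 7
    a[labeled_idx] = 1 - alpha                   # small alpha pins labeled points
    f = f0.copy()
    for _ in range(max_iter):                    # Line 8: iterate Equation 5
        f_new = a * (S @ f) + (1 - a) * f0
        converged = np.abs(f_new - f).sum() < tol
        f = f_new
        if converged:
            break
    return f
```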

B. Active Outlier Detection Method

Active and semi-supervised learning can both learn with partially labeled data. However, labeled data are passively provided in semi-supervised learning but actively selected in active learning. Hence, the data selection strategy is crucial in active learning [9]. One commonly used strategy is uncertainty sampling, which selects the samples that the current model is least certain about with regard to what the correct output should be [36]. These samples are typically near the classification hyperplane.

For outlier detection, however, we generally cannot obtain a reliable classification hyperplane owing to the insufficient sampling of the outlier class. Hence, the uncertainty sampling strategy is also unreliable in active outlier detection. Nevertheless, an effective query strategy is available for active outlier detection, namely, top selection, which always selects the top-ranked data to query [5]. Since the potential classification hyperplane in outlier detection should lie near the abnormal data, top selection is an approximation of uncertainty sampling [10]. Moreover, this strategy also suits the application scenario of outlier detection, in which experts generally analyze the top-ranked data according to the outlier score.

By incorporating the top selection query strategy, our semi-supervised outlier detection method can be easily extended to an active outlier detection method. The flowchart of the proposed graph-based active outlier detection (GAOD) method is shown in Figure 1, and the details are given in Algorithm 3. Lines 1-3 are the same as those in Algorithm 2 (GSSOD): we first compute the outlier score of iForest and then transform it into soft pseudo-labels by subtracting 0.5; the targets of labeled data are set to the maximum and minimum values of the soft pseudo-labels. Lines 4-5 are the same as those in label spreading. The query iteration is described in Lines 9-14. Compared with GSSOD, the major modification is that we use the converged \boldsymbol {f} as the initial value of the next query iteration. Because only b elements of \boldsymbol {f}^{(0)} and \boldsymbol {\alpha } are modified in every query iteration, the difference in \boldsymbol {f} between two successive query iterations should be small. Hence, this modification can considerably reduce the number of iterations required for convergence.

FIGURE 1. Flowchart of the proposed graph-based active outlier detection method.

Algorithm 3 Graph-Based Active Outlier Detection (GAOD)

Input: \mathbf {X}, \mathbf {W}, \alpha \in (0, 1) , budget B , batch b
Output: \boldsymbol {f}
1: Compute the outlier score of iForest \boldsymbol {f}_{s} using Equation 3
2: \boldsymbol {f}^{(0)} \gets \boldsymbol {f}_{s} - 0.5
3: t_{o} \gets \max (\boldsymbol {f}^{(0)}) ; t_{n} \gets \min (\boldsymbol {f}^{(0)})
4: Compute the diagonal degree matrix \mathbf {D} by d_{ii} \gets \sum _{j} w_{ij}
5: Compute the spreading matrix \mathbf {S} = \mathbf {D}^{-1/2} \mathbf {W} \mathbf {D}^{-1/2}
6: \boldsymbol {\alpha } \gets [\alpha, \ldots, \alpha]^{T}
7: \boldsymbol {f} \gets \boldsymbol {f}^{(0)}
8: while B > 0 do
9: Select the top b unlabeled data points according to \boldsymbol {f} : \{ \mathbf {x}_{i_{1}}, \ldots, \mathbf {x}_{i_{b}} \}
10: Obtain labels \mathbf {y}_{b} of the selected points from the expert
11: \boldsymbol {f}^{(0)} [\mathbf {y}_{b} = 1] \gets t_{o} ; \boldsymbol {f}^{(0)} [\mathbf {y}_{b} = 0] \gets t_{n}
12: \boldsymbol {\alpha }[i_{1}, \ldots, i_{b}] \gets 1-\alpha
13: Iterate \boldsymbol {f} \gets \boldsymbol {\alpha } \odot (\mathbf {S} \boldsymbol {f}) + (1- \boldsymbol {\alpha }) \odot \boldsymbol {f}^{(0)} until convergence
14: B \gets B - b
15: end while
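Below is a minimal sketch of the GAOD query loop, reusing the spreading step from the GSSOD sketch above; the oracle callback stands in for the human expert, and all names here are our own illustration, not the authors' code.

```python
import numpy as np

def gaod(f0, S, oracle, alpha=0.95, budget=100, batch=4,
         tol=1e-3, max_iter=1000):
    """Sketch of GAOD (Algorithm 3).

    f0     : initial soft pseudo-labels (iForest score - 0.5)
    S      : spreading matrix D^{-1/2} W D^{-1/2} (e.g., sparse)
    oracle : callable mapping a list of indices to 0/1 labels (1 = outlier)
    """
    f0 = f0.copy()
    t_o, t_n = f0.max(), f0.min()
    a = np.full(f0.shape[0], alpha)
    labeled = np.zeros(f0.shape[0], dtype=bool)
    f = f0.copy()
    while budget > 0:
        # Top selection: query the most abnormal unlabeled points.
        ranked = np.argsort(-f)
        query = [i for i in ranked if not labeled[i]][:batch]
        y = np.asarray(oracle(query))             # expert feedback
        f0[query] = np.where(y == 1, t_o, t_n)
        a[query] = 1 - alpha                      # pin labeled points
        labeled[query] = True
        # Warm start: iterate Equation 5 from the current f.
        for _ in range(max_iter):
            f_new = a * (S @ f) + (1 - a) * f0
            converged = np.abs(f_new - f).sum() < tol
            f = f_new
            if converged:
                break
        budget -= batch
    return f
```

The warm start mirrors the paper's observation that only b entries of \boldsymbol {f}^{(0)} and \boldsymbol {\alpha } change per query round, so few spreading iterations are needed.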

SECTION V.

Experiments and Results

In this section, we briefly introduce the datasets and the compared methods used in the experiments. Afterwards, the experimental results are evaluated and analyzed.

A. Experimental Settings

Datasets. We use 12 real-world datasets in our experiments; a summary is shown in Table 1. Abalone, Human Activity Recognition (HAR), Satellite, and Seismic are from the UCI Machine Learning Repository [37]. MNIST [38] is the famous handwritten digit dataset. Cardio, Covertype, Mammography, Optdigits, Pendigits, and Shuttle are taken from the Outlier Detection Data Sets (ODDS) [39]. These datasets were originally used for classification. Following the paradigm in most outlier detection studies [1], the following transformation is applied to generate data for outlier detection: for datasets with unbalanced classes, the majority classes are used as normal data and the minority classes as outliers; for datasets with balanced classes, several classes are uniformly downsampled to create minority classes, the sampled data are used as outliers, and the classes that were not downsampled are used as normal data.

TABLE 1. Summary of datasets.

The datasets marked with an asterisk in Table 1 are shown with their original sizes; we uniformly downsample them because of the high time and space complexity of the baseline methods. The time and space complexity of SSAD [2] are both O(n^{2}) . The time complexity of the ensemble-based methods [5], [6] is O(n_{l}^{3}) , where n_{l} is the number of labeled data. Thus, these two methods are unsuitable for experimental settings in which the number of labeled data exceeds 1000.

Baselines. We compare our method with the following baselines.

  • SSC is a semi-supervised classification method based on the original label spreading algorithm [8]. For outlier detection, SSC is configured as a two-class classification. Given that the value of F indicates affinity to a specific class and hard labels are derived using the \arg \max function, we set the outlier score of SSC to F[:, 1] - F[:, 0] . Parameter \alpha is set to 0.99, as in [8]. (A minimal sketch of this baseline is given after this list.)

  • SSAD [2] is a state-of-the-art semi-supervised outlier detection method based on OC-SVM [3]. SSAD uses the radial basis function (RBF) kernel with parameter 2 m \sigma _{X}^{2} , which accords with a multidimensional Gaussian distribution. The other parameters use their default settings: C_{p}=1.0 , C_{n}=1.0 , C_{u}=1.0 , and \kappa = 1.0 . For active outlier detection, we only use the top selection strategy, because it has been demonstrated to be more effective than the combined strategy called margin and cluster [5]. We use the implementation provided by the authors.

  • EAAD-L [5] is an ensemble-based active outlier detection approach that uses LODA [21] to create the base detectors. We adopt the public implementation provided by the authors. The parameters are set to the values recommended in [5], i.e., \tau =0.03 , C_{A} = 100 , and C_{\xi } = 1000 . The original EAAD-L is for active outlier detection and uses an alternating optimization method to compute the ensemble weights, with the alternation performed only once per active query iteration. For the semi-supervised experiments, we increase the number of alternations to 100, with early stopping once the ensemble weights converge. We find that the alternating optimization method used in EAAD-L does not converge on several datasets.

  • EAAD-T [6] is an ensemble-based active outlier detection approach that uses iForest tree nodes as base detectors. EAAD-T is the subsequent work by the same group of researchers. Compared with EAAD-L, EAAD-T is more fine-grained; in this regard, EAAD-T is the method most similar to ours. The implementation and parameter settings are the same as those for EAAD-L.

  • FBiForest [22] also uses the nodes of iForest as base detectors, the same as EAAD-T. However, this method adopts an online convex optimization method to optimize the ensemble weights. Hence, we only compare with FBiForest in active outlier detection. We adopt the public implementation provided by the authors. The parameters are set to the values recommended in [22].
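For reference, the SSC baseline can be approximated with scikit-learn's LabelSpreading; the kNN kernel and the score construction below follow the description in the SSC item above, though the exact configuration the experiments used is not specified in this paper, so this is only a plausible sketch.

```python
from sklearn.semi_supervised import LabelSpreading

def ssc_scores(X, y):
    """SSC baseline sketch.

    y uses 1 = outlier, 0 = normal, and -1 marks unlabeled
    points (scikit-learn's convention for semi-supervised fit).
    """
    model = LabelSpreading(kernel='knn', n_neighbors=15,
                           alpha=0.99, max_iter=1000)
    model.fit(X, y)
    F = model.label_distributions_   # rows: per-class affinities
    return F[:, 1] - F[:, 0]         # outlier score F[:, 1] - F[:, 0]
```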

Evaluation Measurement. We use the area under the receiver operating characteristic (ROC) curve (AUC) as the metric for evaluating the performance of outlier detection. ROC curves plot the true positive rate against the false positive rate. Intuitively, AUC measures the rank accuracy of placing outliers ahead of normal data; this practice is extensively adopted in outlier detection research [1]. For active outlier detection, we also compare the number of true outliers queried by each method.
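Computing this metric is a one-liner with scikit-learn; the wrapper and variable names are ours.

```python
from sklearn.metrics import roc_auc_score

def evaluate(y_true, scores):
    """AUC of ranking outliers (y_true = 1) above normal points."""
    return roc_auc_score(y_true, scores)
```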

Parameter Settings. Our method has three parameters: the number of nearest neighbors k , the standard deviation of the Gaussian function \sigma , and the label spreading parameter \alpha . In general, a small value of k (>10) is sufficient to capture the local manifold, and we set k = 15 . We tuned \sigma on a small synthetic dataset and set it to half of the 95th percentile of the k-th nearest neighbor distances; the purpose of this setting is to keep the distribution of values in \mathbf {W} balanced, since by Equation 4 an excessively large \sigma biases the weights toward 1, while an excessively small value biases them toward 0. We set \alpha = 0.95 ; the experiments on the influence of \alpha below show that 0.95 is an appropriate value. The convergence condition is that the 1-norm of f^{(t+1)} - f^{(t)} is less than 10^{-3} or the number of iterations reaches 1000.

B. Semi-Supervised Outlier Detection

For the labeled data in semi-supervised outlier detection, we assume that they come from an expert's feedback on the top-ranked points detected by an unsupervised outlier detection method; we choose iForest for this purpose. To ensure comparability, every semi-supervised method is provided with the same labeled data. The number of labeled data is selected in accordance with the number of true outliers in each dataset. Owing to space limits, we only report the results for 0.5 \times \#outliers and 1.0 \times \#outliers labeled data points in Tables 2 and 3.

TABLE 2. AUC values of semi-supervised experiments with 0.5 \times \#outliers labeled data points. We run the experiments 10 times and report the mean and standard deviation of the AUC values. The top two AUCs are written in bold.

TABLE 3. AUC values of semi-supervised experiments with 1.0 \times \#outliers labeled data points. We run the experiments 10 times and report the mean and standard deviation of the AUC values. The top two AUCs are written in bold.

As shown in Tables 2 and 3, the performance of GSSOD is better than or comparable with the best of the baselines. The semi-supervised classification method SSC is unreliable: although it obtains the highest AUCs on several datasets, on other datasets it is even worse than a random guess. Among all the semi-supervised outlier detection methods, SSAD is the worst. EAAD-T is better than EAAD-L because EAAD-T is more fine-grained, as is our method. Although EAAD-T achieves better performance than GSSOD on 4 of the 12 datasets, the one-sided Wilcoxon signed rank test between EAAD-T and GSSOD still rejects the null hypothesis at a significance level of 0.05. The results of the statistical tests are presented in Table 4. The results also show that our method beats EAAD-T with higher confidence when the number of labeled data is smaller.

TABLE 4. Results of the one-sided Wilcoxon signed rank test between EAAD-T and GSSOD.
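Such a test can be reproduced with SciPy; the helper below takes paired per-dataset AUCs (placeholders to be filled with one's own results, not the paper's numbers).

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_methods(auc_a, auc_b, level=0.05):
    """One-sided Wilcoxon signed rank test: does method A beat method B?

    auc_a, auc_b : paired per-dataset AUC values of the two methods.
    Returns the p-value and whether H0 is rejected at the given level.
    """
    stat, p_value = wilcoxon(np.asarray(auc_a), np.asarray(auc_b),
                             alternative='greater')
    return p_value, p_value < level
```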

Given that \alpha is an important parameter that controls the balance between label smoothness and deviation from the initial values \boldsymbol {f}^{(0)} , we conduct experiments to test the sensitivity to \alpha . As shown in Figure 2, \alpha can influence the final result to a certain extent. However, as \alpha increases, the tendency is not consistent across datasets. Overall, 0.95 is an appropriate value for \alpha . The primary reason we recommend 0.95 instead of 0.99 is that our initial values are meaningful outlier scores rather than the zeros used in label spreading, and a smaller \alpha balances more toward the initial values. Another benefit of setting \alpha = 0.95 is that the number of iterations required for convergence is only approximately half of that when \alpha = 0.99 .

FIGURE 2. Sensitivity to \alpha . The experiments are run 10 times. The number of labeled points is 0.5 \times \#outliers . The datasets are numbered in the same order as in Tables 1 and 2.

C. Active Outlier Detection

In the experiments for active outlier detection, we set the budget B equal to 2 \times \#outliers and the query batch to b=4 at each iteration. The learning curves are shown in Figures 3 and 4.

FIGURE 3. Learning curves of the first 6 datasets: AUC vs. query iterations and number of true outliers vs. query iterations. For stochastic methods, we run the experiments 10 times and report only the mean for clarity.

FIGURE 4. Learning curves of the last 6 datasets: AUC vs. query iterations and number of true outliers vs. query iterations. For stochastic methods, we run the experiments 10 times and report only the mean for clarity.

As indicated in the figures, our method achieves the best performance on both measurements for most datasets. When measured by the number of queried true outliers, FBiForest is the closest to our method. However, FBiForest suffers a large drop in AUC on many datasets as more labeled data are obtained, and the other baselines have the same problem. This result shows that our method better balances supervised and unsupervised information; hence, it induces a better detection model given the same query budget. For the Mammography and Satellite datasets, some baselines are better than our method when measured by AUC. We attribute this to those methods being discriminative, whereas our method is data-based and hence more fine-grained. For a discriminative model, a slight change can considerably affect the final result; in some cases the change achieves a considerable performance improvement, but the impact may also be negative. The same reason explains why our method is more stable. Compared with occasional performance improvements, we believe that avoiding negative cases is more important for outlier detection.

SECTION VI.

Conclusion

In this paper, we first proposed a graph-based semi-supervised outlier detection method (GSSOD), and then we proposed an active outlier detection method (GAOD) based on GSSOD. GSSOD adds a mechanism to incorporate labeled data into an unsupervised outlier detection method; although the unsupervised method we used is iForest, other methods can also be adopted. Extensive experiments show that our semi-supervised outlier detection method is comparable with the best of state-of-the-art approaches, and our active outlier detection method outperforms state-of-the-art methods in terms of both AUC and the number of true outliers queried. In future work, we will consider using recently proposed graph convolutional networks [40]–[42] in semi-supervised outlier detection. We will also develop advanced data query strategies for active outlier detection. Given that the active outlier detection in this study is pool-based, we will also consider extending our method to the stream setting.

