Introduction
As an important task in data mining, the purpose of outlier detection is to reveal data patterns that differ from the rest of the data [1]. An outlier can be defined as "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [2]. Another definition is given by Barnett and Lewis: "an outlier is an observation (or a set of observations) which appears to be inconsistent with the remainder of the given dataset" [3]. Outliers can be anomalies, novelties, noise, deviations, or exceptions [4]. An outlier usually represents a new perspective or a specific mechanism that attracts more interest than the normal instances. Therefore, outlier detection has been widely used in different domains, e.g., human activity recognition [5], credit card fraud detection [6], medical diagnosis [7], video detection [8], and fault diagnosis [9].
Generally, according to whether the dataset is labelled or not, outlier detection methods can be roughly divided into three categories, namely supervised methods [10], semi-supervised methods [11], and unsupervised methods [12]. Most of the datasets collected in real engineering contexts are unlabeled, and labeling them is problematic or prohibitively costly. Therefore, unsupervised outlier detection methods are very popular because they do not require a labelled training dataset. In recent decades, various unsupervised outlier detection techniques have been proposed, most of which are based on the nearest neighbors of data objects [13]. There exist two kinds of nearest neighbor concepts, i.e., the k-nearest neighbor (k-NN) and the reverse nearest neighbor (RNN), both of which characterize a data object through the objects closest to it.
Besides, in many real-world problems the datasets include both numerical and categorical attributes simultaneously, that is, datasets with mixed-valued attributes (DMAs). For example, unpredictable outliers often appear in the actual operation of warehousing systems, especially in emerging warehousing systems, which increases the liabilities of the warehousing industry [15]. The abnormal operation status of a warehousing system is described as an object with mixed-valued attributes so as to realize the health diagnosis of the warehouse system [16]. More and more scholars focus on outlier detection methods for DMAs to solve practical problems. However, most outlier detection methods based on the k-NN are developed to handle datasets with only numerical attributes [17]–[19]. Therefore, research on k-NN-based outlier detection methods for DMAs is of both theoretical and practical significance.
Furthermore, the random walk process has been widely used for a variety of information retrieval tasks, including web search [20], keyword extraction [21], and text summarization [22]. These methods usually build a network model for data objects and perform a random walk process on it to evaluate the centrality or importance of each data object. Moonesinghe and Tan [23] applied the random walk process to outlier detection and verified the accuracy of the detection results on real and synthetic datasets. The k-NN focuses on the local information of each data object and ignores possible internal connections across the whole dataset. The random walk process on the network model makes up for this drawback and emphasizes the overall or partial structure of the dataset. The combination of the two is therefore a natural way to exploit both local and global information when scoring outliers.
The main contributions of this paper are threefold.
We propose an adaptive search algorithm for the value of k based on the k-NN. It can automatically search the appropriate value of k according to the distribution characteristics of data objects in different datasets. This algorithm enables the unsupervised mechanism in the proposed outlier detection method for the DMA.
We combine the k-NN with a random walk process to construct the outlier score for mixed-valued data objects. The k-NN obtains the information around each data object, and the random walk process on the network model explores the relationships between data objects from the perspective of the whole dataset. In this way, we not only make full use of the local information of each mixed-valued data object, but also consider the outlier degree of each data object from the global perspective.
We validate the effectiveness, accuracy, and applicability of the proposed methodology against three related outlier detection methods on three UCI datasets of different data types: a dataset with numerical attributes, a dataset with categorical attributes, and a DMA. Several evaluation metrics, including precision, recall, rank power, and time consumption, are employed to evaluate the performance of the methods in the experiments.
The remainder of this paper is organized as follows. The related works are presented in Section 2. The proposed methodology is detailed in Section 3. The detailed experiment on three types of UCI datasets is implemented in Section 4. The conclusions and future work are provided in Section 5.
Related Works
In this section, two kinds of outlier detection methods related to this study are investigated: (1) outlier detection methods based on the k-NN and (2) outlier detection methods based on random walk process, which are listed in Table 1.
A. Outlier Detection Based on the k-NN
For decades, scholars have contributed a lot to outlier detection based on the k-NN. Dong and Yan [25] propose a multivariate outlier detection method based on the k-NN. Wang et al. [19] give each data object a local outlier score and a global outlier score based on the k-NN medoid to measure whether a data object is an outlier. Muthukrishnan [26] introduces the Reverse Nearest Neighbor (RNN), which lays a foundation for RNN-based outlier detection methods. Uttarkabat et al. [13] use the statistical information of the RNN and the k-NN, and define a distance factor to measure the outlier degree of data objects. The generalized form of the RNN is the reverse k-NN (Rk-NN). Cao et al. [27] propose a novel stream outlier detection method based on the Rk-NN to avoid multiple scans of the dataset and to capture concept drift. Batchanaboyina and Devarakonda [28] propose an efficient outlier detection approach using improved monarch butterfly optimization and mutual nearest neighbors.
However, the value of k in these methods must be specified manually in advance, and the detection results are sensitive to this choice.
In conclusion, the k-NN, which can effectively express the local information around data objects, is widely used in outlier detection methods, and its effectiveness has been proved by a large number of studies. However, the selection of the value of the parameter k still depends heavily on experience or repeated trials, which limits the practicality of these methods.
B. Outlier Detection Based on Random Walk Process
In order to broaden the application fields of the random walk process, Moonesinghe and Tan [23] applied it to outlier detection and proposed two strategies for constructing networks, using appropriate similarity measures and the number of shared neighbors, respectively. The accuracy of the detection results was verified on real and synthetic datasets. Subsequently, researchers further explored outlier detection methods based on the random walk process, e.g., [31], [32], [24], [33], and [34]. In these methods, each data object is modelled as a node in a network, and the relationship between objects is defined as an edge that connects the nodes. The nodes and edges of the network are analyzed by deeply mining the characteristics of its topological structure to define the outlier score of each data object [35].
From the above, the keys of this kind of method are how to build the network model and how to determine the index that measures the outlier degree of data objects. Scholars mostly build the network model based on the neighbourhood system of each data object; the combination of the k-NN and the random walk process has so far only been used to deal with datasets with numerical attributes, and the outlier detection results are affected by the value of k.
Methodology
The proposed methodology will be discussed in detail in this section, which includes three parts: a) data preprocessing, b) constructing an adaptive k-NN global weighted and directed network model, and c) performing a random walk process to identify outliers.
A. Data Preprocessing
Let X = {x_1, x_2, ..., x_n} be a dataset of n data objects, each described by d attributes, of which d_1 are numerical and d_2 are categorical (d = d_1 + d_2).
The distribution of the original data objects differs across attributes. An attribute has a great impact on the overall deviation degree of data objects when the distribution of data objects on it is highly dispersed. Therefore, the entropy weight method [36] is applied to objectively weight each numerical and categorical attribute: \begin{align*} w^{j^{N}}=&\frac {1-E^{j^{N}}}{d-\sum \nolimits _{j^{N}=1}^{d_{1}} E^{j^{N}} -\sum \nolimits _{j^{C}=1}^{d_{2}} E^{j^{C}}},\tag{1}\\ w^{j^{C}}=&\frac {1-E^{j^{C}}}{d-\sum \nolimits _{j^{N}=1}^{d_{1}} E^{j^{N}} -\sum \nolimits _{j^{C}=1}^{d_{2}} E^{j^{C}}},\tag{2}\end{align*}
\begin{equation*} E^{j^{N}}=-\frac {1}{\mathrm {ln}n}\sum \nolimits _{i=1}^{n} {\frac {x_{i}^{{j^{N}}^{\prime }}}{\sum \nolimits _{i=1}^{n} x_{i}^{{j^{N}}^{\prime }} }\mathrm {ln}\frac {x_{i}^{{j^{N}}^{\prime }}}{\sum \nolimits _{i=1}^{n} x_{i}^{{j^{N}}^{\prime }}}},\tag{3}\end{equation*}
The values of the data objects on a categorical attribute are expressed by \begin{equation*} X^{j^{C}}=\{x_{1}^{j^{C}},\quad x_{2}^{j^{C}},\ldots,x_{c}^{j^{C}}\},\tag{4}\end{equation*}
\begin{equation*} E^{j^{C}}=-\frac {1}{\mathrm {ln}n}\sum \nolimits _{l=1}^{c} {\frac {F_{l}^{j^{C}}}{n}\mathrm {ln}\frac {F_{l}^{j^{C}}}{n}}.\tag{5}\end{equation*} where c is the number of distinct values of the categorical attribute and F_l denotes the frequency of its l-th value.
In order to eliminate the differences in dimensions and orders of magnitude of the original data objects on numerical attributes, the min-max normalization method [37], which linearly transforms the variables, is used to process the numerical data objects; the standardized values are recorded as \begin{equation*} x_{i}^{j^{N}}=\frac {x_{i}^{{j^{N}}^{\prime }}-\min \limits _{i}\left \{{ x_{i}^{{j^{N}}^{\prime }} }\right \}}{\max \limits _{i}\left \{{x_{i}^{{j^{N}}^{\prime }} }\right \}-\min \limits _{i}\left \{{x_{i}^{{j^{N}}^{\prime }} }\right \}},\tag{6}\end{equation*}
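As an illustration of the preprocessing step, the following sketch (not the authors' code; the function names and array layout are our assumptions) computes the entropy weights of Eqs. (1)–(5) and the min-max normalization of Eq. (6) with NumPy:

```python
import numpy as np

def entropy_numerical(col):
    # Eq. (3): entropy of a numerical attribute, using the value shares
    # x_i / sum(x) as probabilities (original values are assumed positive).
    n = len(col)
    p = col / col.sum()
    p = p[p > 0]                          # zero shares contribute nothing
    return -(p * np.log(p)).sum() / np.log(n)

def entropy_categorical(col):
    # Eq. (5): entropy from the category frequencies F_l / n.
    n = len(col)
    _, counts = np.unique(col, return_counts=True)
    p = counts / n
    return -(p * np.log(p)).sum() / np.log(n)

def entropy_weights(num_cols, cat_cols):
    # Eqs. (1)-(2): each attribute is weighted by 1 - E, normalized over
    # all d = d1 + d2 attributes; the resulting weights sum to one.
    E_num = [entropy_numerical(c) for c in num_cols]
    E_cat = [entropy_categorical(c) for c in cat_cols]
    d = len(E_num) + len(E_cat)
    denom = d - sum(E_num) - sum(E_cat)
    return ([(1 - e) / denom for e in E_num],
            [(1 - e) / denom for e in E_cat])

def min_max(col):
    # Eq. (6): min-max normalization onto [0, 1].
    return (col - col.min()) / (col.max() - col.min())
```

Note that, by construction, the attribute weights returned by `entropy_weights` sum to one, so the distance of Eq. (8) stays bounded when the normalized attribute differences lie in [0, 1].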
B. Constructing an Adaptive k-NN Global Weighted and Directed Network Model
The k-NN of the data object x_i is defined as \begin{align*}&\hspace {-.5pc}k-NN(x_{i})=\{x_{m}{\it \vert } (x_{m}\in X)\cap D_{im} \le D_{i}^{\ast }\} \\&(i, m=1, 2, \ldots, n, m \ne i),\tag{7}\end{align*}
\begin{equation*} D_{im}=\sqrt {\sum \nolimits _{j^{N}=1}^{d_{1}} {w^{j^{N}}\left ({d_{im}^{j^{N}} }\right)^{2}} +\sum \nolimits _{j^{C}=1}^{d_{2}} {w^{j^{C}}\left ({d_{im}^{j^{C}} }\right)^{2}}},\tag{8}\end{equation*}
\begin{align*} d_{im}^{j^{N}}=&\left |{ x_{i}^{j^{N}}{-x}_{m}^{j^{N}} }\right |,\tag{9}\\ d_{im}^{j^{C}}=&\begin{cases} \displaystyle 0, & {x}_{i}^{j^{C}}= x_{m}^{j^{C}}; \\ \displaystyle 1, & {x}_{i}^{j^{C}}\ne x_{m}^{j^{C}}. \end{cases}\tag{10}\end{align*}
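The weighted Heterogeneous Euclidean-Overlap Metric of Eqs. (8)–(10) can be sketched as follows (a hedged illustration; the function name and the split into numerical/categorical arrays are our assumptions, and the weights come from Eqs. (1)–(2)):

```python
import numpy as np

def heom_distance(xi_num, xm_num, xi_cat, xm_cat, w_num, w_cat):
    """Weighted Heterogeneous Euclidean-Overlap Metric, Eqs. (8)-(10)."""
    # Eq. (9): absolute difference on each (normalized) numerical attribute
    d_num = np.abs(np.asarray(xi_num, dtype=float)
                   - np.asarray(xm_num, dtype=float))
    # Eq. (10): overlap metric -- 0 if the categories match, 1 otherwise
    d_cat = (np.asarray(xi_cat) != np.asarray(xm_cat)).astype(float)
    # Eq. (8): entropy-weighted Euclidean combination of both parts
    return float(np.sqrt((np.asarray(w_num) * d_num ** 2).sum()
                         + (np.asarray(w_cat) * d_cat ** 2).sum()))
```

With normalized numerical values and weights that sum to one, the distance lies in [0, 1], which is what allows Eq. (12) to use 1 - D_im as an edge weight.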
The value of k is searched adaptively by Algorithm 1 according to the distribution characteristics of the data objects.
Algorithm 1 Automatic Search Algorithm for the Value of k
Input: the dataset X.
Output: the adaptive value of k.
(Only the control-flow skeleton of Algorithm 1 survives in this copy: two nested for loops that sort the pairwise distances, a while loop containing a for loop and a foreach with an if/else branch, and a return statement.)
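Since the body of Algorithm 1 is not reproduced above, the following is only a plausible sketch of an adaptive-k search, not the authors' procedure: it grows k until the k-NN graph, viewed as undirected, connects all data objects, which is one common criterion for choosing k automatically from the data distribution.

```python
import numpy as np

def adaptive_k(D, k_max=None):
    """Plausible sketch of an adaptive-k search (an assumption, not the
    authors' Algorithm 1): return the smallest k for which the k-NN graph,
    viewed as undirected, connects all n data objects.  D is the n x n
    distance matrix of Eq. (8)."""
    n = len(D)
    k_max = k_max or n - 1
    Dm = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(Dm, np.inf)          # exclude each object from its own k-NN
    order = np.argsort(Dm, axis=1)        # neighbors of each object, nearest first
    for k in range(1, k_max + 1):
        adj = [set() for _ in range(n)]
        for i in range(n):
            for m in order[i, :k]:        # x_m in k-NN(x_i), Eq. (7)
                adj[i].add(int(m))
                adj[int(m)].add(i)
        seen, stack = {0}, [0]            # DFS connectivity check
        while stack:
            for m in adj[stack.pop()]:
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        if len(seen) == n:
            return k
    return k_max
```

A connected neighborhood graph is a natural prerequisite here, because the random walk of Section 3.C must be able to reach every data object.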
Then, an adaptive k-NN weighted and directed network model (shown in Fig. 1) is constructed, in which the edge from x_i to x_m is indicated by \begin{align*} E_{im}=\begin{cases} \displaystyle 1, &\mathrm {if}~x_{m}\in k-NN(x_{i}\mathrm {);} \\ \displaystyle 0, &\mathrm {otherwise.} \end{cases}\tag{11}\end{align*}
The adaptive k-NN weighted and directed network of the dataset can be represented by an adjacency matrix A under the rule that the weight of an existing edge is the similarity 1 - D_im: \begin{align*} A_{im}=\begin{cases} \displaystyle 1-D_{im},&if~E_{im}\mathrm {=1;} \\ \displaystyle 0,&if~E_{im}\mathrm {=0.} \\ \displaystyle \end{cases}\tag{12}\end{align*}
In this network, there is a directed edge from each node to each of its k nearest neighbors, with the weight defined by Eq. (12). As shown in Fig. 1, a virtual global node G is further added and connected to every node by bidirectional edges, which yields the global network model.
The edge weights of the bidirectional edges between each node x_i and the global node G are defined as \begin{align*} \omega _{iG}=&\left \{{A_{im},\left ({A_{im}\mathrm {\ne 0} }\right) }\right \},\tag{13}\\ \omega _{Gi}=&\sum \nolimits _{m=1}^{n} A_{im}.\tag{14}\end{align*}
The adjacency matrix A_G of the adaptive k-NN global weighted and directed network is then \begin{align*} \boldsymbol {A}_{G}=\left [{ {\begin{array}{cccccccccccccccccccc} 0 &\quad A_{12} &\quad \cdots &\quad A_{1n} &\quad \omega _{1G}\\ A_{21} &\quad 0 &\quad \cdots &\quad A_{2n} &\quad \omega _{2G}\\ \vdots &\quad \vdots &\quad \ddots &\quad \vdots &\quad \vdots \\ A_{n1} &\quad A_{n2} &\quad \cdots &\quad 0 &\quad \omega _{nG}\\ \omega _{G1} &\quad \omega _{G2} &\quad \cdots &\quad \omega _{Gn} &\quad 0\\ \end{array}} }\right].\tag{15}\end{align*}
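The construction of Eqs. (11)–(15) can be sketched as follows. Note one loud assumption: Eq. (13) appears garbled in this copy, so the node-to-G weight is taken here to be the mean of the node's nonzero outgoing weights, which may differ from the authors' exact definition.

```python
import numpy as np

def build_global_network(D, k):
    """Sketch of Eqs. (11)-(15): a k-NN weighted digraph plus a virtual
    global node G.  D is the distance matrix of Eq. (8), assumed to lie in
    [0, 1] so that 1 - D_im is a similarity."""
    n = len(D)
    Dm = np.asarray(D, dtype=float).copy()
    np.fill_diagonal(Dm, np.inf)          # a node is not its own neighbor
    A = np.zeros((n, n))
    for i in range(n):
        for m in np.argsort(Dm[i])[:k]:   # Eq. (11): x_m in k-NN(x_i)
            A[i, m] = 1.0 - Dm[i, m]      # Eq. (12): edge weight = similarity
    # Node -> G weights: ASSUMED form of the garbled Eq. (13)
    w_iG = np.array([row[row > 0].mean() if (row > 0).any() else 0.0
                     for row in A])
    w_Gi = A.sum(axis=1)                  # Eq. (14): G -> node weights
    A_G = np.zeros((n + 1, n + 1))        # Eq. (15): bordered adjacency matrix
    A_G[:n, :n] = A
    A_G[:n, n] = w_iG
    A_G[n, :n] = w_Gi
    return A_G
```

The last row and column of `A_G` hold the global node's incident weights, matching the bordered matrix of Eq. (15).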
C. Performing a Random Walk Process to Identify Outliers
In a specific network, a random walk process is defined as a stochastic process in which a random walker moves from the current node x_i to a neighboring node x_m with the transition probability \begin{align*} p_{im}^{t}={p}_{im}^{t-1}= p\left ({x^{t}=x_{m}\mathrm {\vert }x^{t-1}=x_{i} }\right)\left ({\forall i:\sum \nolimits _{m} p_{im} =1 }\right). \\{}\tag{16}\end{align*}
The adaptive k-NN global weighted and directed network represented by the adjacency matrix A_G is converted into a transition probability matrix \begin{equation*} \mathbf {P}_{G} = \mathbf {A}_{G} \times \mathbf {D}^{-1},\tag{17}\end{equation*} where D is the diagonal matrix whose j-th diagonal element is the j-th column sum of A_G, so that each column of P_G sums to one.
The random walker is expected to jump to a neighbor with a greater similarity to the current node at the next time step. Using the edge weights between nodes to customize the transition probability of the random walk process can achieve the above effect and effectively complete the transition process between nodes. Starting from any state at any time, the random walk process will converge to an equilibrium state after a certain number of iterations, and the probability of the random walker at each node will not change. Using an iterative method, the stationary distribution vector of the random walk process on the adaptive k-NN global weighted and directed network can be estimated. The iterative process is formalized as follows:\begin{equation*} \boldsymbol {\pi }^{t} = \boldsymbol {\pi }^{t-1} \times \mathbf {P}_{G},\tag{18}\end{equation*}
After the random walk process reaches equilibrium, each element π_i of the stationary distribution vector measures the probability that the random walker stays at node x_i; a node that is rarely visited is more likely to be an outlier. The outlier score of each data object x_i is therefore defined as \begin{equation*} \Psi _{i}=\frac {1}{\pi _{i}^{t}+\frac {\omega _{Gi}}{\sum \nolimits _{i=1}^{n} \omega _{Gi}}\pi _{G}^{t}}.\tag{19}\end{equation*}
In order to ensure the convergence of the random walk process, the transition matrix must correspond to an irreducible and aperiodic chain; the bidirectional edges through the global node G guarantee that every node can reach every other node.
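The random walk of Eqs. (17)–(19) can be sketched with a simple power iteration. Two points are assumptions of this sketch rather than statements of the paper: since P_G = A_G D^{-1} (Eq. (17)) is column-stochastic when D holds the column sums, the fixed point is written here in the column-vector form pi = P_G pi (the transpose of the row-vector iteration in Eq. (18)); and the node-to-G weights are read from the last row of A_G as in Eq. (15).

```python
import numpy as np

def outlier_scores(A_G, tol=1e-10, max_iter=1000):
    """Sketch of Eqs. (17)-(19).  A_G is the bordered (n+1) x (n+1) matrix
    of Eq. (15); its last row holds the G -> node weights omega_Gi."""
    col_sums = A_G.sum(axis=0)
    P_G = A_G / col_sums                  # Eq. (17): normalize each column
    n1 = len(A_G)
    pi = np.full(n1, 1.0 / n1)            # uniform initial distribution
    for _ in range(max_iter):             # Eq. (18): power iteration
        new_pi = P_G @ pi
        if np.abs(new_pi - pi).sum() < tol:
            break
        pi = new_pi
    pi_nodes, pi_G = pi[:-1], pi[-1]
    w_Gi = A_G[-1, :-1]                   # omega_Gi, Eq. (14)
    # Eq. (19): nodes holding a small share of the stationary probability
    # receive a large outlier score.
    return 1.0 / (pi_nodes + (w_Gi / w_Gi.sum()) * pi_G)
```

On a symmetric toy network where every node is equally central, all scores come out equal, which is the expected behavior for a dataset without outliers.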
The proposed outlier detection methodology is presented in Algorithm 2.
Algorithm 2 Calculate the Outlier Score for Each Data Object
Input: the dataset X and the adaptive value of k.
Output: the outlier score of each data object.
(Only the control-flow skeleton of Algorithm 2 survives in this copy; its loops compute k-NN(x_i) for each data object, fill the network matrices, and derive the stationary distribution and the outlier scores before the return statement.)
The proposed method first searches for the value of k with Algorithm 1, then constructs the adaptive k-NN global weighted and directed network model, and finally performs the random walk process on the network to obtain the outlier score of each data object.
Experiments
The proposed method is experimentally verified on three UCI datasets against three related outlier detection methods: the neighborhood information entropy-based outlier detection method (NIEOD) [39], the outlier detection using random walks (OutRank) [23], and the virtual outlier score model (VOS) [33]. The experimental environment is shown in Table 2.
The NIEOD is proposed to detect outliers in datasets with numerical, categorical, and mixed-valued data by using the neighborhood information system and information entropy. The heterogeneous distance and a self-adapting radius are applied to determine the neighborhood information system of the dataset; the neighborhood information entropy, relative neighborhood entropy, deviation degree, and outlier factor are then constructed from the neighborhood information around each data object to measure its outlier degree. It extends outlier detection methods based on traditional distance and rough sets, and is more applicable to datasets with uncertainty mechanisms. There are two parameters in this method, namely the neighborhood radius adjustment parameter and the judgement threshold.
The OutRank explores the application of the random walk process in outlier detection for datasets with numerical data; the random walk process can effectively capture not only uniformly dispersed outliers but also small clusters of outliers. It builds a weighted and undirected neighborhood network model and uses the cosine similarity between objects and the shared-nearest-neighbor density, respectively, to define the similarity metric. It has yielded higher detection rates with lower false alarm rates than distance- and density-based outlier detection methods on both real and synthetic datasets.
The VOS improves the outlier detection methods based on the random walk process, combining the k-NN with the network model to identify outliers in datasets with numerical attributes. It builds the network model from the top-k nearest neighbors of each data object.
The proposed method in this paper can identify the outliers in a dataset with numerical, categorical, or mixed-valued data, integrating the advantages of the above methods and overcoming their defects to a certain extent. Moreover, the parameter k, which the above methods require to be set in advance, is determined adaptively according to the distribution characteristics of the dataset.
A. UCI Datasets
The Glass Identification dataset, the Hayes-Roth dataset, and the Lymphography dataset from the UCI machine learning repository [40] are applied to evaluate the performance of the proposed method. These datasets are labelled for classification and the rare classes are known; the data objects in the rare classes are considered outliers.
The Glass Identification dataset contains 214 data objects, which are described by 1 name attribute, 9 numerical attributes, and 1 class attribute. The 214 data objects are divided into 6 categories with 70, 76, 17, 13, 9, and 29 data objects, respectively. Class 5 has the fewest objects and can be treated as the rare class, whose data objects are outliers.
The Hayes-Roth dataset contains 132 data objects with 1 name attribute, 4 categorical attributes, and 1 class attribute. The 132 data objects are divided into 3 categories with 51, 51, and 30 data objects, respectively. Class 3 has the fewest objects and can be treated as the rare class, whose data objects are outliers.
The Lymphography dataset contains 148 data objects, which are described by 3 numerical attributes, 15 categorical attributes, and 1 class attribute. The 148 data objects are divided into 4 categories with 2, 81, 61, and 4 data objects, respectively. Classes 1 and 4 have the fewest objects and can be treated as the rare classes, whose data objects are outliers.
The data distribution of the above three datasets is shown in Fig. 2.
B. Evaluation Metrics for the Performance of Methods
In order to quantitatively analyze the experimental results of different outlier detection methods, three traditional information system quality metrics, “precision” (Pre), “recall” (Rec), and “rank power” (RP), are used in this paper [41].
Pre calculates the proportion of real outliers among the first z ranked data objects, \begin{equation*} {\textit{Pre}} = \frac {N_{identified}}{z},\tag{20}\end{equation*} where N_identified is the number of real outliers among the first z data objects.
Rec measures the percentage of real outliers identified in the first z ranked data objects relative to the total number of real outliers N_real in the dataset, \begin{equation*} {\textit{Rec}} = \frac {N_{identified}}{N_{real}},\tag{21}\end{equation*}
Pre and Rec estimate the accuracy of the detection results of an outlier detection method, but neither can fully compare the quality of the detection results of different methods. For example, the Pre and Rec of two real outliers identified at the first two positions are the same as those at any other two positions in the first z, although the former ranking is clearly better. RP remedies this by taking into account the positions O_L at which the real outliers appear: \begin{equation*} {\textit{RP}} = N_{identified}\frac {N_{identified}+1}{2\sum \nolimits _{L=1}^{N_{identified}} O_{L}},\tag{22}\end{equation*}
Pre and Rec are positively correlated with the effectiveness of outlier detection methods. When Pre and Rec are the same, the larger the RP is, the more effective the outlier detection method is.
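The three metrics can be computed directly from a ranked list, as in the following sketch (the function name and argument layout are ours, not from the paper):

```python
def evaluate_ranking(ranked_ids, real_outliers, z):
    """Pre, Rec, and RP of Eqs. (20)-(22) for the first z objects of a
    ranked list.  O_L is the 1-based position of the L-th real outlier
    found within the top z."""
    real = set(real_outliers)
    positions = [pos for pos, obj in enumerate(ranked_ids[:z], start=1)
                 if obj in real]
    m = len(positions)                            # N_identified
    pre = m / z                                   # Eq. (20)
    rec = m / len(real)                           # Eq. (21)
    rp = m * (m + 1) / (2 * sum(positions)) if m else 0.0   # Eq. (22)
    return pre, rec, rp
```

For a fixed Pre and Rec, RP reaches its maximum of 1 exactly when the identified real outliers occupy the very top of the ranking, which is why it can break ties between methods.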
C. Experiment Results
Here, the detection results of the proposed method are compared with those of the other three methods on the three types of UCI datasets using the above metrics. The four methods do not classify the data objects into normal objects and outliers directly. Instead, each data object is assigned a score measuring its outlier degree, and the data objects are then sorted (in ascending or descending order) according to the judgment mechanism of each method. The first z ranked data objects are regarded as the detected outliers.
The scores of the data objects produced by the proposed method, the NIEOD, and the VOS are sorted in descending order, while those produced by the OutRank are sorted in ascending order, considering the different mechanism by which each method constructs its outlier-degree index. As z increases, more real outliers are covered by the first z ranked data objects.
There are some supplementary explanations of the detection results in Table 3, Table 4 and Table 5. “Top ratio” indicates the proportion of the first z ranked data objects to the size of the whole dataset.
For better visualization, the detection results on the different datasets in Table 3, Table 4 and Table 5 are illustrated in Fig. 3.
The analyses of the detection results based on the evaluation metrics for each method on the three datasets are shown in Table 6, Table 7 and Table 8, and the corresponding results are illustrated in Fig. 5, Fig. 6, and Fig. 7, respectively.
Analysis results of the four methods on the Glass Identification dataset.
The running time of the four outlier detection methods on the three datasets is exhibited in Table 9.
D. Discussion and Analysis
The acquisition details of the experiment results in Section 4.3 are as follows:
1) Glass Identification Dataset
The detection results of the four methods on this dataset are shown in Table 3 and Fig. 3.
2) Hayes-Roth Dataset
The proposed method detects all real outliers in the dataset when the first 49 (37%) data objects are selected, while the other three methods detect only 14, 10, and 14 real outliers, respectively, as shown in Table 4 and Fig. 3.
3) Lymphography Dataset
The detection results of the four methods on this dataset are reported in Table 5 and Fig. 3.
Based on the above observations, the following conclusions can be drawn:
The proposed method can be used to detect outliers in different types of datasets, including datasets with numerical, categorical, and mixed-valued attributes.
The adaptive value of k is obtained automatically according to the distribution characteristics of the dataset, which ensures higher quality of outlier mining and reduces the cost of parameter adjustment.
In the proposed method, the k-NN is used to mine the local information of data objects, and a customized random walk process is used to explore the long-term correlation between related data objects from a global perspective. The combination of the k-NN and the random walk process improves the accuracy of the detection results.
Conclusion
In this paper, an unsupervised outlier detection method based on an adaptive k-NN global network is proposed to identify the outliers in a DMA. First, the weights of the numerical and categorical attributes are determined to reduce the uneven influence of different attributes, and the spatial distance between mixed-valued data objects is measured with the Heterogeneous Euclidean-Overlap Metric. Second, an adaptive search algorithm for the appropriate value of k is designed, and an adaptive k-NN global weighted and directed network model is constructed. Finally, a random walk process is performed on this network, and the resulting stationary distribution is used to assign an outlier score to each data object.
However, calculating the stationary distribution vector is time-consuming, which limits the application of random-walk-based outlier detection methods to large-scale and streaming datasets. The main work in the next stage is to apply this method to practical application scenarios with large datasets.