
Unsupervised Outlier Detection for Mixed-Valued Dataset Based on the Adaptive k-Nearest Neighbor Global Network


Abstract:

Outlier detection aims to reveal data patterns different from existing data. Benefiting from its good robustness and interpretability, the outlier detection method for numerical datasets based on the k-Nearest Neighbor (k-NN) network has attracted much attention in recent years. However, the datasets produced in many practical contexts tend to contain both numerical and categorical attributes, that is, they are datasets with mixed-valued attributes (DMAs). Moreover, the selection of k is an issue worthy of attention for unlabeled datasets. Therefore, an unsupervised outlier detection method for DMAs based on an adaptive k-NN global network is proposed. First, an adaptive search algorithm for the appropriate value of k that considers the distribution characteristics of datasets is introduced. Next, the distance between mixed-valued data objects is measured based on the Heterogeneous Euclidean-Overlap Metric, and the k-NN of each data object is obtained. Then, an adaptive k-NN global network is constructed based on the neighborhood relationships between data objects, and a customized random walk process is executed on it to detect outliers by using the transition probability to limit the behavior of the random walker. Finally, the effectiveness, accuracy, and applicability of the proposed method are demonstrated by a detailed experiment.
Published in: IEEE Access ( Volume: 10)
Page(s): 32093 - 32103
Date of Publication: 22 March 2022
Electronic ISSN: 2169-3536

SECTION I.

Introduction

As an important task in data mining, the purpose of outlier detection is to reveal data patterns different from existing data [1]. An outlier can be defined as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism [2]”. Another definition is given by Barnett and Lewis: “an outlier is an observation (or a set of observations) which appears to be inconsistent with the remainder of the given dataset [3]”. Outliers can be anomalies, novelties, noise, deviations, and exceptions [4]. An outlier usually represents a new perspective or a specific mechanism that attracts more interest than the normal instances. Therefore, outlier detection has been widely used in different domains, e.g., human activity recognition [5], credit card fraud detection [6], medical diagnosis [7], video detection [8], and fault diagnosis [9].

Generally, according to whether the dataset is labeled or not, outlier detection methods can be roughly divided into three categories, namely supervised methods [10], semi-supervised methods [11], and unsupervised methods [12]. Most of the datasets collected in real engineering contexts are unlabeled, and labeling them is problematic or unacceptably costly. Therefore, unsupervised outlier detection methods are popular because they do not require a labeled training dataset. In recent decades, various unsupervised outlier detection techniques have been proposed, most of which are based on the nearest neighbors of data objects [13]. There are two kinds of nearest neighbor concepts, the \varepsilon-Nearest Neighbor and the k-Nearest Neighbor (k-NN) [14], of which the k-NN is more widely adopted. The core idea of the k-NN is to select a specific k for a dataset and find, for each data object, the k data objects with the greatest similarity or the shortest distance from it. When the value of k is too large, the neighbors of each data object will contain useless information or even lead to errors in subsequent algorithms. Conversely, if the value of k is too small, each data object will have fewer neighbors and contain limited useful information, which will reduce the accuracy of the algorithm. For different datasets, the value of k is often different in order to ensure the optimal performance of the outlier detection results. Therefore, a search algorithm for the appropriate value of k is proposed to automatically determine the value of k for datasets with different distributions.

Besides, for many real-world problems the datasets often include both numerical and categorical attributes simultaneously, that is, they are datasets with mixed-valued attributes (DMAs). For example, some unpredictable outliers often appear in the actual operation of warehousing systems, especially in emerging warehousing systems, which increases the liabilities of the warehousing industry [15]. The abnormal operation status in a warehousing system is described as an object with mixed-valued attributes so as to realize the health diagnosis of the warehouse system [16]. More and more scholars focus on outlier detection methods for DMAs to solve practical problems. However, most outlier detection methods based on the k-NN are developed to handle datasets with only numerical attributes [17]–[19]. Therefore, the analysis and research of outlier detection methods for DMAs based on the k-NN is theoretically and practically significant.

Furthermore, the random walk process has been widely used for a variety of information retrieval tasks, including web search [20], keyword extraction [21], and text summarization [22]. These methods usually build a network model for data objects and perform a random walk process on it to evaluate the centrality or importance of each data object. Moonesinghe and Tan [23] applied the random walk process to outlier detection, and verified the accuracy of the detection results on real and synthetic datasets. The k-NN focuses on the local information of each data object and ignores the possible internal connections in the whole dataset. The random walk process on the network model makes up for this drawback and emphasizes the overall or partial structure of the dataset. The combination of the k-NN and a random walk process can effectively characterize both the relationships between data objects and the hidden connections in the dataset [24]. Motivated and inspired by the above observations, an unsupervised outlier detection method based on an adaptive k-NN network is proposed in this study to handle DMAs. First, considering the influence of different attributes on data objects, a heterogeneous distance function is used to measure the spatial distance between mixed-valued data objects in a DMA. Second, an adaptive search algorithm for the appropriate value of k is proposed according to the distribution characteristics of the dataset, and the k-NN of each mixed-valued data object is obtained. Third, an adaptive k-NN global weighted and directed network model is constructed based on the neighborhood relationships between data objects, and a customized random walk process is implemented on it. After the random walk process converges to an equilibrium state, the elements of the stationary distribution vector are used to construct the outlier score of each data object. Finally, the proposed method is compared with other existing related outlier detection methods on three different types of datasets from the University of California Irvine (UCI) Machine Learning Repository, and the results show that the proposed method is more effective, accurate, and applicable.

The main contributions of this paper are threefold.

  1. We propose an adaptive search algorithm for the value of k based on the k-NN. It can automatically search for the appropriate value of k according to the distribution characteristics of data objects in different datasets. This algorithm enables the unsupervised mechanism in the proposed outlier detection method for DMAs.

  2. We combine the k-NN with a random walk process to construct the outlier score for mixed-valued data objects. The k-NN obtains the information around each data object, and the random walk process on the network model explores the relationships between data objects from the perspective of the whole dataset. In this way, we can not only make full use of the local information of each mixed-valued data object, but also consider the outlier degree of each data object from the global perspective.

  3. We validate the effectiveness, accuracy, and applicability of the proposed methodology against three other related outlier detection methods on three UCI datasets with different data types: a dataset with numerical attributes, a dataset with categorical attributes, and a DMA. Several evaluation metrics, including precision, recall, rank power, and time consumption, are employed to evaluate the performance of the methods in the experiments.

The remainder of this paper is organized as follows. The related works are presented in Section 2. The proposed methodology is detailed in Section 3. The detailed experiment on three types of UCI datasets is implemented in Section 4. The conclusions and future work are provided in Section 5.

SECTION II.

Related Works

In this section, two kinds of outlier detection methods related to this study are investigated: (1) outlier detection methods based on the k-NN and (2) outlier detection methods based on random walk process, which are listed in Table 1.

TABLE 1 Descriptions of the Related Works

A. Outlier Detection Based on the k-NN

For decades, scholars have contributed a lot to outlier detection based on the k-NN. Dong and Yan [25] propose a multivariate outlier detection method based on the k-NN. Wang et al. [19] give each data object a local outlier score and a global outlier score based on the k-NN medoid to measure whether a data object is an outlier. Muthukrishnan [26] introduces the Reverse Nearest Neighbor (RNN), which lays a foundation for RNN-based outlier detection methods. Uttarkabat et al. [13] use the statistical information of the RNN and the k-NN and define a distance factor to measure the outlier degree of data objects. The generalized form of the RNN is the reverse k-NN (Rk-NN). Cao et al. [27] propose a novel stream outlier detection method based on the Rk-NN to avoid multiple scans of the dataset and to capture concept drift. Batchanaboyina and Devarakonda [28] propose an efficient outlier detection approach using improved monarch butterfly optimization and mutual nearest neighbors.

However, the value of k is still an important factor affecting the accuracy of outlier detection results in this kind of method. In order to deal with the problem of parameter sensitivity, some novel strategies have been proposed. Inspired by the observation that outlying objects are less easily selected than inlying objects in blind random sampling, Ha et al. [29] alleviate the effect of the parameter k. However, the algorithm still has parameters, such as the number of iterations and the sampling size, and so does not fundamentally eliminate the dependence on parameters. Ning et al. [30] propose a parameter selection method based on a mutual neighbor graph, but this comes at the expense of the accuracy and efficiency of identification.

In conclusion, the k-NN, which can effectively express the local information around data objects, is widely used in outlier detection methods, and its effectiveness has been proved by a large number of studies. However, the selection of the value of the parameter k is still an issue that needs to be tackled.

B. Outlier Detection Based on Random Walk Process

In order to broaden the application fields of the random walk process, Moonesinghe and Tan [23] apply it to outlier detection and propose two strategies for constructing networks, using appropriate similarity measures and the number of shared neighbors, respectively. The accuracy of the detection results is verified on real and synthetic datasets. Afterwards, researchers further explored outlier detection methods based on the random walk process, e.g., [31], [32], [24], [33], and [34]. In these methods, each data object is modelled as a node in a network, and the relationship between objects is defined as an edge that connects the nodes. The nodes and edges in a network are analyzed by deeply mining the characteristics of the topological structure to define the outlier score of each data object [35].

From the above, the key issues for this kind of method are how to build the network model and how to determine the index that measures the outlier degree of data objects. Scholars mostly build the network model based on the neighbourhood system of each data object, the combination of the k-NN and the random walk process has only been used to deal with datasets with numerical attributes, and the outlier detection results are affected by the value of k. Therefore, this paper proposes an unsupervised outlier detection method that combines the k-NN and the random walk process and can deal with DMAs.

SECTION III.

Methodology

The proposed methodology will be discussed in detail in this section, which includes three parts: a) data preprocessing, b) constructing an adaptive k-NN global weighted and directed network model, and c) performing a random walk process to identify outliers.

A. Data Preprocessing

Let X=\{x_{1}, x_{2},\ldots,x_{n}\} be a set of n mixed-valued data objects, and A=\{a^{1}, a^{2},\ldots,a^{d}\} be a set of attributes. Each data object is described by d_{1} numerical attributes and d_{2} categorical attributes, with d_{1}+d_{2}=d. A data object x_{i} (i=1,2,\ldots,n) is represented as x_{i}=[x_{i}^{j^{N}\prime}, x_{i}^{j^{C}}], where x_{i}^{j^{N}\prime} (j^{N}=1,2,\ldots,d_{1}) and x_{i}^{j^{C}} (j^{C}=1,2,\ldots,d_{2}) represent the values of the i^{\mathrm{th}} data object on the j^{\mathrm{th}} numerical attribute and the j^{\mathrm{th}} categorical attribute, respectively.

The distribution of the original data objects on each attribute is different. An attribute has a great impact on the overall deviation degree of data objects when the distribution of data objects on it is highly discrete. Therefore, the entropy weight method [36] is applied to objectively weight the d attributes according to the discreteness of the distribution of data objects on the attributes:
\begin{align*} w^{j^{N}}&=\frac{1-E^{j^{N}}}{d-\sum\nolimits_{j^{N}=1}^{d_{1}} E^{j^{N}}-\sum\nolimits_{j^{C}=1}^{d_{2}} E^{j^{C}}},\tag{1}\\ w^{j^{C}}&=\frac{1-E^{j^{C}}}{d-\sum\nolimits_{j^{N}=1}^{d_{1}} E^{j^{N}}-\sum\nolimits_{j^{C}=1}^{d_{2}} E^{j^{C}}},\tag{2}\end{align*}
where E^{j^{N}} and E^{j^{C}} are the information entropy of the data objects on the numerical attributes and the categorical attributes, respectively:
\begin{equation*} E^{j^{N}}=-\frac{1}{\ln n}\sum\nolimits_{i=1}^{n} \frac{x_{i}^{j^{N}\prime}}{\sum\nolimits_{i=1}^{n} x_{i}^{j^{N}\prime}}\ln\frac{x_{i}^{j^{N}\prime}}{\sum\nolimits_{i=1}^{n} x_{i}^{j^{N}\prime}},\tag{3}\end{equation*}
with the convention that the summand is 0 when x_{i}^{j^{N}\prime}=0.

The values of the data objects on a categorical attribute are expressed by
\begin{equation*} X^{j^{C}}=\{x_{1}^{j^{C}}, x_{2}^{j^{C}},\ldots,x_{c}^{j^{C}}\},\tag{4}\end{equation*}
where the elements of the set X^{j^{C}} are mutually exclusive, x_{l}^{j^{C}} (l=1,2,\ldots,c) indicates the l^{\mathrm{th}} value taken on the j^{\mathrm{th}} categorical attribute, and c indicates the number of distinct values taken on the j^{\mathrm{th}} categorical attribute. The appearance frequencies of the x_{l}^{j^{C}} in X are X/X^{j^{C}}=\{F_{1}^{j^{C}}, F_{2}^{j^{C}},\ldots,F_{c}^{j^{C}}\}. The information entropy of a group of data objects on a categorical attribute is calculated by
\begin{equation*} E^{j^{C}}=-\frac{1}{\ln n}\sum\nolimits_{l=1}^{c} \frac{F_{l}^{j^{C}}}{n}\ln\frac{F_{l}^{j^{C}}}{n}.\tag{5}\end{equation*}

In order to eliminate the differences in dimensions and orders of magnitude of the original data objects on the numerical attributes, the min-max normalization method [37], which linearly transforms the variables, is used to process the numerical data objects; the standardized values are recorded as
\begin{equation*} x_{i}^{j^{N}}=\frac{x_{i}^{j^{N}\prime}-\min\limits_{i}\{x_{i}^{j^{N}\prime}\}}{\max\limits_{i}\{x_{i}^{j^{N}\prime}\}-\min\limits_{i}\{x_{i}^{j^{N}\prime}\}},\tag{6}\end{equation*}
where \max_{i}\{x_{i}^{j^{N}\prime}\} and \min_{i}\{x_{i}^{j^{N}\prime}\} represent the maximum and minimum values of the data objects on the j^{\mathrm{th}} numerical attribute, respectively.
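As a concrete illustration of this preprocessing step, the following is a minimal Python/NumPy sketch of Eqs. (1)–(6); the function names and the array layout (one array per attribute type) are our own assumptions, not part of the original method.

import numpy as np

def entropy_numerical(col, n):
    # Eq. (3): entropy of one numerical attribute (assumes nonnegative raw values)
    p = col / col.sum()
    p = p[p > 0]                      # convention: the summand is 0 when x = 0
    return -(p * np.log(p)).sum() / np.log(n)

def entropy_categorical(col, n):
    # Eq. (5): entropy of one categorical attribute from value frequencies
    _, freq = np.unique(col, return_counts=True)
    p = freq / n
    return -(p * np.log(p)).sum() / np.log(n)

def entropy_weights(X_num, X_cat):
    # Eqs. (1)-(2): entropy weights for numerical and categorical attributes
    n = X_num.shape[0]
    E_num = np.array([entropy_numerical(X_num[:, j], n) for j in range(X_num.shape[1])])
    E_cat = np.array([entropy_categorical(X_cat[:, j], n) for j in range(X_cat.shape[1])])
    denom = (len(E_num) + len(E_cat)) - E_num.sum() - E_cat.sum()
    return (1 - E_num) / denom, (1 - E_cat) / denom

def min_max(X_num):
    # Eq. (6): min-max normalization (assumes no constant attribute)
    lo, hi = X_num.min(axis=0), X_num.max(axis=0)
    return (X_num - lo) / (hi - lo)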

B. Constructing an Adaptive k-NN Global Weighted and Directed Network Model

The k-NN of a data object x_{i} is the collection of data objects whose distance from x_{i} is less than or equal to the distance from x_{i} to its k^{\mathrm{th}} nearest neighbor (represented by D_{i}^{\ast}). It is defined as follows:
\begin{align*} k\text{-NN}(x_{i})=\{x_{m}\mid (x_{m}\in X)\cap D_{im}\le D_{i}^{\ast}\}\quad (i,m=1,2,\ldots,n,\; m\ne i),\tag{7}\end{align*}
where D_{im} is the distance between x_{i} and x_{m}. The Heterogeneous Euclidean-Overlap Metric [38] is used to calculate the distance between data objects on the mixed-valued attributes, that is,
\begin{equation*} D_{im}=\sqrt{\sum\nolimits_{j^{N}=1}^{d_{1}} w^{j^{N}}\left(d_{im}^{j^{N}}\right)^{2}+\sum\nolimits_{j^{C}=1}^{d_{2}} w^{j^{C}}\left(d_{im}^{j^{C}}\right)^{2}},\tag{8}\end{equation*}
where
\begin{align*} d_{im}^{j^{N}}&=\left|x_{i}^{j^{N}}-x_{m}^{j^{N}}\right|,\tag{9}\\ d_{im}^{j^{C}}&=\begin{cases} 0, & x_{i}^{j^{C}}=x_{m}^{j^{C}};\\ 1, & x_{i}^{j^{C}}\ne x_{m}^{j^{C}}.\end{cases}\tag{10}\end{align*}
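Below is a minimal sketch of the weighted HEOM of Eqs. (8)–(10), assuming NumPy arrays and the hypothetical helpers above; X_num is assumed to be already min-max normalized so that D_{im} stays in [0, 1].

import numpy as np

def heom_matrix(X_num, X_cat, w_num, w_cat):
    # Pairwise Heterogeneous Euclidean-Overlap Metric, Eqs. (8)-(10)
    n = X_num.shape[0]
    D2 = np.zeros((n, n))
    for j in range(X_num.shape[1]):
        # Eq. (9): absolute difference on the j-th numerical attribute
        diff = np.abs(X_num[:, j, None] - X_num[None, :, j])
        D2 += w_num[j] * diff ** 2
    for j in range(X_cat.shape[1]):
        # Eq. (10): 0/1 overlap on the j-th categorical attribute
        neq = (X_cat[:, j, None] != X_cat[None, :, j]).astype(float)
        D2 += w_cat[j] * neq
    return np.sqrt(D2)  # Eq. (8)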

The value of k has a great impact on the accuracy of outlier detection results. Scholars have proposed plenty of methods to obtain it; however, most of them determine the value of k by combining parameter optimization algorithms and evaluation indexes on the premise that the outlier objects in the dataset are known. For unlabelled datasets, this kind of method can hardly give an appropriate k. In particular, the distribution characteristics of data objects vary across datasets, and in order to ensure the accuracy of detection results, the corresponding k needs to be determined according to the distribution characteristics of the dataset. Therefore, based on the principle of the k-NN, an adaptive search algorithm for the appropriate value of k according to the distribution characteristics of the dataset is given as Algorithm 1, where \Omega_{r}(x_{i}) is the collection of data objects that regard x_{i} as one of their r nearest neighbors, and N(\Omega_{r}(x_{i})) represents the number of elements in \Omega_{r}(x_{i}).

Algorithm 1 Automatic Search Algorithm for the Value of k

Input: Dataset X, r = 1, \Omega_{r}(x_{i}) = \emptyset, N(\Omega_{r}(x_{i})) = 0, P = 0
Output: the adaptive k

01. for i \leftarrow 1 to n
02.   for m \leftarrow 1 to n
03.     D_{im} \leftarrow \sqrt{\sum_{j^{N}=1}^{d_{1}} w^{j^{N}}(d_{im}^{j^{N}})^{2} + \sum_{j^{C}=1}^{d_{2}} w^{j^{C}}(d_{im}^{j^{C}})^{2}}
04.     sort x_{m} based on D_{im} in ascending order for x_{i}
05.   end for
06. end for
07. while P = 0 do
08.   for i \leftarrow 1 to n
09.     \Omega_{r}(x_{i}) \leftarrow \{x_{m} \mid (m \ne i) \wedge (x_{i} \in r\text{-NN}(x_{m}))\}
10.   end for
11.   foreach x_{i} in X
12.     if there exists N(\Omega_{r}(x_{i})) = 0
13.       P \leftarrow 0
14.     else
15.       P \leftarrow 1
16.     end if
17.   end foreach
18.   r \leftarrow r + 1
19. end while
20. k \leftarrow r - 1
21. return k
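The following is a minimal Python sketch of Algorithm 1 under the assumption that the n x n HEOM distance matrix D has already been computed (e.g., by the hypothetical heom_matrix above); it grows r until every object appears among the r nearest neighbors of at least one other object, mirroring steps 07–21.

import numpy as np

def adaptive_k(D):
    # Sketch of Algorithm 1: return the smallest r such that Omega_r(x_i)
    # is nonempty for every x_i (equivalent to the paper's k = r - 1 after
    # the final increment in the while loop).
    n = D.shape[0]
    Dx = D + np.diag(np.full(n, np.inf))   # exclude each object from its own neighborhood
    order = np.argsort(Dx, axis=1)          # steps 01-06: neighbors in ascending distance
    for r in range(1, n):                   # steps 07-19
        counts = np.zeros(n, dtype=int)
        for m in range(n):
            counts[order[m, :r]] += 1       # x_i in r-NN(x_m) enlarges Omega_r(x_i)
        if counts.min() > 0:                # P = 1: no empty Omega_r(x_i)
            return r
    return n - 1                            # every object is in everyone's (n-1)-NN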

Then, an adaptive k-NN weighted and directed network model M=(X, E) (shown in Fig. 1(b)) is constructed based on the k-NN of each data object, where the elements of X=\{x_{1}, x_{2},\ldots,x_{n}\} represent the nodes (data objects) in the network model, and E is a binary matrix whose element E_{im} indicates whether there is an edge from x_{i} to its neighbor x_{m}, that is:
\begin{align*} E_{im}=\begin{cases} 1, & \mathrm{if}~x_{m}\in k\text{-NN}(x_{i});\\ 0, & \mathrm{otherwise.}\end{cases}\tag{11}\end{align*}

FIGURE 1. Schematic diagram of the network model.

The adaptive k-NN weighted and directed network of the dataset can be represented by an adjacency matrix A under the rule that if A_{im}>0, then there is a directed edge from x_{i} to x_{m} whose weight is A_{im}. The elements of matrix A are defined as follows:
\begin{align*} A_{im}=\begin{cases} 1-D_{im}, & \mathrm{if}~E_{im}=1;\\ 0, & \mathrm{if}~E_{im}=0.\end{cases}\tag{12}\end{align*}

In this network, there is an edge between each node and each of its k-NN, directed from the node to the neighbor, with the weight of the connecting edge defined by Eq. (12). As shown in Fig. 1(b), there may be two subnetworks; depending on the distribution of the dataset, there may be multiple subnetworks, isolated nodes, etc. Therefore, a node x_{G} is introduced, and it is assumed that there are bidirectional edges between x_{G} and every x_{i} (i=1,2,\ldots,n), so as to form an adaptive k-NN global weighted and directed network, as shown in Fig. 1(c).

The edge weights of the bidirectional edges between x_{G} and x_{i} (i=1,2,\ldots,n) are expressed as
\begin{align*} \omega_{iG}&=\left\{A_{im}\mid A_{im}\ne 0\right\},\tag{13}\\ \omega_{Gi}&=\sum\nolimits_{m=1}^{n} A_{im}.\tag{14}\end{align*}

The adjacency matrix A_{G} of the adaptive k-NN global weighted and directed network is denoted as
\begin{align*} \boldsymbol{A}_{G}=\left[{\begin{array}{ccccc} 0 & A_{12} & \cdots & A_{1n} & \omega_{1G}\\ A_{21} & 0 & \cdots & A_{2n} & \omega_{2G}\\ \vdots & \vdots & \ddots & \vdots & \vdots \\ A_{n1} & A_{n2} & \cdots & 0 & \omega_{nG}\\ \omega_{G1} & \omega_{G2} & \cdots & \omega_{Gn} & 0 \end{array}}\right].\tag{15}\end{align*}
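A minimal sketch of the network construction of Eqs. (11)–(15) follows, assuming the distance matrix and adaptive k from the sketches above. Since Eq. (13) is written in set notation, we read \omega_{iG} here as the largest nonzero edge weight leaving x_{i}; this reading is our assumption, not something the paper states.

import numpy as np

def build_global_network(D, k):
    # Adaptive k-NN global weighted and directed network, Eqs. (11)-(15);
    # the last row/column of the returned matrix is the global node x_G.
    n = D.shape[0]
    Dx = D + np.diag(np.full(n, np.inf))
    order = np.argsort(Dx, axis=1)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = order[i, :k]
        A[i, nbrs] = 1.0 - D[i, nbrs]             # Eqs. (11)-(12): similarity as edge weight
    A_G = np.zeros((n + 1, n + 1))
    A_G[:n, :n] = A
    for i in range(n):
        nz = A[i][A[i] > 0]
        A_G[i, n] = nz.max() if nz.size else 1.0  # omega_iG, Eq. (13), read as max (assumption)
        A_G[n, i] = A[i].sum()                    # omega_Gi, Eq. (14)
    return A_G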

C. Performing a Random Walk Process to Identify Outliers

In a specific network, a random walk process is a stochastic process in which a random walker moves from x_{i} to x_{m} (i,m=1,2,\ldots,n, m\ne i) at the next step with a specific probability. The transition probability of the random walker moving from one node to another depends only on the current state and remains unchanged throughout the process, expressed as:
\begin{align*} p_{im}^{t}=p_{im}^{t-1}=p\left(x^{t}=x_{m}\mid x^{t-1}=x_{i}\right)\quad\left(\forall i:\sum\nolimits_{m} p_{im}=1\right),\tag{16}\end{align*}
where x^{t} and x^{t-1} represent the position of the random walker at time steps t and t-1, respectively, and p_{im}^{t} represents the transition probability of the random walker moving from x_{i} to x_{m} at time step t.

The adaptive k-NN global weighted and directed network represented by the adjacency matrix A_{G} uniquely defines a random walk process. The transition probability matrix is obtained from A_{G}:
\begin{equation*} \mathbf{P}_{G}=\mathbf{D}^{-1}\times\mathbf{A}_{G},\tag{17}\end{equation*}
where D is a diagonal matrix whose diagonal elements are the corresponding row sums of A_{G}, so that each row of \mathbf{P}_{G} sums to one.

The random walker is expected to jump to a neighbor with a greater similarity to the current node at the next time step. Using the edge weights between nodes to customize the transition probability of the random walk process achieves this effect and effectively completes the transition between nodes. Starting from any state at any time, the random walk process will converge to an equilibrium state after a certain number of iterations, and the probability of the random walker being at each node will no longer change. Using an iterative method, the stationary distribution vector of the random walk process on the adaptive k-NN global weighted and directed network can be estimated. The iterative process is formalized as follows:
\begin{equation*} \boldsymbol{\pi}^{t}=\boldsymbol{\pi}^{t-1}\times\mathbf{P}_{G},\tag{18}\end{equation*}
where the elements of \boldsymbol{\pi}^{t}=(\pi_{1}^{t}, \pi_{2}^{t},\ldots,\pi_{n}^{t},\pi_{G}^{t}) represent the probability of the random walker being at each node at time step t, and \boldsymbol{\pi}^{0} is an initialized probability vector.

After the random walk process reaches equilibrium, each element of the stationary distribution vector \boldsymbol{\pi}^{t} can be interpreted as the visiting probability of the corresponding node in the adaptive k-NN global weighted and directed network. Based on the constraints on the behavior of the random walker, it can be inferred that potential outlier nodes will have fewer chances to be visited, and will therefore be assigned relatively smaller values in the stationary distribution vector. Considering the influence of x_{G}, an index (outlier score) is developed by assigning the visiting probability of x_{G} back to each real node to measure the outlier degree of data objects, which is denoted as
\begin{equation*} \Psi_{i}=\frac{1}{\pi_{i}^{t}+\frac{\omega_{Gi}}{\sum\nolimits_{i=1}^{n}\omega_{Gi}}\pi_{G}^{t}}.\tag{19}\end{equation*}
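A minimal sketch of Eqs. (17)–(19), assuming the adjacency matrix A_G from the sketch above; the stopping tolerance and iteration cap are our own choices, not prescribed by the paper.

import numpy as np

def outlier_scores(A_G, tol=1e-10, max_iter=10000):
    # Row-normalize A_G into the transition matrix P_G (Eq. (17)), iterate
    # the distribution to stationarity (Eq. (18)), then fold the global
    # node's probability back onto the real nodes (Eq. (19)).
    m = A_G.shape[0]                                  # n real nodes + x_G
    P_G = A_G / A_G.sum(axis=1, keepdims=True)        # Eq. (17)
    pi = np.full(m, 1.0 / m)                          # initial probability vector
    for _ in range(max_iter):
        nxt = pi @ P_G                                # Eq. (18)
        if np.abs(nxt - pi).sum() < tol:
            pi = nxt
            break
        pi = nxt
    n = m - 1
    share = A_G[n, :n] / A_G[n, :n].sum()             # omega_Gi / sum(omega_Gi)
    return 1.0 / (pi[:n] + share * pi[n])             # Eq. (19): larger = more outlying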

In order to ensure the convergence of the random walk process, x_{G} is added during construction to guarantee the connectivity of the network model, and self-loops are excluded from the network model to avoid self-reinforcement. The transition probability matrix of the random walk process is defined by the similarity between a data object and its neighbors: when the random walker is located at a real node, it always jumps with a greater probability to the neighbor with the greatest similarity at the next time step. The settings of Eqs. (11)–(14) ensure that, at the next time step, the random walker jumps with a greater probability to the most similar neighbor when it is at a real node, to x_{G} when it is at an isolated node, and to the node with the greatest total similarity to its neighbors when it is at x_{G}. The values in the stationary distribution vector express the probability that each node is visited when the random walk process reaches a stable state. Based on the transition mechanism of the random walk process, potential outliers receive smaller visiting probabilities. In order to eliminate the influence of x_{G} on the detection results, its visiting probability is redistributed to the real nodes according to Eq. (13) and Eq. (14) to define the outlier score \Psi_{i}. It can be seen from Eq. (19) that the larger \Psi_{i} is, the more outlying x_{i} tends to be.

The proposed outlier detection methodology is presented in Algorithm 2.

Algorithm 2 Calculate the Outlier Score for Each Data Object

Input: Dataset X, k
Output: outlier score of each data object

01. for i \leftarrow 1 to n
02.   k\text{-NN}(x_{i}) \leftarrow \{x_{m} \mid (m \ne i) \wedge (x_{m}\in X) \wedge (D_{im} \le D_{i}^{\ast})\}
03. end for
04. for i \leftarrow 1 to n
05.   for m \leftarrow 1 to n
06.     E_{im} \leftarrow 1 if x_{m}\in k\text{-NN}(x_{i}); 0 otherwise
07.     A_{im} \leftarrow 1-D_{im} if E_{im}=1; 0 if E_{im}=0
08.   end for
09. end for
10. for i \leftarrow 1 to n
11.   \omega_{iG} \leftarrow \{A_{im} \mid A_{im}\ne 0\}
12.   \omega_{Gi} \leftarrow \sum_{m=1}^{n} A_{im}
13. end for
14. \mathbf{P}_{G} \leftarrow \mathbf{D}^{-1}\mathbf{A}_{G}
15. \boldsymbol{\pi}^{t} \leftarrow \boldsymbol{\pi}^{t-1}\mathbf{P}_{G} (iterate until convergence)
16. for i \leftarrow 1 to n
17.   \Psi_{i} \leftarrow 1 / (\pi_{i}^{t}+\frac{\omega_{Gi}}{\sum_{i=1}^{n}\omega_{Gi}}\pi_{G}^{t})
18. end for
19. return \Psi_{i}

The proposed method first searches for the value of k, which has O(n^{2}) complexity. Calculating the neighborhood relations in the universe X also has O(n^{2}) complexity, as does computing the adjacency matrix of the k-NN weighted and directed network. The iterative method for computing the stationary distribution vector has O(n^{2}) complexity, and the last step, assigning the outlier scores, has O(n) complexity. Summing these up, the total time complexity of the algorithm is O(n^{2}).
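To make the pipeline concrete, here is a hypothetical end-to-end run that chains the sketches above on a toy DMA; the data values are illustrative only, and all function names refer to the hedged sketches introduced earlier, not to an official implementation.

import numpy as np

# four objects: three similar inliers and one object that deviates on both
# its numerical and its categorical attributes
X_num = np.array([[0.20, 1.0], [0.30, 1.1], [0.25, 0.9], [5.00, 9.0]])
X_cat = np.array([['a', 'x'], ['a', 'x'], ['a', 'y'], ['b', 'z']])

w_num, w_cat = entropy_weights(X_num, X_cat)            # Eqs. (1)-(5)
D = heom_matrix(min_max(X_num), X_cat, w_num, w_cat)    # Eqs. (6), (8)-(10)
k = adaptive_k(D)                                       # Algorithm 1
A_G = build_global_network(D, k)                        # Eqs. (11)-(15)
print(outlier_scores(A_G))                              # the fourth object should score highest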

SECTION IV.

Experiments

The proposed method is experimentally verified on three UCI datasets against three related outlier detection methods: the neighborhood information entropy-based outlier detection method (NIEOD) [39], outlier detection using random walks (OutRank) [23], and the virtual outlier score model (VOS) [33]. The experimental environment is shown in Table 2.

TABLE 2 Experimental Environment

The NIEOD detects outliers in datasets with numerical, categorical, and mixed-valued data by using the neighborhood information system and information entropy. The heterogeneous distance and a self-adapting radius are applied to determine the neighborhood information system of the dataset; the neighborhood information entropy, relative neighborhood entropy, deviation degree, and outlier factor are then constructed from the neighborhood information around each data object to measure its outlier degree. It extends outlier detection methods based on traditional distance and rough sets, and is more applicable to datasets with uncertainty mechanisms. There are two parameters in this method, namely the neighborhood radius adjustment parameter and the judgement threshold.

OutRank explores the application of the random walk process to outlier detection in datasets with numerical data; the random walk process can effectively capture not only uniformly dispersed outliers but also small clusters of outliers. It builds a weighted and undirected neighborhood network model and defines the similarity metric using the cosine similarity between objects and the shared-nearest-neighbor density, respectively. It has yielded higher detection rates with lower false alarm rates than distance- and density-based outlier detection methods on both real and synthetic datasets.

The VOS improves outlier detection methods based on the random walk process by combining the k-NN with the network model to identify outliers in datasets with numerical attributes. It uses the top-k similar neighbors of each data object to construct the network model, and implements outlier detection by executing a tailored random walk process. By making full use of the local information of each data object and considering the outlier degree of each data object from a global perspective, the effectiveness of this method has been demonstrated theoretically.

The method proposed in this paper can identify the outliers in a dataset with numerical, categorical, or mixed-valued data, integrating the advantages of the above methods and overcoming their defects to a certain extent. Moreover, a predetermined parameter such as k is not necessary in the proposed method.

A. UCI Datasets

The Glass Identification dataset, the Hayes-Roth dataset, and the Lymphography dataset from the UCI machine learning repository [40] are applied to evaluate the performance of the proposed method. These datasets are labeled for classification and their rare classes are known; the data objects in the rare classes are considered outliers.

The Glass Identification dataset contains 214 data objects, each described by 1 name attribute, 9 numerical attributes, and 1 class attribute. The 214 data objects are divided into 6 categories with 70, 76, 17, 13, 9, and 29 data objects, respectively. Class 5 has the fewest objects and can be treated as the rare class, in which the data objects are outliers.

The Hayes-Roth dataset contains 132 data objects with 1 name attribute, 4 categorical attributes, and 1 class attribute. The 132 data objects are divided into 3 categories with 51, 51, and 30 data objects, respectively. Class 3 has the fewest objects and can be treated as the rare class, in which the data objects are outliers.

The Lymphography dataset contains 148 data objects, each described by 3 numerical attributes, 15 categorical attributes, and 1 class attribute. The 148 data objects are divided into 4 categories with 2, 81, 61, and 4 data objects, respectively. Classes 1 and 4 have the fewest objects and can be treated as the rare classes, in which the data objects are outliers.

The data distributions of the above three datasets are shown in Fig. 2(a), (b), and (c), respectively.

FIGURE 2. Distribution of data objects on the above three datasets.

B. Evaluation Metrics for the Performance of Methods

In order to quantitatively analyze the experimental results of different outlier detection methods, three traditional information system quality metrics, “precision” (Pre), “recall” (Rec), and “rank power” (RP), are used in this paper [41].

Pre calculates the proportion of real outliers among the first z data objects identified as outliers:
\begin{equation*} {\textit{Pre}}=\frac{N_{identified}}{z},\tag{20}\end{equation*}
where N_{identified} represents the number of real outliers identified as outliers in the first z data objects.

Rec measures the percentage of real outliers identified in the first z data objects relative to all real outliers in the dataset:
\begin{equation*} {\textit{Rec}}=\frac{N_{identified}}{N_{real}},\tag{21}\end{equation*}
where N_{real} is the number of real outliers in the dataset.

Pre and Rec estimate the accuracy of the detection results of an outlier detection method, but neither can accurately compare the quality of the detection results of different methods. For example, a method that places two real outliers at the first two positions has the same Pre and Rec as one that places them at any two positions within the first z data objects. The positions at which the real outliers are identified are usually an important factor in comparing different outlier detection methods. Therefore, RP is introduced to simultaneously measure the positions and the number of the real outliers:
\begin{equation*} {\textit{RP}}=\frac{N_{identified}\left(N_{identified}+1\right)}{2\sum\nolimits_{L=1}^{N_{identified}} O_{L}},\tag{22}\end{equation*}
where RP\in[0,1] and O_{L} represents the position of the L^{\mathrm{th}} real outlier. In particular, RP=1 if and only if all real outliers are at the top of the objects selected by the outlier detection method. Obviously, a larger RP means better performance of the outlier detection method.

Pre and Rec are positively correlated with the effectiveness of outlier detection methods. When Pre and Rec are the same, the larger the RP is, the more effective the outlier detection method is.
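For reference, the three metrics can be computed from a ranked list as in the following sketch; the function name and argument layout are our own.

def pre_rec_rp(is_outlier_ranked, z, n_real):
    # Eqs. (20)-(22); is_outlier_ranked[i] is True when the object ranked
    # at position i + 1 (most outlying first) is a real outlier.
    positions = [i + 1 for i, flag in enumerate(is_outlier_ranked[:z]) if flag]
    n_id = len(positions)                                 # N_identified
    pre = n_id / z                                        # Eq. (20)
    rec = n_id / n_real                                   # Eq. (21)
    rp = n_id * (n_id + 1) / (2 * sum(positions)) if n_id else 0.0  # Eq. (22)
    return pre, rec, rp

# e.g., two real outliers ranked 1st and 2nd with z = 5 and 3 real outliers:
# pre_rec_rp([True, True, False, False, False], 5, 3) -> Pre = 0.4, Rec ~ 0.67, RP = 1.0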

C. Experiment Results

Here, the detection results of the proposed method are compared with those of the other three methods on the three different types of UCI datasets using the above indexes. The four methods do not classify the data objects into normal objects and outliers. Instead, each data object is assigned a score measuring its outlier degree, and the data objects are then sorted (in ascending or descending order) according to the judgment mechanism of the method. The first z data objects, which may include real outliers and/or identified (but not real) outliers, are chosen to verify the performance of the methods.

The scores of the data objects produced by the proposed method, the NIEOD, and the VOS are sorted in descending order, while those produced by OutRank are sorted in ascending order, reflecting the different mechanism each method uses to construct its outlier-degree index. As the number z of first-selected objects increases, the number of real outliers identified by the above methods differs; the detection results on the three datasets are shown in Table 3, Table 4, and Table 5, respectively.

TABLE 3 Detection Results of the Four Outlier Detection Methods on the Glass Identification Dataset
TABLE 4 Detection Results of the Four Outlier Detection Methods on the Hayes-Roth Dataset
TABLE 5 Detection Results of the Four Outlier Detection Methods on the Lymphography Dataset

Some supplementary explanations of the detection results in Table 3, Table 4, and Table 5 follow. “Top ratio” indicates the proportion of the first z selected data objects in the whole dataset, and “Coverage” represents the proportion of the real outliers identified in the first z data objects among all real outliers in the dataset. “Mean” measures the average number of real outliers identified by each method over the first z data objects at which all four methods can identify all real outliers in the dataset.

For better visualization, the detection results on the different datasets in Table 3, Table 4, and Table 5 are illustrated in Fig. 3(a), (b), and (c), respectively, and the Mean of the outlier detection methods on the UCI datasets is shown in Fig. 4.

FIGURE 3. Detection results on three datasets based on the four methods.

FIGURE 4. Mean of detection results on datasets for the four outlier detection methods.

The analyses of the detection results based on the evaluation indexes for each method on the three datasets are shown in Table 6, Table 7, and Table 8, respectively, and the corresponding results are shown in Fig. 5, Fig. 6, and Fig. 7, respectively.

TABLE 6 Analysis of Detection Results on the Glass Identification Dataset

z    | Proposed method   | NIEOD             | OutRank           | VOS
     | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP
3    | 0.33  0.11  0.33  | 0.33  0.11  0.33  | 0     0     0     | 0     0     0
5    | 0.4   0.22  0.38  | 0.2   0.11  0.33  | 0     0     0     | 0.2   0.11  0.2
7    | 0.43  0.33  0.4   | 0.14  0.11  0.33  | 0     0     0     | 0.14  0.11  0.2
17   | 0.24  0.44  0.31  | 0.06  0.11  0.33  | 0.12  0.22  0.11  | 0.12  0.22  0.18
30   | 0.17  0.56  0.24  | 0.1   0.33  0.21  | 0.1   0.33  0.12  | 0.07  0.22  0.18
42   | 0.19  0.89  0.20  | 0.07  0.33  0.21  | 0.07  0.33  0.12  | 0.05  0.22  0.18
123  | 0.07  1     0.15  | 0.05  0.67  0.07  | 0.05  0.67  0.08  | 0.04  0.56  0.06
150  | 0.06  1     0.15  | 0.06  1     0.06  | 0.04  0.67  0.08  | 0.03  0.56  0.06
200  | 0.05  1     0.15  | 0.05  1     0.06  | 0.03  0.67  0.08  | 0.05  1     0.04
213  | 0.04  1     0.15  | 0.04  1     0.06  | 0.04  1     0.05  | 0.04  1     0.04

TABLE 7 Analysis of Detection Results on the Hayes-Roth Dataset

z    | Proposed method   | NIEOD             | OutRank           | VOS
     | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP
16   | 0.88  0.47  0.79  | 0.44  0.23  0.56  | 0.5   0.27  0.51  | 0.31  0.17  0.56
18   | 0.83  0.5   0.79  | 0.39  0.23  0.56  | 0.44  0.27  0.51  | 0.33  0.2   0.48
24   | 0.83  0.67  0.80  | 0.29  0.23  0.56  | 0.38  0.3   0.5   | 0.29  0.23  0.42
28   | 0.82  0.77  0.81  | 0.25  0.23  0.56  | 0.32  0.3   0.5   | 0.29  0.27  0.4
35   | 0.74  0.87  0.79  | 0.29  0.33  0.37  | 0.26  0.3   0.5   | 0.26  0.3   0.36
39   | 0.74  0.93  0.78  | 0.28  0.37  0.35  | 0.26  0.33  0.44  | 0.28  0.37  0.33
49   | 0.61  1     0.76  | 0.29  0.47  0.32  | 0.20  0.33  0.44  | 0.29  0.47  0.31
80   | 0.38  1     0.76  | 0.25  0.67  0.30  | 0.21  0.57  0.25  | 0.21  0.57  0.28
120  | 0.25  1     0.76  | 0.23  0.9   0.28  | 0.22  0.87  0.23  | 0.22  0.87  0.24
132  | 0.23  1     0.76  | 0.23  1     0.27  | 0.23  1     0.23  | 0.23  1     0.23

TABLE 8 Analysis of Detection Results on the Lymphography Dataset

z    | Proposed method   | NIEOD             | OutRank           | VOS
     | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP    | Pre   Rec   RP
3    | 1     0.5   1     | 0.67  0.33  1     | 0.33  0.17  1     | 0     0     0
6    | 0.67  0.67  0.83  | 0.5   0.5   0.86  | 0.33  0.33  0.5   | 0     0     0
28   | 0.18  0.83  0.375 | 0.21  1     0.54  | 0.21  1     0.34  | 0.07  0.33  0.12
31   | 0.19  1     0.30  | 0.19  1     0.54  | 0.19  1     0.34  | 0.06  0.33  0.12
50   | 0.12  1     0.30  | 0.12  1     0.54  | 0.12  1     0.34  | 0.06  0.5   0.09
100  | 0.06  1     0.30  | 0.06  1     0.54  | 0.06  1     0.34  | 0.04  0.67  0.08
115  | 0.05  1     0.30  | 0.05  1     0.54  | 0.05  1     0.34  | 0.05  1     0.06
FIGURE 5. Analysis results of the four methods on the Glass Identification dataset.


FIGURE 6. Analysis results of the four methods on the Hayes-Roth dataset.

FIGURE 7. Analysis results of the four methods on the Lymphography dataset.

The running times of the four outlier detection methods on the three datasets are exhibited in Table 9.

TABLE 9 Experimental Results of Running Time (s)

D. Discussion and Analysis

The details of the experimental results in Section 4.3 are as follows:

1) Glass Identification Dataset

As shown in Table 3, Fig. 3(a), and Fig. 4, the proposed method identifies the largest number of real outliers among the four methods as the number z of first-selected data objects increases. When the first 123 (57%) data objects are selected, the proposed method identifies all real outliers in the dataset, while the other three methods identify 6, 6, and 5 real outliers, respectively; they identify all real outliers when the first 150 (70%), 213 (99%), and 200 (93%) data objects are selected, respectively. When the first 213 data objects are selected, all four methods identify all real outliers in the dataset, and the average number of real outliers identified by each method is 5.9, 4.3, 3.5, and 3.6, respectively. From Fig. 5, it can be found that when the number of selected data objects is less than or equal to 150, the Pre and Rec of the proposed method are larger than those of the other three methods. When z is larger than 150, the four methods have the same values of Pre and Rec, but the proposed method has a larger RP than the other three. Besides, as shown in Table 9, the running time of the proposed method on this dataset is only 4.2 s, while those of the other three methods are 207.5 s, 15.1 s, and 8858.4 s, respectively. In conclusion, the proposed method has better detection performance than the NIEOD, OutRank, and the VOS on the Glass Identification dataset, which contains only numerical attributes.

2) Hayes-Roth Dataset

The proposed method detects all real outliers in the dataset when the first 49 (37%) data objects are selected, while the other three methods detect 14, 10, and 14, respectively, as shown in Table 4 and Fig. 3(b). When z is 132 (100%), the NIEOD, OutRank, and the VOS detect all real outliers. In Fig. 4, the Mean of the proposed method is 24.6, which is significantly higher than those of the comparative methods. As the number z of first-selected data objects increases, the Pre and Rec of the proposed method are significantly larger than those of the other three methods. When z is 132, their Pre and Rec become the same; at this point, the RP of the proposed method is the largest among them, as shown in Fig. 6(c). Moreover, the proposed method has the shortest running time, 0.87 s, while the other methods take 1.7 s, 1.5 s, and 306.4 s, respectively. Compared with the other three methods, the proposed method has obvious advantages in the number of real outliers identified, Pre, Rec, and running time on the Hayes-Roth dataset, which contains only categorical attributes.

3) Lymphography Dataset

From Table 5, Fig. 3(c), and Fig. 4, the proposed method identifies 3 real outliers when the first 3 (2%) data objects are selected, while the NIEOD, OutRank, and the VOS identify 2, 1, and 0, respectively. The four methods detect 4, 3, 2, and 0 real outliers, respectively, when z is 6 (4%). The NIEOD and OutRank are the first to identify all real outliers, when the first 28 data objects are selected, at which point the proposed method and the VOS identify 5 and 2, respectively; the proposed method detects all real outliers when z is 31. Nevertheless, the Mean of the proposed method is slightly higher than those of the other three methods. In Table 8 and Fig. 7, the Pre and Rec of the proposed method are the highest when z is less than 28, consistent with the NIEOD and OutRank when the number of selected data objects is greater than 28, and slightly lower than the NIEOD and OutRank only when exactly the first 28 data objects are selected. Additionally, the running time of the proposed method is 1.1 s, while those of the other methods are 4.7 s, 5.1 s, and 302.0 s, respectively. Therefore, the proposed method can be applied to identify the real outliers in the Lymphography dataset, which has both numerical and categorical attributes.

Based on the above observations, the following conclusions can be drawn:

  1. The proposed method can be used to detect outliers in different types of datasets, including the dataset with numerical attributes, categorical attributes, and mixed-valued attributes.

  2. The adaptive k is obtained automatically according to the distribution characteristics of the dataset, which ensures higher quality of outlier mining and reduces the cost in the process of parameter adjustment.

  3. In the proposed method, the k-NN is used to mine the local information of data objects, and a customized random walk process is used to explore the long-term correlation between related data objects from a global perspective. The combination of the k-NN and the random walk process improves the accuracy of detection results.

SECTION V.

Conclusion

In this paper, an unsupervised outlier detection method based on the adaptive k-NN global network is proposed to identify the outliers in a DMA. First, the procedures for determining the weights of the numerical and categorical attributes are given to reduce the influence of different attributes on data objects, and the spatial distance between mixed-valued data objects is measured based on the Heterogeneous Euclidean-Overlap Metric. Second, an adaptive search algorithm for the appropriate value of k is introduced, which automatically obtains k according to the distribution of data objects in different datasets, and the k-NN of each data object is obtained. Next, a network model is constructed based on the neighborhood relationships between data objects, and a customized similarity measure is applied to calculate the edge weights of the network, in which the edge weight is directly proportional to the similarity between a data object and its neighbor. Then, a special random walk process is performed on the network model by defining the transition probability matrix using the edge weights. After the random walk process reaches the equilibrium state, the outlier score \Psi_{i} is constructed to measure the outlier degree of each data object. Finally, a detailed empirical study on three typical UCI datasets illustrates the effectiveness, accuracy, and applicability of our method in detecting outliers. The proposed method achieves higher Pre and Rec in the detection results compared with the other three methods, and it can be employed on datasets with numerical attributes, categorical attributes, or mixed-valued attributes.

However, calculating the stationary distribution vector is a time-consuming process, which limits the application of outlier detection methods based on the random walk process to large-scale and streaming datasets. The main work in the next stage is to apply this method to practical application scenarios with large datasets.
