Unsupervised outlier detection for mixed-valued dataset based on the adaptive k-Nearest Neighbor global network

Outlier detection aims to reveal data patterns that differ from existing data. Owing to their robustness and interpretability, outlier detection methods for numerical datasets based on the k-Nearest Neighbor (k-NN) network have attracted much attention in recent years. However, the datasets produced in many practical contexts contain both numerical and categorical attributes, that is, they are datasets with mixed-valued attributes (DMAs). In addition, the selection of k is an open issue for unlabeled datasets. Therefore, an unsupervised outlier detection method for DMAs based on an adaptive k-NN global network is proposed. First, an adaptive search algorithm for the appropriate value of k, which considers the distribution characteristics of the dataset, is introduced. Next, the distance between mixed-valued data objects is measured with the Heterogeneous Euclidean-Overlap Metric, and the k-NN of each data object is obtained. Then, an adaptive k-NN global network is constructed based on the neighborhood relationships between data objects, and a customized random walk process is executed on it to detect outliers, using the transition probability to limit the behavior of the random walker. Finally, the effectiveness, accuracy, and applicability of the proposed method are demonstrated by a detailed experiment.


I. INTRODUCTION
As an important task in data mining, outlier detection aims to reveal data patterns that differ from existing data [1]. An outlier can be defined as "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [2]. Another definition is given by Barnett and Lewis: "an outlier is an observation (or a set of observations) which appears to be inconsistent with the remainder of the given dataset" [3]. Outliers can be anomalies, novelties, noise, deviations, and exceptions [4]. An outlier usually represents a new perspective or a specific mechanism that attracts more interest than normal instances. Therefore, outlier detection has been widely used in different domains, e.g., human activity recognition [5], credit card fraud detection [6], medical diagnosis [7], video detection [8], and fault diagnosis [9].
Generally, according to whether the dataset is labeled or not, outlier detection methods can be roughly divided into three categories, namely supervised methods [10], semi-supervised methods [11], and unsupervised methods [12]. Most datasets collected in real engineering contexts are unlabeled, and labeling them is problematic or unacceptably costly. Therefore, unsupervised outlier detection is popular because it does not require a labeled training dataset. In recent decades, various unsupervised outlier detection techniques have been proposed, most of which are based on the nearest neighbors of data objects [13]. There are two nearest neighbor concepts, i.e., the ε-Nearest Neighbor and the k-Nearest Neighbor (k-NN) [14], of which the k-NN is more widely adopted. The core idea of the k-NN is to select a specific k for a dataset and, for each data object, find the k data objects with the greatest similarity to it or the shortest distance from it. When k is too large, the neighbors of each data object contain useless information and may even cause errors in subsequent algorithms. Conversely, when k is too small, each data object has few neighbors that carry limited useful information, which reduces the accuracy of the algorithm. The value of k that gives the best outlier detection results usually differs from one dataset to another. Therefore, a search algorithm is proposed to automatically determine the appropriate value of k for datasets with different distributions.
Besides, in many real-world problems the datasets include both numerical and categorical attributes, that is, they are datasets with mixed-valued attributes (DMAs). For example, unpredictable outliers often appear in the actual operation of a warehousing system, especially an emerging one, which increases the liabilities of the warehousing industry [15]. The abnormal operation status of a warehousing system is described as an object with mixed-valued attributes so as to realize the health diagnosis of the warehouse system [16]. More and more scholars focus on outlier detection methods for the DMA to solve practical problems. However, most outlier detection methods based on the k-NN are developed to handle datasets with only numerical attributes [17], [18], [19]. Therefore, research on outlier detection methods for the DMA based on the k-NN is both theoretically and practically significant.
Furthermore, the random walk process has been widely used for a variety of information retrieval tasks, including web search [20], keyword extraction [21], and text summarization [22]. These methods usually build a network model of the data objects and perform a random walk process on it to evaluate the centrality or importance of each data object. Moonesinghe and Tan [23] applied the random walk process to outlier detection and verified the accuracy of the detection results on real and synthetic datasets. The k-NN focuses on the local information of each data object and ignores possible internal connections across the whole dataset. The random walk process on the network model makes up for this drawback by emphasizing the overall or partial structure of the dataset. The combination of the k-NN and a random walk process can therefore properly characterize both the relationships between data objects and the hidden connections in the dataset [24].
Motivated by the above observations, an unsupervised outlier detection method based on an adaptive k-NN network is proposed in this study to handle the DMA. First, considering the influence of different attributes on data objects, a heterogeneous distance function is used to measure the spatial distance between mixed-valued data objects in a DMA. Second, an adaptive search algorithm for the appropriate value of k is proposed according to the distribution characteristics of the dataset, and the k-NN of each mixed-valued data object is obtained. Third, an adaptive k-NN global weighted directed network model is constructed based on the neighborhood relationships between data objects, and a customized random walk process is implemented on it. After the random walk process converges to an equilibrium state, the elements of the stationary distribution vector are used to construct the outlier score of each data object. Finally, the proposed method is compared with existing related outlier detection methods on three different types of UCI datasets, and the results show that the proposed method is more effective, accurate, and applicable.
The main contributions of this paper are threefold.
(1) We propose an adaptive search algorithm for the value of k based on the k-NN. It can automatically search the appropriate value of k according to the distribution characteristics of data objects in different datasets. This algorithm enables the unsupervised mechanism in the proposed outlier detection method for the DMA.
(2) We combine the k-NN with a random walk process to construct the outlier score for mixed-valued data objects. The k-NN obtains the information around each data object, and the random walk process on the network model explores the relationships between data objects from the perspective of the whole dataset. In this way, we not only make full use of the local information of each mixed-valued data object, but also consider the outlier degree of each data object from the global perspective.
(3) We validate the effectiveness, accuracy, and applicability of the proposed methodology against three other related outlier detection methods on three UCI datasets of different data types: a dataset with numerical attributes, a dataset with categorical attributes, and a DMA. Precision, recall, rank power, and time consumption are employed to evaluate the performance of the methods in the experiments.
The remainder of this paper is organized as follows. The related works are presented in Section II. The proposed methodology is detailed in Section III. The experiment on three types of UCI datasets is reported in Section IV. The conclusions and future work are provided in Section V.

II. RELATED WORKS
In this section, two kinds of outlier detection methods related to this study are reviewed: (1) outlier detection methods based on the k-NN and (2) outlier detection methods based on the random walk process.

A. OUTLIER DETECTION BASED ON THE k-NN
Over the past decades, scholars have contributed much to outlier detection based on the k-NN. Dong and Yan [25] propose a multivariate outlier detection method based on the k-NN. They use the gamma index to select the subsample containing the maximum amount of normal data. According to the position and dispersion of the subsample, the robust Mahalanobis distance is calculated to distinguish normal data from outliers. Wang et al. [19] give each data object a local outlier score and a global outlier score based on the k-NN medoid to measure whether the data object is an outlier. Muthukrishnan [26] introduces the Reverse Nearest Neighbor (RNN), which lays the foundation for RNN-based outlier detection methods. Uttarkabat et al. [13] propose an RNN-based outlier detection method to address the case in which an outlier point lies between a dense cluster and a sparse cluster and close to the sparse one. Using the statistical information of the RNN and the k-NN, a distance factor is defined to measure the outlier degree of data objects. The generalized form of the RNN is the reverse k-NN (Rk-NN). Cao et al. [27] propose a stream outlier detection method based on the Rk-NN that avoids multiple scans of the dataset and captures concept drift; updating the current window after an insertion or deletion needs only one scan. Batchanaboyina and Devarakonda [28] propose an efficient outlier detection approach using improved monarch butterfly optimization and mutual nearest neighbors. In this approach, the class labels of unknown data objects are determined by mutual nearest neighbors rather than by nearest neighbors alone. The advantage of mutual neighbors is that pseudo near neighbors can be identified and excluded during prediction.
However, the value of k remains an important factor affecting the accuracy of the detection results of such methods. To deal with this parameter sensitivity, some novel strategies have been proposed. Inspired by the idea that outlying objects are less easily selected than inlying objects in blind random sampling, Ha et al. [29] propose an entropy-based method that measures the observability factor in each iteration and optimizes the value of k. However, the algorithm still has parameters, such as the number of iterations and the sampling size, so it does not fundamentally eliminate the dependence on parameters. Ning et al. [30] propose a parameter selection method based on a mutual neighbor graph. The number of cliques (complete graphs) in the mutual neighbor graph is used to search for a stable state. When the stable state is reached, the appropriate value of k can be found, but at the expense of the accuracy and efficiency of identification.
In conclusion, the k-NN, which effectively expresses the local information around data objects, is widely used in outlier detection methods, and its effectiveness has been demonstrated by a large number of studies. However, the selection of k is still an issue that needs to be tackled.

B. OUTLIER DETECTION BASED ON RANDOM WALK PROCESS
To broaden the application fields of the random walk process, Moonesinghe and Tan [23] apply it to outlier detection and propose two strategies for constructing networks, using appropriate similarity measures and the number of shared neighbors, respectively. The accuracy of the detection results is verified on real and synthetic datasets. Since then, outlier detection methods based on the random walk process have been explored further. Berton et al. [31] propose a method that measures the outlier degree of nodes in complex networks by simultaneously considering the local and global information of each node, based on a distance measure derived from the random walk of a Brownian particle and on the dissimilarity index. Liu et al. [32] propose a method based on the random walk process to identify spatial outliers. Two weighted graphs are established according to the spatial and non-spatial attributes of spatial objects, respectively, and the correlation score between spatial objects is calculated by the random walk process. From these results, the outlier score of each data object is calculated, and the first k objects are identified as outliers. Wang et al. [24] construct a weighted directed network based on the k-NN of each data object and apply a random walk process to it. Two different types of restart vectors are proposed to tackle the dangling links caused by isolated nodes, and the outlier score of each data object is calculated after the random walk process reaches equilibrium. Afterwards, Wang et al. [33] introduce a new outlier detection model named the virtual outlier score, which constructs a virtual network based on the k-NN of data objects and a virtual point, and employs a tailored Markov random walk process on it to define the outlier score of data objects. Li et al. [34] learn the similarity matrix by minimizing the reconstruction error of the kernel matrix of the data, with the similarity matrix constrained by the double kernel norm and the Frobenius norm. A weighted directed network is then constructed based on the similarity matrix, and the random walk process is run on it to identify abnormal points. In these methods, each data object is modeled as a node in a network, and the relationship between objects is defined as an edge that connects the nodes. The nodes and edges of the network are analyzed by mining the characteristics of its topological structure to define the outlier score of each data object [35].
From the above, the keys to this kind of method are how to build the network model and how to determine the index that measures the outlier degree of data objects. Scholars mostly build the network model based on the neighborhood system of each data object. However, the combination of the k-NN and the random walk process has only been used to deal with datasets with numerical attributes, and the detection results are affected by the value of k. Therefore, this paper proposes an unsupervised outlier detection method that handles the DMA by combining the k-NN and the random walk process.

III. METHODOLOGY
The proposed methodology will be discussed in detail in this section, which includes three parts: a) data preprocessing, b) constructing an adaptive k-NN global weighted directed network model, and c) performing a random walk process to identify outliers.

A. DATA PREPROCESSING
Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of n mixed-valued data objects, and $A = \{a_1, a_2, \ldots, a_d\}$ be a set of attributes. Each data object is described by $d_1$ numerical attributes and $d_2$ categorical attributes, and $x_{ij}^N$ and $x_{ij}^C$ represent the value of the i-th data object on the j-th numerical attribute and the j-th categorical attribute, respectively.
The original data objects are distributed differently on each attribute. An attribute has a greater impact on the overall deviation degree of data objects when the data objects are more dispersed on it. Therefore, the entropy weight method [36] is applied to objectively weight the d attributes according to the dispersion of the data objects on each attribute, where $E_j^N$ and $E_j^C$ denote the information entropy of the data objects on the j-th numerical attribute and the j-th categorical attribute, respectively.
The values taken by the data objects on the j-th categorical attribute form a set $X_j^C$ whose elements are mutually exclusive; $x_{lj}^C$ ($l = 1, 2, \ldots, c$) denotes the l-th value on the j-th categorical attribute, and c is the number of different values taken by the data objects on that attribute. The information entropy of the data objects on a categorical attribute is calculated from the appearance frequency of each $x_{lj}^C$ in X. To eliminate the differences in dimension and order of magnitude of the original data objects on the numerical attributes, min-max normalization [37], which linearly transforms the variables, is applied to the numerical values, using the maximum and minimum values of the data objects on each numerical attribute.
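The two preprocessing quantities described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the function names are hypothetical, the entropy uses the natural logarithm, and a constant numerical attribute is mapped to 0 to avoid division by zero. The entropy-based attribute weights themselves are not reproduced here.

```python
import math
from collections import Counter


def min_max_normalize(values):
    """Linearly rescale one numerical attribute to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def categorical_entropy(values):
    """Shannon entropy (natural log, an assumption) of the appearance
    frequencies of the values of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy attributes (illustrative data, not from the paper).
heights = [2.0, 4.0, 6.0, 10.0]
colors = ["red", "red", "blue", "blue"]

normalized = min_max_normalize(heights)
entropy = categorical_entropy(colors)
```

An attribute whose values are spread evenly over many categories has high entropy, while an attribute on which all objects take the same value has entropy 0.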

B. CONSTRUCTING AN ADAPTIVE k-NN GLOBAL WEIGHTED DIRECTED NETWORK MODEL
The k-NN of a data object $x_i$ is the set of data objects whose distance from $x_i$ is less than or equal to the distance from $x_i$ to its k-th neighbor (denoted by $D_i^*$), that is, the objects $x_m$ with $D_{im} \le D_i^*$, where $D_{im}$ is the distance between $x_i$ and $x_m$. The Heterogeneous Euclidean-Overlap Metric [38] is used to calculate the distance between data objects on the mixed-valued attributes.
The value of k has a great impact on the accuracy of outlier detection results. Scholars have proposed many methods to obtain it; however, most of them determine k by combining parameter optimization algorithms with evaluation indexes on the premise that the outliers in the dataset are known. For unlabeled datasets, such methods can hardly give an appropriate k. In particular, the distribution characteristics of data objects vary from dataset to dataset. To ensure the accuracy of the detection results, k needs to be determined according to the distribution characteristics of the dataset. Therefore, based on the principle of the k-NN, an adaptive search algorithm for the appropriate value of k according to the distribution characteristics of the dataset is given in Algorithm 1.
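As an illustration of the distance and neighborhood definitions above, the following sketch implements the standard, unweighted Heterogeneous Euclidean-Overlap Metric (overlap distance for categorical attributes, range-normalized difference for numerical ones, distance 1 for missing values) together with the tie-inclusive k-NN rule $D_{im} \le D_i^*$. The function names and toy data are hypothetical, and the version used in the proposed method may additionally apply the attribute weights, which are omitted here.

```python
import math


def heom(x, y, is_categorical, ranges):
    """Standard unweighted HEOM distance between two mixed-valued objects.

    is_categorical[a] flags attribute a as categorical; ranges[a] is the
    value range (max - min) of numerical attribute a and is ignored for
    categorical attributes.
    """
    total = 0.0
    for xa, ya, cat, rng in zip(x, y, is_categorical, ranges):
        if xa is None or ya is None:
            d = 1.0                                  # missing value: maximal distance
        elif cat:
            d = 0.0 if xa == ya else 1.0             # overlap metric
        else:
            d = abs(xa - ya) / rng if rng > 0 else 0.0  # range-normalized difference
        total += d * d
    return math.sqrt(total)


def k_nearest(data, i, k, is_categorical, ranges):
    """Indices m != i with D_im <= D_i*, where D_i* is the distance from
    object i to its k-th nearest neighbor (ties are kept)."""
    dists = sorted(
        (heom(data[i], data[m], is_categorical, ranges), m)
        for m in range(len(data)) if m != i
    )
    d_star = dists[k - 1][0]
    return [m for d, m in dists if d <= d_star]


# Toy DMA: one numerical and one categorical attribute (illustrative only).
data = [(0.0, "a"), (1.0, "a"), (10.0, "b"), (0.5, "b")]
is_categorical = [False, True]
ranges = [10.0, None]

neighbors_of_0 = k_nearest(data, 0, 2, is_categorical, ranges)
```

In this toy example, object 3 is numerically closer to object 0 than object 1 is, yet it ranks second because the categorical mismatch contributes a full unit of distance.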

Algorithm 1 Automatic search algorithm for the value of k
In Algorithm 1, $\Omega_r(x_i)$ is the set of data objects that regard $x_i$ as one of their r-NN, and $N(\Omega_r(x_i))$ is the number of elements in $\Omega_r(x_i)$.
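The reverse-neighbor sets $\Omega_r(x_i)$ and their sizes $N(\Omega_r(x_i))$ can be computed directly from the r-NN lists. The sketch below shows only this bookkeeping step; the function name and the toy neighbor lists are hypothetical, and the stopping rule that Algorithm 1 applies to these counts is not reproduced.

```python
def reverse_neighbors(neighbors):
    """Given neighbors[i] = indices of the r-NN of object i, return a list
    omega where omega[i] is the set of objects that regard i as one of
    their r-NN."""
    omega = [set() for _ in neighbors]
    for i, nbrs in enumerate(neighbors):
        for m in nbrs:
            omega[m].add(i)
    return omega


# Toy 1-NN lists for four objects (illustrative only):
# 0 -> 1, 1 -> 0, 2 -> 1, 3 -> 2
neighbors = [[1], [0], [1], [2]]

omega = reverse_neighbors(neighbors)
counts = [len(s) for s in omega]  # N(Omega_r(x_i)) for each object
```

Here object 3 is selected by no other object, so its count is 0, while object 1 is selected by two objects.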
Then, an adaptive k-NN weighted directed network model $M = (X, E)$ (shown in Fig. 1 (b)) is constructed based on the k-NN of each data object, where the elements of $X = (x_1, x_2, \ldots, x_n)$ are the nodes (data objects) of the network model and E is a binary matrix whose element $E_{im}$ indicates whether there exists an edge from $x_i$ to its neighbor $x_m$. The adaptive k-NN weighted directed network of the dataset can be represented by an adjacency matrix A under the rule that, if $A_{im} > 0$, there is a directed edge from $x_i$ to $x_m$ and the weight on this edge is $A_{im}$. In this network, each node has an edge directed to each of its k-NN, and the weight of the edge is defined by Eq. (12).
As shown in Fig. 1 (b), the network contains two sub-networks. Depending on the distribution of the dataset, there may be multiple sub-networks or isolated nodes. Therefore, a node $x_G$ is introduced, and bidirectional edges are assumed between $x_G$ and every $x_i$ (i = 1, 2, …, n), so as to form an adaptive k-NN global weighted directed network, as shown in Fig. 1 (c). Together with the weights of these bidirectional edges, this yields the adjacency matrix $A_G$ of the adaptive k-NN global weighted directed network.
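The structural effect of the global node can be illustrated with a small sketch. It builds the binary edge matrix E from k-NN lists and appends $x_G$ as an extra row and column. The function names are hypothetical, and the unit weights on the edges to and from $x_G$ are placeholders chosen for illustration; they are not the edge weights defined by the method's equations.

```python
def edge_matrix(neighbors):
    """Binary matrix E with E[i][m] = 1 if x_m is in the k-NN of x_i."""
    n = len(neighbors)
    E = [[0] * n for _ in range(n)]
    for i, nbrs in enumerate(neighbors):
        for m in nbrs:
            if m != i:          # no self-loops
                E[i][m] = 1
    return E


def add_global_node(A, to_global, from_global):
    """Append x_G as the last node: to_global[i] weights the edge
    x_i -> x_G and from_global[i] weights the edge x_G -> x_i."""
    A_G = [list(row) + [to_global[i]] for i, row in enumerate(A)]
    A_G.append(list(from_global) + [0])
    return A_G


def reachable(A, start):
    """Set of nodes reachable from `start` along edges of positive weight."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for m, w in enumerate(A[i]):
            if w > 0 and m not in seen:
                seen.add(m)
                stack.append(m)
    return seen


# Two disconnected sub-networks: {0, 1} and {2, 3} (illustrative only).
E = edge_matrix([[1], [0], [3], [2]])
# Placeholder unit weights for the edges to and from x_G (an assumption).
A_G = add_global_node(E, to_global=[1, 1, 1, 1], from_global=[1, 1, 1, 1])
```

Without $x_G$, a walker starting at node 0 can only ever visit nodes 0 and 1; with $x_G$, every node is reachable from every other node.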

C. PERFORMING A RANDOM WALK PROCESS TO IDENTIFY OUTLIERS
In a network, a random walk process is a stochastic process in which a random walker moves from $x_i$ to $x_m$ (i, m = 1, 2, …, n, m ≠ i) at the next step with a specific probability. The transition probability of the random walker moving from one node to another depends only on the current state and remains unchanged throughout the process. Here, $x^{t}$ and $x^{t-1}$ denote the position of the random walker at time steps t and t-1, respectively, and $p_{im}^{t}$ denotes the transition probability of the random walker moving from $x_i$ to $x_m$ at time step t.
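The two properties stated above, dependence on the current state only and invariance over time, correspond to the usual definition of a time-homogeneous Markov chain. A standard way to write it, given here as a textbook statement with assumed superscript notation for the time step rather than as the paper's own equation, is:

```latex
P\left(x^{t} = x_m \mid x^{t-1} = x_i,\; x^{t-2},\; \ldots,\; x^{0}\right)
  = P\left(x^{t} = x_m \mid x^{t-1} = x_i\right)
  = p_{im}^{t},
\qquad
p_{im}^{t} = p_{im} \quad \text{for all } t .
```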
The adaptive k-NN global weighted directed network represented by the adjacency matrix $A_G$ uniquely defines a random walk process. The transition probability matrix is obtained from $A_G$ by dividing each row by its sum, that is, by left-multiplying $A_G$ by $D^{-1}$, where D is a diagonal matrix whose i-th diagonal element equals the sum of the i-th row of $A_G$. The random walker is expected to jump, at the next time step, to a neighbor with greater similarity to the current node. Using the edge weights between nodes to customize the transition probability achieves this effect and governs the transitions between nodes. Starting from any state at any time, the random walk process converges to an equilibrium state after a certain number of iterations, after which the probability of the random walker being at each node no longer changes. The stationary distribution vector of the random walk process on the adaptive k-NN global weighted directed network can be estimated iteratively. In the iteration, each element of $\pi^{t} = (\pi_1^{t}, \pi_2^{t}, \ldots, \pi_n^{t}, \pi_G^{t})$ is the probability that the random walker is at the corresponding node at time step t, starting from an initial vector $\pi^{0}$. After the random walk process reaches equilibrium, each element of the stationary distribution vector can be interpreted as the visiting probability of the corresponding node in the adaptive k-NN global weighted directed network. Given the constraints on the behavior of the random walker, potential outlier nodes have fewer chances to be visited and are therefore assigned relatively smaller values in the stationary distribution vector.
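The row normalization and the iterative estimate of the stationary distribution can be sketched as follows. This is a minimal power-iteration example under stated assumptions: the function names are hypothetical, the initial vector is taken to be uniform, the tolerance is arbitrary, and the toy adjacency matrix uses unit weights rather than the method's edge weights. The last row and column stand for $x_G$.

```python
def transition_matrix(A_G):
    """Row-normalize the adjacency matrix: P = D^{-1} A_G."""
    P = []
    for row in A_G:
        s = sum(row)
        if s == 0:
            raise ValueError("every node needs at least one outgoing edge")
        P.append([w / s for w in row])
    return P


def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Iterate pi^{t+1} = pi^t P from a uniform start (an assumption)
    until successive vectors differ by less than tol."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][m] for i in range(n)) for m in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Toy network with unit weights (illustrative only): nodes 0 and 1 are
# mutual neighbors, node 2 points to node 0 but is nobody's neighbor,
# and the last node is x_G with bidirectional edges to all real nodes.
A_G = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
P = transition_matrix(A_G)
pi = stationary_distribution(P)
```

In this toy network, node 2 is reached only through $x_G$ and therefore receives the smallest visiting probability among the real nodes, in line with the behavior described above. The conversion of these probabilities into the outlier score is not reproduced here.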
Considering the influence of $x_G$, an index (the outlier score) is developed to measure the outlier degree of data objects by assigning the visited probability of $x_G$ to each real node. To ensure the convergence of the random walk process, $x_G$ is added during construction to guarantee the connectivity of the network model, and self-loops are not allowed in the network model, to avoid self-reinforcement. The transition probability matrix of the random walk process is defined by the similarity between a data object and its neighbors, so a random walker located on a real node is most likely to jump to its most similar neighbor at the next time step. Eqs. (11)-(14) ensure that the random walker jumps with greater probability (i) to the neighbor with the greatest similarity when it is at a real node, (ii) to $x_G$ when it is at an isolated node, and (iii) to the node that has the greatest similarity with all of its neighbors when it is at $x_G$. Each value in the stationary distribution vector expresses the probability that a node is visited once the random walk process has reached a stable state. Based on this transition mechanism, potential outliers should receive a smaller visiting probability. To eliminate the influence of $x_G$ on the detection results, the visited probability of $x_G$ is assigned to the real nodes based on Eq. (13) and Eq. (14) to define the outlier score ($\Psi_i$). It can be seen from Eq. (19) that the larger $\Psi_i$ is, the more outlying $x_i$ tends to be.
The proposed outlier detection methodology is presented in Algorithm 2.

IV. EXPERIMENTS
The proposed method is experimentally compared on three UCI datasets against three related outlier detection methods: the neighborhood information entropy-based outlier detection method (NIEOD) [39], OutRank [23], and the virtual outlier score model (VOS) [33]. The experimental environment is shown in Table 1. The NIEOD detects outliers in datasets with numerical, categorical, and mixed-valued data by using the neighborhood information system and information entropy. The heterogeneous distance and a self-adapting radius are applied to determine the neighborhood information system of the dataset. The neighborhood information entropy, relative neighborhood entropy, deviation degree, and outlier factor are then constructed from the neighborhood information around each data object to measure its outlier degree. The method extends outlier detection methods based on the traditional distance and rough sets, and is better suited to datasets with some uncertainty mechanisms. It has two parameters, namely the neighborhood radius adjustment parameter and the judgment threshold.
OutRank explores the application of the random walk process to outlier detection for datasets with numerical data. The random walk process can effectively capture not only uniformly dispersed outliers but also small clusters of outliers. OutRank builds a weighted undirected neighborhood network model and defines the similarity metric in two ways, using the cosine similarity between objects and the shared-nearest-neighbor density, respectively. It has yielded higher detection rates with lower false alarm rates than distance-based and density-based outlier detection methods on both real and synthetic datasets.
The VOS improves the outlier detection methods based on the random walk process by combining the k-NN with the network model to identify outliers in datasets with numerical attributes. It uses the top-k similar neighbors of each data object to construct the network model and detects outliers by executing a tailored random walk process. By making full use of the local information of each data object and considering its outlier degree from a global perspective, the effectiveness of this method is demonstrated theoretically. The method proposed in this paper can identify outliers in datasets with numerical, categorical, or mixed-valued data. It integrates the advantages of the above methods and overcomes their defects to a certain extent. Moreover, the proposed method does not require a predetermined parameter such as k.

A. UCI DATASETS
The Glass Identification, Hayes-Roth, and Lymphography datasets from the UCI Machine Learning Repository [40] are used to evaluate the performance of the proposed method. These datasets are labeled for classification and their rare classes are known; the data objects in the rare classes are considered outliers.
The Glass Identification dataset contains 214 data objects, which are described by 1 name attribute, 9 numerical attributes, and 1 class attribute. The 214 data objects are divided into 6 categories with 70, 76, 17, 13, 9, and 29 data objects, respectively. Class 5 has the fewest objects and is treated as the rare class, whose data objects are the outliers.
The Hayes-Roth dataset contains 132 data objects with 1 name attribute, 4 categorical attributes, and 1 class attribute. The 132 data objects are divided into 3 categories with 51, 51, and 30 data objects, respectively. Class 3 has the fewest objects and is treated as the rare class, whose data objects are the outliers.
The Lymphography dataset contains 148 data objects, which are described by 3 numerical attributes, 15 categorical attributes, and 1 class attribute. The 148 data objects are divided into 4 categories with 2, 81, 61, and 4 data objects, respectively. Class 1 and class 4 have the fewest objects and are treated as the rare classes, whose data objects are the outliers.
The data distribution of the above three datasets is shown in Fig. 2 (a), (b), and (c) respectively.

B. EVALUATION METRICS FOR THE PERFORMANCE OF METHODS
To quantitatively analyze the experimental results of the different outlier detection methods, three traditional information system quality metrics, "precision" (Pre), "recall" (Rec), and "rank power" (RP), are used in this paper [41].
Pre is the proportion of real outliers among the first z data objects, $\mathrm{Pre} = N_{identified} / z$, where $N_{identified}$ represents the number of real outliers identified as outliers in the first z data objects.
Rec is the ratio of the real outliers identified in the first z data objects to all real outliers in the dataset, $\mathrm{Rec} = N_{identified} / N_{real}$, where $N_{real}$ is the number of real outliers in the dataset.
Pre and Rec estimate the accuracy of the detection results of an outlier detection method, but neither of them can accurately compare the quality of the detection results of different methods. For example, Pre and Rec are the same whether two real outliers are identified at the first two positions or at any other two positions within the first z data objects. The positions at which the real outliers are identified are usually an important factor in comparing different outlier detection methods. Therefore, RP is introduced to measure the positions and the number of the real outliers simultaneously, where RP ∈ [0, 1] and $O_L$ represents the position of the L-th real outlier. In particular, RP = 1 if and only if all real outliers are at the top of the objects selected by the outlier detection method. A larger RP means better performance of the outlier detection method.
Pre and Rec are positively correlated with the effectiveness of outlier detection methods. When Pre and Rec are the same, the larger the RP is, the more effective the outlier detection method is.
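The three metrics can be computed from a ranked list of data objects as in the following sketch. Pre and Rec follow the definitions above. For RP, the sketch uses the commonly cited form $n(n+1) / (2\sum_{L=1}^{n} O_L)$, where n is the number of real outliers found in the first z objects; this form is an assumption about the exact expression intended here, and the function name and toy ranking are hypothetical.

```python
def precision_recall_rank_power(ranked_ids, real_outliers, z):
    """Evaluate the first z objects of a ranking (most outlying first).

    Returns (Pre, Rec, RP). RP uses the commonly cited rank power form
    n(n+1) / (2 * sum of positions), which is an assumption here.
    """
    top = ranked_ids[:z]
    # 1-based positions O_L of the real outliers within the first z objects.
    positions = [pos for pos, obj in enumerate(top, start=1)
                 if obj in real_outliers]
    n_identified = len(positions)
    pre = n_identified / z
    rec = n_identified / len(real_outliers)
    if positions:
        rp = n_identified * (n_identified + 1) / (2 * sum(positions))
    else:
        rp = 0.0
    return pre, rec, rp


# Toy rankings with two real outliers, "a" and "c" (illustrative only).
real = {"a", "c"}
spread = precision_recall_rank_power(["a", "b", "c", "d", "e"], real, z=4)
on_top = precision_recall_rank_power(["a", "c", "b", "d", "e"], real, z=4)
```

Both rankings give the same Pre (0.5) and Rec (1.0) at z = 4, but RP is 0.75 when the outliers sit at positions 1 and 3 and 1.0 when they occupy the first two positions, which is the distinction RP is meant to capture.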

C. EXPERIMENT RESULTS
Here, the detection results of the proposed method are compared with those of the other three methods on the three different types of UCI datasets using the above metrics. None of the four methods directly classifies the data objects into normal objects and outliers. Instead, each data object is assigned a score that measures its outlier degree, and the data objects are then sorted (in ascending or descending order) according to the judgment mechanism of the method. The first z data objects, which may include real outliers and/or identified (but not real) outliers, are chosen to verify the performance of the methods. Because each method constructs its outlier index differently, the scores produced by the proposed method, the NIEOD, and the VOS are sorted in descending order, while those produced by OutRank are sorted in ascending order. As the number z of selected objects increases, the number of real outliers identified differs among the methods; the detection results on the three datasets are shown in Table 2, Table 3, and Table 4, respectively.
For better visualization, the detection results in Table 2, Table 3, and Table 4 are illustrated in Fig. 3 (a), (b), and (c), respectively, and the Mean of the outlier detection methods on the UCI datasets is shown in Fig. 4. The evaluation metrics of each method on the three datasets are listed in Table 5, Table 6, and Table 7, respectively, and the corresponding results are plotted in Fig. 5 and Fig. 6. The running times of the four outlier detection methods on the three datasets are given in Table 8.

D. DISCUSSION AND ANALYSIS
The experiment results in Section IV-C are analyzed in detail as follows:
(1) Glass Identification dataset
As shown in Table 2, Fig. 3 (a), and Fig. 4, the proposed method identifies the largest number of real outliers among the four methods as the number z of selected data objects increases. When the first 123 (57%) data objects are selected, the proposed method identifies all real outliers in the dataset, while the other three methods identify 6, 6 and 5 real outliers, respectively. These three methods identify all real outliers when the first 150 (70%), 213 (99%), and 200 (93%) data objects are selected, respectively. When the first 213 data objects are selected, all four methods have identified all real outliers in the dataset, and the average number of real outliers identified by each method is 5.9, 4.3, 3.5, and 3.6, respectively. Fig. 5 shows that when the number of selected data objects is at most 150, the Pre and Rec of the proposed method are larger than those of the other three methods. When z is larger than 150, the four methods have the same Pre and Rec, but the proposed method has a larger RP than the other three methods. Besides, as shown in Table 8, the running time of the proposed method on this dataset is only 4.2 s, while the running times of the other three methods are 207.5 s, 15.1 s, and 8858.4 s, respectively. In conclusion, the proposed method has better detection performance than the NIEOD, the OutRank, and the VOS on the Glass Identification dataset, which contains only numerical attributes.
(2) Hayes-Roth dataset
The proposed method detects all real outliers in the dataset when the first 49 (37%) data objects are selected, while the other three methods detect 14, 10, and 14 real outliers, respectively, as shown in Table 3 and Fig. 3 (b). The NIEOD, the OutRank, and the VOS detect all real outliers only when z is 132 (100%). In Fig. 4, the Mean of the proposed method is 24.6, which is significantly higher than that of the comparative methods. As the number z of selected data objects increases, the Pre and Rec of the proposed method remain significantly larger than those of the other three methods. When z is 132, the Pre and Rec of all four methods become equal. At this point, the RP of the proposed method is the largest among them, as shown in Fig. 6 (c). Moreover, the proposed method has the shortest running time, 0.87 s, while the other methods take 1.7 s, 1.5 s, and 306.4 s, respectively. Compared with the other three methods, the proposed method has obvious advantages in the number of real outliers identified, Pre, Rec, and running time on the Hayes-Roth dataset, which contains only categorical attributes.
(3) Lymphography dataset
As shown in Table 4, Fig. 3 (c), and Fig. 4, the proposed method identifies 3 real outliers in the dataset when the first 3 (2%) data objects are selected, while the NIEOD, the OutRank, and the VOS identify 2, 1, and 0, respectively. When z is 6 (4%), the four methods detect 4, 3, 2, and 0 real outliers, respectively. The NIEOD and the OutRank are the first to identify all real outliers, when 28 data objects are selected; at that point the proposed method and the VOS identify 5 and 2, respectively. The proposed method detects all real outliers in the dataset when z is 31. Nevertheless, the Mean of the proposed method is slightly higher than that of the other three methods. In Table 7 and Fig. 7, the Pre and Rec of the proposed method are the highest when z is less than 28, equal to those of the NIEOD and the OutRank when more than 28 data objects are selected, and slightly lower than those of the NIEOD and the OutRank only when the first 28 data objects are selected. Additionally, the running time of the proposed method is 1.1 s, while the other methods take 4.7 s, 5.1 s, and 302.0 s, respectively. Therefore, the proposed method can be applied to identify the real outliers in the Lymphography dataset, which contains both numerical and categorical attributes.
Based on the above observations, the following conclusions can be drawn:
(1) The proposed method can be used to detect outliers in different types of datasets, including datasets with numerical attributes, categorical attributes, and mixed-valued attributes.
(2) The adaptive k is obtained automatically according to the distribution characteristics of the dataset, which ensures higher quality of outlier mining and reduces the cost in the process of parameter adjustment.
(3) In the proposed method, the k-NN is used to mine the local information of data objects, and a customized random walk process is used to explore the long-term correlation between related data objects from a global perspective. The combination of the k-NN and the random walk process improves the accuracy of detection results.
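The combination of local k-NN information and a global random walk mentioned in conclusion (3) can be illustrated with a simplified sketch. It is not the customized random walk or the outlier score of the proposed method. The following are assumptions made only for illustration: the unweighted textbook form of the Heterogeneous Euclidean-Overlap Metric, a fixed k instead of the adaptive search, an edge weight of 1 / (1 + distance), and a uniform restart with a `damping` parameter so that the power iteration converges. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def heom(x: Sequence, y: Sequence, categorical: Sequence[bool],
         ranges: Sequence[float]) -> float:
    """Unweighted HEOM: overlap for categorical, range-normalised for numerical."""
    total = 0.0
    for xa, ya, is_cat, r in zip(x, y, categorical, ranges):
        if is_cat:
            d = 0.0 if xa == ya else 1.0
        else:
            d = abs(xa - ya) / r if r > 0 else 0.0
        total += d * d
    return math.sqrt(total)


def knn_random_walk(data: Sequence[Sequence], categorical: Sequence[bool],
                    k: int, damping: float = 0.85, iters: int = 200) -> List[float]:
    """Visiting probabilities of a random walk on a directed k-NN network."""
    n = len(data)
    # Value range of each numerical attribute (placeholder 0.0 for categorical).
    ranges = [0.0 if c else
              max(row[a] for row in data) - min(row[a] for row in data)
              for a, c in enumerate(categorical)]
    dist = [[heom(data[i], data[j], categorical, ranges) for j in range(n)]
            for i in range(n)]
    # Edge i -> j exists only if j is among the k nearest neighbours of i.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            weights[i][j] = 1.0 / (1.0 + dist[i][j])  # assumed similarity
    # Row-normalise the weights into transition probabilities.
    trans = [[w / sum(row) for w in row] for row in weights]
    # Power iteration with uniform restart (an assumption, not from the paper).
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [(1.0 - damping) / n
              + damping * sum(pi[i] * trans[i][j] for i in range(n))
              for j in range(n)]
    return pi


# Toy mixed-valued data: one numerical and one categorical attribute.
data = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.05, "a"), (9.0, "b")]
pi = knn_random_walk(data, categorical=[False, True], k=2)
least_visited = min(range(len(pi)), key=lambda i: pi[i])
print(least_visited, pi)
```

In this simplified setting the isolated object is never chosen as a neighbor, so the walker reaches it only through restarts and it receives the smallest visiting probability. How the equilibrium state is turned into an outlier score in the proposed method is defined in the method section and is not reproduced here.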

V. CONCLUSIONS
In this paper, an unsupervised outlier detection method based on the adaptive k-NN global network is proposed to identify the outliers in a DMA. First, the weights of the numerical and categorical attributes are determined separately to reduce the influence of different attributes on data objects, and the spatial distance between mixed-valued data objects is measured based on the Heterogeneous Euclidean-Overlap Metric. Second, an adaptive search algorithm for the appropriate value of k is introduced, which automatically obtains k according to the distribution of data objects in different datasets, and the k-NN of each data object is obtained. Next, a network model is constructed based on the neighborhood relationships between data objects, and a customized similarity measurement is applied to calculate the edge weights of the network, in which each edge weight is directly proportional to the similarity between the data object and its neighbor. Then, a special random walk process is performed on the network model by defining the transition probability matrix using the edge weights. After the random walk process reaches the equilibrium state, an outlier score (Ψ_i) is constructed to measure the outlier degree of each data object. Finally, a detailed empirical study on three typical UCI datasets illustrates the effectiveness, accuracy, and applicability of the proposed method in detecting outliers. The proposed method achieves higher Pre and Rec in the detection results than the other three methods. It can be applied to datasets with numerical attributes, categorical attributes or mixed-valued attributes.
However, calculating the stationary distribution vector is a time-consuming process, which limits the application of outlier detection methods based on the random walk process to large-scale and streaming datasets. Future work will focus on applying this method to practical application scenarios with large datasets.