An in-depth study and improvement of Isolation Forest

Historically, anomaly detection has been an important issue for industrial applications such as the detection of manufacturing failures or defects. It remains a current topic that tries to meet the ever-increasing demand in fields such as intrusion detection, fraud detection, ecosystem change detection and event detection in sensor networks. Anomaly detection therefore remains a research topic of great interest for various research communities. In this paper, we focus on Isolation Forest (IForest), a well-known and efficient anomaly detection algorithm, and provide a deep and complete view of it. We evaluate the impact of its input parameters (number of trees, sample size and decision threshold) on the efficiency of the detection and on the execution time, and we discuss the benefit of including some anomalies in the training phase. To address the limits of IForest, we performed different experiments on commonly used real datasets as well as on synthetic datasets with non-trivial distributions. We designed multidimensional datasets where anomalies are carried by several dimensions simultaneously. Moreover, we varied the density of, and the distance between, anomalies and normal data, to obtain a variable similarity between these two data classes. We compared the performance of IForest against its improved version, Extended IForest. Finally, we designed and validated a new extension of IForest, based on the individual decisions of the trees instead of a global forest decision, which we call Majority Voting IForest (MVIForest). The experiments show that MVIForest has a shorter execution time than IForest, with almost the same accuracy.

Clustering of Applications with Noise (DBSCAN) [20]), or distance based (such as K-means [14] and k-Nearest Neighbors (k-NN) [35]). More recently, deep learning approaches have been applied to detect anomalies in complex or high-dimensional data ([6], [11], [26], [28], [30]). Each approach has its strengths and weaknesses. Several criteria can be considered to compare these approaches and choose the most appropriate method for the addressed context. One can focus, for example, on method scalability (ability to handle large or multivariate data), human involvement (supervised, unsupervised or semi-supervised methods), response time (detection speed), resource consumption, or the efficiency of the detection compared to the sensitivity of the application area (tolerance to false positives or false negatives).

Isolation Forest (IForest) is an anomaly detection method based on a different concept compared to the approaches presented above (statistical, clustering, nearest neighbors, etc.).

[…] Moreover, anomalies are easy to identify when they are distinct from each other, which avoids the masking effect. A direct consequence of these two characteristics is that anomalous data are easier to isolate than normal data.

The majority of existing anomaly detection approaches construct a model from existing data, either based on knowledge acquired from unlabeled data for unsupervised methods, or using labels provided by an administrator (prior identification of a class) for semi-supervised and supervised methods. Modeling the behavior of normal (majority) data enables the identification of abnormal data as elements that do not respect the standard behavior. Some approaches calculate the distance or density of the different elements in the dataset and identify anomalies as elements with a distance exceeding a given threshold.
These methods provide good detection results but often suffer from a lack of scalability depending on the dimension or size of the dataset. Indeed, in the case of a massive dataset, the execution time and memory requirements of such a method can quickly reach their limits. In general, nearest-neighbor anomaly detection methods like LOF [3], k-NN [35], etc. have a quadratic complexity O(n²) [15], as they are based on distance computation between each pair of data items. The response time of these methods is also not adapted to the real-time processing of data streams produced today by various systems and sensors with an ever-increasing arrival rate. The calculation of all the distances needed to design the model and detect the anomalies can be difficult to achieve in such conditions.

Isolation Forest is an anomaly detection method based on an approach different from the others (statistics, clustering, nearest neighbors, etc.). It calculates neither distance nor density and therefore significantly reduces execution time and memory requirements. IForest has a low memory requirement [22] and a linear execution time, proportional to the size of the dataset (section VI-C). Its excellent scalability makes this method suitable for large datasets as well as for real-time processing.

Anomalies are rare and behave differently compared to normal data. Figure 1 shows an example of anomalous data (X_0) and normal data (X_i). Isolation is performed by successive splits of the dataset. One can notice that anomalous data are easier to isolate than normal data: X_0 is isolated in 3 steps whereas it took 11 steps to isolate the normal data (X_i).

IForest ([22], [24]) is the first method proposed in the category of isolation-based anomaly detection. It uses a set of random and independent trees (itrees) called a random forest. IForest generates a score for each data item using all the trees in the forest. Two input parameters are used to calculate this score: ψ, the size of the sample randomly chosen from the entire dataset, and t, the number of trees in the forest. Each tree is built independently by sampling the dataset; thus, the number of trees corresponds to the number of samples.

IForest has two stages: the training phase, which essentially corresponds to the construction of the forest, and the so-called scoring phase. As anomalies are assumed to be very few in comparison with the normal data, the sample may contain only normal data or a mixture of mostly normal data.

The isolation tree (itree) is a binary tree. Its construction is realized as follows. Initially, the root node contains all of the sample data. When building the tree, every internal node is split into two subnodes (left and right) until complete data isolation is reached or a maximal tree depth max_depth = ⌈log₂(ψ)⌉ is attained. A data item is considered isolated when it is alone in its node, as can be seen in figure 1 where X_0 and X_i are respectively isolated after three and eleven splits. Figure 2 shows an example of an itree.

FIGURE 2: itree example, ψ = 8.

To build the t trees of the forest, these two steps (sampling and building a tree) are repeated t times. Thus, each tree has its dedicated sample. For each itree, the sample is chosen
from the entire X dataset. The complexity of the training phase is O(tψ log ψ), because each of the ψ data items of each of the t trees must be isolated or quasi-isolated in the associated tree. The number of trees t is a key parameter for the performance of IForest.

B. SCORING PHASE

During the scoring phase, the score of each item of the X dataset is calculated. This score represents the degree of similarity between this item and the other items (mostly composed of normal data). To calculate this score, the item x has to be processed by each tree of the forest. At the end, the item x will be placed in an external node of each tree, depending on the split criteria. The number of nodes crossed by x from the root node to its external node is called the path length of x, denoted h(x). The pseudo-code calculating the path length of x in a tree is given in algorithm 2. Once x has been processed by all the trees in the forest, IForest calculates the average length of the t paths of x, denoted E(h(x)). Using a well-known result on Binary Search Trees (BST), the authors compute the score s(x, n) of x with the following formula:

s(x, n) = 2^(−E(h(x)) / C(n))

where C(n) = 2H(n − 1) − 2(n − 1)/n is the average length of the paths of an unsuccessful search in a binary search tree, and H(i) = ln(i) + 0.5772156649 (Euler's constant) is a harmonic number. Note that C(n) is simply used to obtain a normalized score s(x, n).
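As a concrete illustration, the normalization above can be written as a short Python sketch (function names are ours; this is not the reference implementation):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant used in H(i)

def C(n):
    """Average path length of an unsuccessful BST search over n items."""
    if n <= 1:
        return 0.0
    H = math.log(n - 1) + EULER_GAMMA          # harmonic number H(n - 1)
    return 2.0 * H - 2.0 * (n - 1) / n

def score(avg_path, n):
    """s(x, n) = 2 ** (-E(h(x)) / C(n)); near 1 = anomaly, below 0.5 = normal."""
    return 2.0 ** (-avg_path / C(n))
```

An item isolated immediately (E(h(x)) = 0) gets s = 1, while an item whose average path equals exactly C(n) gets s = 0.5, which matches the classification rules of the authors.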

With this score formula, the authors classify the data as follows:

• If E(h(x)) → C(n), s → 0.5. If every item has a score s ≈ 0.5, then the dataset does not contain any identifiable anomaly;

• If E(h(x)) → 0, s → 1. If an item x has a score very close to 1, then it is an anomaly;

• If E(h(x)) → n − 1, s → 0. If an item x has a score s much lower than 0.5, then it is normal data.

A relatively short average path length E(h(x)) implies that the forest of t trees globally classifies x as an anomaly. The complexity of the scoring phase is O(nt log ψ), because each item of the dataset of size n is processed by each of the t trees.

VOLUME 4, 2016

Algorithm 1: iTree(X, e, max_depth) – Tree construction
Input: X – set of data in the node; e – current depth of the tree; max_depth – maximal depth of the tree
Output: an iTree
1 if |X| <= 1 or e >= max_depth then
2   return ExternalNode(size = |X|)
3 else
4   Q ← list of attributes in X
5   q ← random choice of one attribute in Q (q ∈ Q)
6   p ← random choice of one value between the min and the max of the values of attribute q in X
7   Xl ← items of X with value < p on q; Xr ← items of X with value ≥ p on q
8   return InternalNode(left ← iTree(Xl, e+1, max_depth), right ← iTree(Xr, e+1, max_depth), splitDimension ← q, splitValue ← p)

Algorithm 2: PathLength(x, T_i, e) – Path length computation
Input: x – a data item; T_i – an itree; e – current path length /* e is initialised to 0 for the first call */
Output: the path length of x in the itree T_i
1 if T_i is an external node then
2   return e + C(T_i.size) /* C(n) as defined above */
3 q ← T_i.splitDimension
4 if x_q < T_i.splitValue then return PathLength(x, T_i.left, e + 1)
5 else return PathLength(x, T_i.right, e + 1)

When anomalies are close to each other, the number of splits required to isolate each anomaly increases [22]. These abnormal data may therefore be considered as normal. The fact that IForest relies on random samples, and not on the entire dataset, to build the forest of random trees helps to handle masking and swamping. In fact, sampling makes it possible to consider data with a lower density compared to the real dataset, which better separates anomalies from normal data and also from each other. Moreover, each tree in the forest is built with its own sample. Trees do not necessarily isolate the same anomalies, and some samples may not even contain any anomaly. Hence IForest is robust against swamping and masking effects.
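The two algorithms can be condensed into a small runnable Python sketch (a simplification under our own naming, using dicts for nodes rather than node objects):

```python
import math
import random

def c(n):
    # Average path length of an unsuccessful BST search, used to
    # compensate truncated branches (c(n) = 0 when n <= 1).
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_itree(X, depth, max_depth):
    # External node: data isolated or depth limit reached.
    if len(X) <= 1 or depth >= max_depth:
        return {"size": len(X)}
    q = random.randrange(len(X[0]))                 # random attribute
    lo = min(x[q] for x in X)
    hi = max(x[q] for x in X)
    if lo == hi:                                    # attribute cannot split
        return {"size": len(X)}
    p = random.uniform(lo, hi)                      # random split value
    return {"q": q, "p": p,
            "left":  build_itree([x for x in X if x[q] < p],  depth + 1, max_depth),
            "right": build_itree([x for x in X if x[q] >= p], depth + 1, max_depth)}

def path_length(x, node, depth=0):
    # Follow x down the tree; adjust with c(size) when an
    # external node still holds more than one item.
    if "size" in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)
```

Averaged over a small forest, a far-away point obtains a shorter path than a point inside the normal cluster, which is exactly the property the score exploits.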

The isolation technique for anomaly detection has been addressed in several research studies. Since the first IForest method ([22], [24]), designed in 2008, many adaptations and improvements have been proposed. Several researchers have identified some limits of IForest and proposed improvements to overcome them. We classify the proposed methods into two categories according to the type of data considered.

In [23], the authors proposed the SCiForest method which, like distance- and density-based methods, makes it possible to identify clusters in the data. SCiForest is therefore an evolution of IForest towards clustering, in order to detect local anomalies. SCiForest randomly chooses a hyperplane to split the data in a node, and thus takes into account anomalies carried by several attributes at the same time. To separate the different clusters, SCiForest establishes a split criterion for each node taking into account the standard deviation of the data in this node. Even if the processing of SCiForest is adapted to complex data, its high complexity represents a major drawback of this method. In the same context of how the data in a node are split, the authors of [17] proposed the Extended IForest algorithm (EIF). EIF corrects the bias introduced in IForest by the vertical or horizontal splitting of the nodes, which creates an inconsistency in the scores provided by IForest. This method is further discussed in section IV. In [31], the authors recently proposed the FIF (Functional IForest) method to detect anomalies in functional datasets. Starting from the observation that IForest is not efficient for every dataset distribution, [25] proposed the Hybrid IForest (HIF) method. They added a new decision criterion taking into account the similarity between the data of the same leaf node, in order to consider the risk that abnormal data are located in a leaf node having a relatively long path. The main objective of this additional step is to reduce false negatives. In fact, in the original version of IForest, some anomalies can be missed because of their similarity: this is the masking effect previously explained.
Combining supervised and unsupervised techniques, HIF is able to detect anomalies in various datasets with different distributions. However, it is not always easy to obtain the labels needed for supervised anomaly detection.

In [8], the authors used IForest to compute the path length of each data item in the trees and defined this metric as a new distance between two points. Compared to other distance methods, […] It also offers an adaptation of IForest for categorical and missing data.

In [36], the authors noticed that, although it is claimed that […]

IForest ASD, HSTrees [32] and PCB-iForest [18] are adaptations of IForest to the context of data streams. IForest ASD uses a sliding-window technique to retrieve data, and in each window the IForest method is executed to detect anomalies based on a model previously created with the data from the previous windows. In case of drift, this model is reinitialized. HSTrees is an evolution of IForest designed for streaming. HSTrees splits the nodes using the average of the node items for the randomly selected attribute. As a result, unlike IForest ASD, HSTrees manages the concept drift ([12], [13]) automatically, without updating its model by reinitialization. Indeed, HSTrees is faster than IForest ASD and builds its model independently of the considered dataset. IForest ASD, however, is closely tied to the dataset. In fact, to manage the concept drift, IForest ASD maintains an input value (µ) which is the expected anomaly rate. When, in a given window, the anomaly rate exceeds µ, IForest ASD assumes that a change has occurred in the normal behavior of the data (a drift) and therefore updates the model with the data of the current window. This update consists of deleting the model and rebuilding a new one from the data of the current window. This approach is not very efficient, because the whole history of the normal behavior is lost at each concept drift.

Randomized Space Trees (RS-Forest) [33] is also a method for detecting anomalies in data streams based on the concept of isolation. It relies on a density estimator used to decide whether it is necessary to update the model to manage the concept drift. RS-Forest is based on the assumption that anomalous data have a low density, which joins the hypothesis previously discussed on the scarcity of anomalous data and their difference both from each other and from normal data.

In [2], the authors provided a recent global review of tree-based methods for anomaly detection. In this part, we only focus on isolation-based anomaly detection methods. A summary of the previously described isolation-based anomaly detection methods is given in figure 3.

The isolation-based anomaly detection methods have been implemented in several frameworks, the best known being scikit-learn¹, which focuses on static data and implements the IForest algorithm. Another implementation of IForest is provided by the H2O framework², which is also very well known in the machine learning domain. Some versions suitable for data streams, like HSTrees [32] and IForest ASD, […] Moreover, it does not depend on the data distribution. In the next sections, we will focus on EIF, which we will compare to IForest through different experiments. We will also present our new extension of IForest.

IForest is a powerful method, but it has some drawbacks.

One of the limitations of IForest that has been addressed in the literature is its inconsistency in the classification of data. IForest gives a score to each data item. According to the definition of an anomaly, we expect that the more a data item differs from the others, the higher its score. But this is not always the case with the scores given by IForest.

[…] The homogeneous light color in the center of the heat map gives the impression that the data are distributed on a disc, and ignores the ring shape with an almost empty center. Far from the dense areas, scores become higher. We can clearly see the edge effect on the border areas, as well as darker areas which correspond to empty areas. However, the heat map does not have the same symmetry as the data: it is not invariant by rotation. More precisely, the ring has turned into a square, and we can see the artificial appearance of four dark corners with very high scores. These areas are considered as anomalies.

This situation is due to the way IForest splits the nodes during the training phase. Indeed, IForest splits each node vertically or horizontally. This creates artificial areas of concentration, as if other data were there. Figure 4 shows an example of node splitting when building a forest tree. Note that the different splits of the dataset form artificial rectangular areas of high density outside the ring of the dataset.

FIGURE 4: Example of node splitting by IForest and the associated heat map of scores: presence of fictitious zones. X and Y are the coordinates of the data items.
An evolution of IForest that overcomes this problem is EIF. The key idea of EIF is to split the data of a node according to a randomly chosen direction (not necessarily horizontal or vertical as in IForest). This eliminates the fictitious zones created by IForest and consequently improves the consistency of the scores.
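The idea of an oblique split can be sketched in a few lines of Python (our own simplified reading of such a split: a random direction drawn from a Gaussian and a random intercept point; names and distributions are our assumptions, not the reference EIF code):

```python
import random

def oblique_split(X):
    """One EIF-style split: draw a random direction n and an intercept
    point p; x goes left when (x - p) . n <= 0, right otherwise."""
    d = len(X[0])
    n = [random.gauss(0.0, 1.0) for _ in range(d)]        # random direction
    p = [random.uniform(min(x[j] for x in X),
                        max(x[j] for x in X)) for j in range(d)]  # intercept
    side = lambda x: sum((x[j] - p[j]) * n[j] for j in range(d))
    left = [x for x in X if side(x) <= 0]
    right = [x for x in X if side(x) > 0]
    return n, p, left, right
```

Because the separating hyperplane is no longer forced to be axis-parallel, repeated splits no longer carve the rectangular high-score artifacts visible on the IForest heat map.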

As explained in section II, in the case of 2-dimensional data, IForest splits each node horizontally or vertically (see figure 1). This breakdown creates a bias clearly visible on the heat map of the scores. The Extended Isolation Forest method, which aims to correct this inconsistency, was proposed in [17]. The major difference between EIF and IForest is the way the data in a node are split. EIF has two stages, just like IForest: the training phase and the scoring phase. While the scoring phase remains the same for both methods, the training phase has changed significantly. Indeed, unlike IForest, EIF splits the nodes according to a point and a direction randomly chosen as a combination of all the dimensions.

[…] In order to test the behavior and sensitivity of IForest to different variations in data characteristics, we used synthetic datasets. Indeed, real datasets often contain anomalies carried by a single attribute and do not allow a detailed analysis of the impact of certain parameters. We thus varied, on the synthetic datasets, the dimension of the data, the density of the normal and abnormal data, as well as the distance between the anomalies and the normal data. These configurations make it possible to evaluate the limits of IForest and to adjust the choice of its input parameters. In the case of multivariate data, we designed anomalies carried by all the dimensions at the same time, well enveloped in the normal data.

The designed synthetic datasets are presented, in two dimensions, in the form of a ring where the normal data are uniformly distributed. Abnormal data are located in the center of the ring, uniformly distributed on a disk of smaller radius. In three dimensions, the normal data are contained in the thick envelope of a sphere and the anomalies are grouped in the center of the sphere, uniformly distributed in a smaller sphere. The abnormal data are therefore well enveloped in the normal data.

[…] A further objective is to test the effect of the previously discussed masking. We therefore designed anomalies with a high similarity. […]

Note that the duration of the training phase is constant, independent of the total size of the data [24]. IForest also has a low memory requirement, equal to O(tψ) (see [24]). The memory consumed is therefore constant, independent of the size of the dataset, which represents a considerable advantage of the method for processing large data.

Anomaly detection is an exercise in binary classification on imbalanced datasets: there are only two possible classes, normal or abnormal. In this paper, we used the scikit-learn [29] framework. However, unlike the definitions implemented in scikit-learn, we consider here the following well-known convention adopted by the anomaly detection community: the normal class is denoted Negative and the abnormal class is denoted Positive. We used the scikit-learn API to compute the recall, ROC AUC and F1 score from the confusion matrix.
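Such a ring dataset can be generated with a few lines of Python (function name and parameter values are illustrative, not the exact settings of our experiments):

```python
import math
import random

def ring_dataset(n_normal=1000, n_anomalies=50,
                 r_in=4.0, r_out=5.0, r_anom=1.0, seed=0):
    """2-D synthetic set: normal points uniform in a ring [r_in, r_out],
    anomalies uniform in a small central disk of radius r_anom."""
    rng = random.Random(seed)

    def uniform_annulus(r0, r1):
        # Uniform over the annulus AREA: r = sqrt(U(r0^2, r1^2)),
        # not r = U(r0, r1), which would over-sample small radii.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(rng.uniform(r0 * r0, r1 * r1))
        return [r * math.cos(theta), r * math.sin(theta)]

    normals = [uniform_annulus(r_in, r_out) for _ in range(n_normal)]
    anomalies = [uniform_annulus(0.0, r_anom) for _ in range(n_anomalies)]
    return normals, anomalies
```

Varying r_anom and the gap between r_anom and r_in reproduces the density and distance variations discussed above.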

Precision depends on True Positives (TP) and False Positives (FP). Although precision is useful for assessing the classification ability of a method, in some cases it is not the best metric to consider, particularly in the imbalanced case where normal data obviously outnumber abnormal data. In fact, in this case, the number of false positives (false alarms) can be larger than the number of true positives. The specificity is useful for anomaly detection, as it helps to evaluate the performance of the method in detecting normal data and avoiding false alarms.

In some application fields, we prefer to focus on the anomalies, more precisely on the rate of well-classified abnormal data. In this case, the most adapted metric is the recall.
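With the Positive = abnormal convention, these metrics follow directly from the confusion matrix; a plain-Python sketch (our own helper, not the scikit-learn API):

```python
def detection_metrics(y_true, y_pred):
    """Positive = anomaly (1), Negative = normal (0), as in the text.
    Returns recall, specificity and false-alarm rate (FAR)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if tp + fn else 0.0       # anomalies caught
    specificity = tn / (tn + fp) if tn + fp else 0.0  # normal data kept
    far = fp / (fp + tn) if fp + tn else 0.0          # false-alarm rate
    return recall, specificity, far
```

Note that specificity and FAR are complementary (specificity = 1 − FAR), which is why the trade-off discussed below is usually stated between FAR and recall.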

There is clearly a trade-off between FAR and recall. […]

IForest is based on a score, denoted s(x), which is calculated for each data item x on the basis of the average length of its path E(h(x)) in the different trees. In the original IForest paper [22], the authors recommend 0.5 as the decision threshold applied to the score s. Indeed,

• if E(h(x)) → 0, then s → 1. When s is very close to 1, x can be considered as an anomaly;

There is clearly a lack of precision on the threshold above which a given data item should be considered abnormal. This weakness represents a first limitation of IForest.

IForest has two input parameters: the sample size and the number of samples. In [22] and [24], the authors' recommendation is to set ψ to 256. According to their experiments, ψ = 256 gives good results with a low execution time and a low memory requirement. The authors recommend setting t to 100 trees for stable results. They demonstrate that with 100 trees the result is optimal and that beyond 100 trees there is no […] because each data item has to be processed by each tree and build its path before a decision is made. This execution time can be considerably reduced by choosing another approach. We propose to apply a majority-voting method to declare a data item an anomaly. This allows us to reduce the false alarm rate and the execution time. This method is presented in section VII.
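These recommended defaults map directly onto the scikit-learn implementation; a minimal sketch (the dataset and the contamination value are ours, for illustration only):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),    # normal cluster
               rng.uniform(8.0, 10.0, size=(10, 2))])  # obvious anomalies

# Recommended defaults from [22], [24]: t = 100 trees, psi = 256.
clf = IsolationForest(n_estimators=100, max_samples=256,
                      contamination=10 / 510, random_state=0)
labels = clf.fit_predict(X)        # +1 = normal, -1 = anomaly
scores = -clf.score_samples(X)     # negated so that higher = more abnormal
```

Note that scikit-learn hides the 0.5 threshold behind the contamination parameter, which illustrates in practice the threshold imprecision discussed above: the user still has to supply an expected anomaly rate.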

IForest is a multi-step random-choice method. Indeed, the choice of the sample is random for each tree in the forest, and the choice of the considered dimension as well as of the split value is random for each node. It is worth checking the impact of this randomness on the results of IForest by inspecting the variance of the results over several runs. We carried out this experiment by executing IForest 10 times with the same decision threshold (threshold = 0.5) on the shuttle dataset described in subsection V-A.

Table 3 shows that the 10 successive executions of IForest give fairly constant results in terms of ROC AUC, specificity and recall. The standard deviations of these three metrics are very close to zero. These results illustrate the stability of IForest despite its randomness. We can therefore rely on a single execution of IForest.
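The variance check can be reproduced in a few lines (the synthetic data and seeds are ours; the experiment in the text uses the shuttle dataset):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 2)),   # normal data
               rng.uniform(6.0, 8.0, size=(15, 2))])  # anomalies
y = np.r_[np.zeros(300), np.ones(15)]                 # 1 = anomaly (Positive)

aucs = []
for seed in range(10):                                # 10 independent runs
    clf = IsolationForest(n_estimators=100, max_samples=256,
                          random_state=seed).fit(X)
    # score_samples is higher for normal data, so negate it for the AUC
    aucs.append(roc_auc_score(y, -clf.score_samples(X)))

print(f"mean AUC = {np.mean(aucs):.3f}, std = {np.std(aucs):.4f}")
```

On such data the standard deviation of the AUC over the runs stays close to zero, mirroring the stability observed in Table 3.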

The input parameters of IForest, the sample size (ψ) and the number of trees (t), are of high importance for its performance. We carried out some experiments to assess the impact of these parameters on the results. These parameters must be well chosen by the user to optimize the results. In particular, when applying IForest to a data stream with a concept drift, the input parameters should ideally adapt automatically to the varying characteristics of the stream. We study here the impact of these parameters on the efficiency of IForest.

The execution time of IForest is related to the number of samples (or trees) t and to the size of each sample ψ. In this part, we are interested in the impact of ψ. We have therefore set t to its default value (100 trees) in these experiments. […] In figure 10, the study was carried out on the 4 datasets […]

IForest is based on a random forest of several random and independent binary trees. Their independence comes from the fact that each tree is built on the basis of a single random sample of the same size. All trees participate equally in the decision regarding the classification of a data item as normal or abnormal. The number of trees to be built in the forest is an input parameter of the method, and the memory requirement is closely related to it. In this subsection, we conducted experiments to assess the impact of the number of trees on the performance of IForest. For that, we used different values for the parameter t and set the sample size to its default value, ψ = 256.

As expected, when the number of trees increases, the execution time becomes longer, which corresponds to the fact that all the trees are created during the learning phase and that, during the test phase, each data item must pass through each tree of the forest for decision-making. As can be seen in figure 11, considering the ROC AUC, the specificity and the recall, we notice that the number of trees does not have a great impact on these results. Indeed, beyond a certain number of trees, the values converge while the execution time continues to increase linearly. We can therefore conclude that, above a given threshold, the number of trees does not have a great impact on the performance of IForest in terms of anomaly detection. The authors' recommendation to generate a collective decision based on the collaboration of t = 100 trees seems to be a good compromise between the execution time and the quality of the decision. […]

The data are often multivariate, describing the variation of at least two observables over time. In order to evaluate the performance of IForest according to the number of dimensions in the dataset, we used the two- and three-dimensional synthetic datasets presented in section V-B. The particularity of these datasets is that the anomalies are carried by several dimensions at the same time; it is therefore necessary to consider several dimensions to detect the anomalies. When the anomalies are carried by several dimensions at the same time (figure 13), IForest is less efficient. Anomalies are only detected in these cases when the difference in density and the distance between the normal data and the anomalies are very clear. IForest always raises a lot of false alerts, especially on the borders of the normal data.
Several real anomalies were not detected due to the split process of the nodes in the learning phase of IForest. Indeed, IForest performs splits by randomly choosing an attribute; each split is therefore done with only one attribute at a time. However, considering only one dimension, an abnormal data item seems normal, because it lies in the same range of values as the normal data. […]

• when s(x, n) is much less than 0.5, then x is normal.

In practice, the anomaly decision is taken when the score is greater than 0.5. But this rule can cause false alarms or false negatives, because the optimal decision threshold is not always equal to 0.5. In order to illustrate this observation, […] of the normal data has a score between 0.5 and 0.6. It is clear that the threshold score of 0.6 is much more suitable than 0.5 for the two datasets. With a score threshold of 0.6, for the Synthetic_2 dataset, where the normal data density is low, only some abnormal data were detected. The distributions show that truly abnormal data have a path longer than the depth threshold and a score below the score threshold (0.6). On the other hand, with the Synthetic_3 dataset, all anomalous data have a score above the decision threshold. Furthermore, comparing Synthetic_2 and Synthetic_3, we can notice that a larger distance between normal and abnormal data leads to a smaller path length for anomalies. This means that abnormal data are quickly isolated when they are very different from normal data. The difference in density between normal and abnormal data therefore has a considerable impact on the detection efficiency of IForest and could be taken into consideration when choosing the detection threshold.

In this section, we studied IForest from different angles. The random construction of the forest and the independent itrees represent a key idea of IForest and enable a robust decision. For the choice of the input parameters, we found that using a large number of trees does not really improve the ability of IForest to detect anomalies, but it does increase the execution time; using about 100 trees seems to be a good compromise. Beyond a given threshold, the sample size increases the execution time of IForest and decreases the anomaly detection performance. Thus, IForest can be improved by establishing the optimal sample size according to the dataset characteristics. The performed experiments also show that the optimal decision threshold is difficult to fix, as it depends on the similarity between normal and abnormal data according to the IForest trees. We propose in the next section Majority Voting IForest, an extension of IForest improving its execution time.

IForest identifies anomalies based on a collective decision produced by all the trees built during the training phase. Indeed, each tree i participates in the decision-making through the path h_i(x) of each data item x. The average path of a data item is used in the calculation of its score as follows: s(x, n) = 2^(−E(h(x)) / C(n)). This formula implies that it is necessary to […] The metrics used for these evaluations are described in V-C.

One can notice that MVIForest gives results similar to IForest in terms of ROC AUC. However, MVIForest is always faster than IForest, with an execution time shortened by 35% on average. The best result is obtained with the HTTP dataset, the largest dataset considered, where the execution time of MVIForest represents only 60% of the execution time of IForest. When anomalies are obvious and easy to detect, with a large distance to the normal data and a very different density, MVIForest can save up to 50% of the execution time of the test phase. However, when most of the data lie in the area of uncertainty between the anomalies and the normal data, the execution time is almost the same.

[…] For all these datasets, IForest and MVIForest give quite similar results in terms of detection. Moreover, they both generate false alarms by classifying the normal data at the border as abnormal. As explained above, this is due to the way the nodes are split: IForest randomly selects an attribute at each split. This problem has been corrected by EIF through the introduction of hyperplanes. As can be noticed, on the same datasets, EIF did not raise any false alarm and achieved a maximum specificity (always equal to 1 in Table 7). EIF correctly classifies all normal data; however, it misses abnormal data when a small distance separates them from the normal data. In such a configuration, EIF produces a very low recall (null for Synthetic_2).
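The early-interruption idea behind the speed-up can be sketched as follows (our own reading of the majority-voting decision; the per-tree depth threshold and names are our assumptions, not the exact MVIForest implementation):

```python
def mv_classify(x, trees, depth_threshold, path_length):
    """Each tree votes 'anomaly' when x's path is shorter than
    depth_threshold; stop as soon as either class holds a strict
    majority, skipping the remaining trees."""
    t = len(trees)
    votes_anomaly = votes_normal = 0
    for tree in trees:
        if path_length(x, tree) < depth_threshold:
            votes_anomaly += 1
        else:
            votes_normal += 1
        # Early interruption: the remaining trees cannot change the outcome.
        if votes_anomaly > t // 2:
            return "anomaly"
        if votes_normal > t // 2:
            return "normal"
    return "anomaly" if votes_anomaly > votes_normal else "normal"
```

For obvious anomalies, a strict majority is reached after roughly half the trees, which is consistent with the savings of up to 50% of the test-phase time reported above.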

Furthermore, considering all data dimensions at the same time when splitting nodes makes EIF slower than IForest, as the execution time measurements show. When the application requires speed and efficiency, MVIForest would be a wise choice. However, depending on the objective, one can choose EIF. EIF classifies normal data better but may miss some anomalies, while MVIForest is better at quickly detecting anomalies but may generate false alarms. The application constraints as well as the context have to guide the choice of the most suitable method.
Figure 17 presents an exploration of the path depth distribution of all data, with IForest and EIF, for the Synthetic_5 dataset. We focused on this dataset because no anomalies were detected in it. MVIForest was not considered in this experiment because with MVIForest, each data item does not have a unique average score, but many scores given by the different trees used. We recall that an anomaly is characterized by a high score, resulting from a shallow average path in the trees of the forest. Starting from the threshold score (0.5), the corresponding depth threshold can be derived. Moreover, because of the high density of abnormal data, they could not be isolated early enough in the forest. They obtained depths greater than the threshold (see Figure 17) for all methods. We performed the same experiments on the 3-dimensional datasets.
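The conversion from a score threshold to a depth threshold follows directly from the score formula: setting s(x, n) = 0.5 in 2^(-d/c(n)) = s and solving for d gives d = c(n). A small sketch of this derivation (our own illustration, assuming the standard c(n) normalisation term):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalisation term: average path length of an unsuccessful
    search in a binary search tree built on n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def depth_threshold(sample_size, score_threshold=0.5):
    """Solve 2 ** (-d / c(n)) = s for d: d = -c(n) * log2(s)."""
    return -c(sample_size) * math.log2(score_threshold)

# With the default score threshold of 0.5, the depth threshold is
# exactly c(n): data isolated deeper than this are deemed normal.
print(round(depth_threshold(256), 2))  # → 10.24
```

This is why dense clusters of anomalies are missed: their average isolation depth exceeds c(n), so their score falls below 0.5.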

Isolation Forest is one of the best methods for detecting anomalies. It is fast, accurate and does not require huge resources compared to other techniques such as clustering or nearest neighbors. In this paper, we carried out a state of the art of isolation-based anomaly detection methods. Most of them are improvements of IForest: each new method addresses a limit of IForest or adapts it to a new context. We highlighted some weaknesses of IForest not addressed by these improvements, notably the choice of input parameters and the impact of the characteristics of the datasets (number of significant dimensions, density of normal and abnormal data, etc.). We tested IForest and its extended version (EIF) on several real and synthetic datasets to illustrate the weaknesses we identified. We then proposed an improvement of IForest changing the way it makes decisions. This new version, which we called MVIForest (Majority Voting IForest), is faster than IForest, since its execution is interrupted as soon as a majority decision is possible, without requesting all the trees.

1186
In future work, we will compare the proposed MVIForest to other existing anomaly detection methods. Despite the performance of the studied isolation-based methods, they are not adapted to the streaming context, and isolation-based anomaly detection in data streams has not been sufficiently explored in the literature. We will therefore focus on the version of IForest adapted to data streams, IForestASD, and propose a distributed version of IForestASD for better performance and higher scalability.