Similarity Based Feature Transformation for Network Anomaly Detection

The fundamental objective of any network intrusion detection system is to automate the detection process whenever intrusions occur in the network. The problem of network anomaly detection is to determine whether incoming network traffic is legitimate or anomalous. Automated detection systems designed to identify anomalous incoming traffic patterns usually apply widely used machine learning techniques. However, irrespective of the system model developed to identify anomalous traffic, all such models require comparing anomalous and normal traffic patterns. Such comparisons implicitly depend on the ability of the underlying machine learning model to gauge the similarity between a known legitimate observation and the target. The efficiency of any network anomaly detection system therefore depends on the choice of distance (or similarity) measures and on how they are applied. A novel distance function that can be applied to determine the similarity between two conditional feature pattern vectors is an important contribution of the present research. Feature dimensionality is another important issue for any machine learning algorithm. In the present work, feature reduction is achieved using the proposed feature transformation technique, which uses the proposed Gaussian distance function to achieve dimensionality reduction and to represent the original input dataset in the new transformation space. We also propose new computation expressions for determining the equivalent deviation and threshold in Gaussian space. Experiments are performed on the KDD and NSL-KDD datasets using classifier algorithms that are widely applied in state-of-the-art research contributions. For performance validation of the machine learning models, k-fold cross validation is applied with k = 10, using accuracy, precision and recall as evaluation parameters.
Experimental results show that our approach for anomaly detection, which applies the proposed feature transformation technique, performs comparatively better than the detection methods CANN, GARUDA, and UTTAMA addressed in the recent research literature.


I. INTRODUCTION
The fundamental purpose of any network anomaly detection system is to precisely and methodically detect diverse types of malicious traffic patterns that may not be detected by conventional firewall systems. Designing a powerful intrusion detection system poses three essential challenges.
The associate editor coordinating the review of this manuscript and approving it for publication was Juan A. Lara.
These three challenges are (i) addressing the high dimensionality of input observations, (ii) applying an appropriate machine learning technique that does not suffer from issues such as overfitting and underfitting, and (iii) choosing an appropriate distance measure (or similarity measure) to gauge the similarity between any two network observations. Feature selection [1], feature representation [19], [20] and dimensionality reduction approaches [21]-[23] have been studied extensively in research contributions related to text classification, data fusion, image fusion, medical data classification and various other machine learning and data mining applications. Feature reduction techniques have also been applied to the design of intrusion detection systems (IDS) [19], [20]. Several studies have also examined how to choose the right classifier and apply it to build efficient network intrusion detection systems (NIDS) [1]. The performance of an NIDS is implicitly related to the choice of distance measures [18], [19] that the IDS applies to decide whether an incoming observation is normal or abnormal. Relatively little effort has been made by researchers to devise new distance functions [19], [20] that can be applied by NIDS for efficient intrusion and anomaly detection.
Recent studies such as CANN [23], CLAPP [20], and UTTAMA [24] have applied feature reduction techniques to improve the accuracy and detection rates of IDS. The distance measure applied by CANN is the Euclidean distance function. The CLAPP and UTTAMA approaches apply membership functions for the learning process. However, these studies did not propose novel similarity measures for carrying out unsupervised feature learning and supervised learning tasks. Although CANN [23] reduced the time consumed by classifiers, its detection accuracies for the U2R and R2L classes are not promising; for example, they are almost zero for CANN. Although CLAPP and UTTAMA attempted to improve the detection accuracies of the U2R and R2L attack classes, these approaches were limited to applying membership functions. The contribution of our paper is mainly motivated by these studies.
The fundamental objective of the present work is to address the challenge of detecting U2R and R2L attacks with higher accuracy, precision and recall by obtaining an equivalent representation of the original dataset through projecting it onto a new transformation space. Another important aim of the present study is to recommend a novel distance measure that can be used to perform similarity computations for feature clustering, feature representation, and supervised learning for efficient intrusion detection.
The organization of the paper is as follows. Section II summarizes the state-of-the-art research contributions that motivate the proposed work. Section III describes the proposed approach and the algorithms for feature transformation and supervised learning. Section IV outlines the experimental results obtained using both the proposed and existing methods. Finally, Section V summarizes the important findings and concludes the paper.

II. BACKGROUND AND MOTIVATION
The distance function introduced in this paper is motivated by several state-of-the-art research contributions that have proposed distance functions for text classification, temporal pattern mining, software component classification and medical data classification. Distance measures and similarity measures are widely applied in data mining and machine learning algorithms that require distance (or similarity) computations as part of their processing. One of the recent contributions that motivated the present work is that of Jiang et al. [1], which suggests an approach for reducing the dimensionality of feature vectors for text classification. For similarity computation between feature vectors, Jiang et al. [1] proposed a fuzzy Gaussian function that is applied to self-construct feature clusters. Another important recent contribution by Jiang et al. [2], [3] is the Gaussian text similarity measure for text classification. The similarity measure proposed in [2], [3] takes into consideration the effect of feature deviation on text features to best estimate the degree of similarity between text document vectors. The feature similarity and text similarity functions designed in [1]-[3] are based on non-binary feature vectors, whereas the feature similarity functions proposed in [4], [5], [9] are based on binary representations of feature vectors. Another interesting similarity measure is the Gaussian-based temporal similarity measure proposed by Chen et al. [6], [7] to uncover the similarity between temporal patterns on time-interval data. The text feature vector dimensionality problem was recently addressed by Suresh et al. [8], motivated by [1]-[3]. Motivated by the text similarity function of [8], similarity functions for measuring software component similarity (which are based on binary feature vectors) were proposed by Vangipuram et al. [9]. Similarity measures to compute temporal similarity in Z-SPACE and Gaussian space were proposed by Vangipuram et al. [10]-[16]; these measures require equivalent deviation and equivalent threshold values to be determined in order to compute similarity in the new transformation space. Another contribution is the imputation measure MANTRA [17], suggested to find the similarity between complete and incomplete medical records for medical data classification.
The efficacy of a network intrusion anomaly detection algorithm depends on the use of appropriate distance and similarity measures, which are applied to compute the similarity of new incoming observations (not present in the training set) to the available observations in the trained knowledge base. A recent survey by Weller-Fahy et al. [18] found that many research studies related to network intrusion anomaly detection have not documented the measures applied by the machine learning algorithms in the published research. Relatively little literature is available on the similarity measures applied in research contributions addressing the problem of network intrusion anomaly detection. The detailed study by Weller-Fahy et al. [18] provides a complete overview of the similarity measures used within the field of NIAD (network intrusion anomaly detection) research. The fundamental idea behind network intrusion detection, or the design of any NIDS (network intrusion detection system), is to automate the detection process whenever intrusions occur in the network. Thus, the problem of intrusion detection [18] may be viewed as a subproblem within NAD (network anomaly detection). Hence, the idea behind network anomaly detection is to determine whether incoming network traffic is legitimate or anomalous. Automated detection systems designed to identify anomalous incoming traffic patterns usually apply widely used machine learning techniques such as supervised or unsupervised learning. However, irrespective of the system model developed to identify anomalous traffic, all such models require comparing anomalous and normal models [18], [19]. Such comparisons implicitly depend on the ability of the underlying machine learning model to gauge the similarity between a known legitimate observation and the target. This means that the efficiency of a network anomaly detection system depends on the choice of similarity measures and on how they are applied.
VOLUME 8, 2020
An important contribution to the NIDS research literature is that of Aljawarneh et al. [19], in which a distance function is introduced to perform feature clustering. These feature clusters are used to achieve dimensionality reduction. Fuzzy membership functions are proposed by Gunupudi et al. [20], [21] for feature clustering. The membership functions proposed in [20], [21], [22] are applied to obtain the similarity between feature pattern vectors for anomaly detection. An intrusion detection system named CANN, proposed by Lin et al. [23], is a recent state-of-the-art contribution that combines cluster center information with nearest neighbor information to define a new one-dimensional distance. Although CANN aims at addressing time efficiency and space efficiency, its accuracies for U2R and R2L attacks are not favorable. For example, using CANN [23] with the KNN classifier (k = 1), the accuracies of the U2R and R2L classes on the KDD dataset with 19 attributes are 3.85% and 57.02%, respectively. Also, from the experimental analysis in [23], the accuracies of KNN (K = 1) for the U2R and R2L classes on the KDD dataset with 19 attributes are 17.31% and 91.74%. Similarly, for the SVM classifier (degree 2), the accuracies of the U2R and R2L attack classes are 61.54% and 78.95%, respectively. The overall accuracy obtained using CANN (K = 1) is 99.46%, which is slightly lower than the 99.89% of KNN (K = 1). Thus, the challenge in designing new intrusion detection techniques, approaches and algorithms is essentially to improve the accuracies of low-frequency classes such as the U2R and R2L classes in the KDD dataset. Another recent contribution that proposes an approach for anomaly detection is UTTAMA [24]. UTTAMA applies a fuzzy membership function for similarity computation and feature transformation. The overall accuracy of UTTAMA on the KDD dataset with 19 attributes is 99.89% when the J48 classifier is applied for classification. Compared to CANN (K = 1), UTTAMA (J48) achieves better accuracies for the low-frequency attack classes. Aljawarneh et al. [25] applied feature selection on the NSL-KDD dataset and reported an accuracy of 99.7%. A recent survey on intrusion detection techniques discusses various issues in designing an efficient intrusion detection system and some of the state-of-the-art contributions [18], [26].
A machine learning approach, PAREEKSHA, is proposed by Nagaraja et al. [27] for intrusion detection. Its membership function has its basis in contributions [1], [2]. On similar lines, [28] also proposed a membership function for the detection of low-frequency attacks. Network intrusion detection is a challenging task, and it becomes even more challenging for machine learning algorithms when low-frequency observations have to be detected with high accuracy while overcoming challenges such as overfitting and underfitting. Classifier algorithms employed to detect low-frequency attacks often do not perform well because the dataset contains few instances of those classes. Cross-fold validation is usually applied to evaluate classifier performance and to validate machine learning models. Thus, improving classifier accuracies for low-frequency classes is an important challenge that demands immediate attention from researchers. Conditional probability [1] can be used to derive hidden information and knowledge about the relationship between features and dataset class labels. Such information may later be used to carry out unsupervised learning [19], [24].
The present research contribution is motivated by all of the state-of-the-art contributions discussed above. It has been observed that there is scope for devising new similarity and distance functions that can be applied by detection systems to achieve better classification and detection rates. The next section describes the proposed method, which is based on feature clustering for feature transformation.

III. PROPOSED METHOD
This section outlines the proposed method for anomaly detection. Our approach extends the recent contribution by Vangipuram et al. [19] by proposing a new distance function that also considers the feature distribution to determine the similarity between observations. Novel computation expressions to obtain the equivalent deviation and threshold values in Gaussian space are also proposed. The computed deviation value is used in the similarity function to carry out the similarity computation. The basic idea is to represent the dataset in a new dimensionality space to improve classification and detection rates. Our method involves three tasks, as outlined in [19]: (i) feature clustering based on the proposed Gaussian distance function, (ii) dimensionality reduction by feature reduction, and (iii) applying a machine learning algorithm that uses the first two outcomes. Algorithms for these three tasks are outlined below.
Iterative index variable: varies from 1 to m
g: number of clusters, initially g = 0
$C_g$: the g-th cluster
Begin
1. Read the allowable dissimilarity value, $\delta_U$.
2. Determine the initial deviation value, $\sigma_f$, and the transformation threshold value, $\delta_f$, using Eq. (8) and Eq. (11) respectively.
3. Choose the first feature pattern (say $\vec{f}_1$). Initially, g = 0. Generate the first cluster by placing $\vec{f}_1$ in it. Set g = g + 1 and call the cluster $C_g$. Now, $C_g$ contains only $\vec{f}_1$.
4. Initialize the mean and deviation of the generated cluster (initially for the first cluster and then for other generated clusters). The mean of the first cluster is an m-dimensional vector and is the same as the first feature pattern, $\vec{f}_1$.
5. If no other feature pattern exists then go to step 10, else go to step 6.
6. Choose a feature pattern that is not yet clustered, say $\vec{f}_p$. Determine the distance between this feature pattern ($\vec{f}_p$) and the mean of each existing cluster.
7. If the similarity condition is satisfied for an existing cluster, add $\vec{f}_p$ to that cluster and go to step 8; else set g = g + 1, create a new cluster, call it $C_g$, and repeat the process in step 4.
8. Update the mean vector of the cluster after adding the feature pattern to it. The new mean is the average of the feature patterns in the cluster.
9. Go to step 5.
10. When incremental clustering halts, g clusters and their respective mean vectors have been generated.
11. Compute the standard deviation vector of each generated cluster by considering only the feature patterns that belong to that cluster.
12. Update the deviation of the final clusters. The final deviation vector of each generated cluster is the sum of the initially chosen deviation and the respective deviation computed in step 11.
End

Let $F_i$ symbolize the i-th feature in the feature set F, and let $f_{io}$ symbolize the value of feature $F_i$ in the o-th observation. The representation $\vec{X}_i$ symbolizes the feature pattern vector corresponding to feature $F_i$. Our approach requires computing the feature pattern vector for every feature $F_i$ present in the feature set F. As mentioned already, |d| symbolizes the dimensionality of the feature pattern vector. We represent the feature pattern vector using equation (4), where $X_{id}$ symbolizes the probability that feature $F_i$ belongs to the class $D_d$.
The element value $X_{id}$ in equation (4) can be obtained by applying equation (5). In equation (5), $f_{ji}$ symbolizes the j-th feature value in the i-th observation of the observation matrix. The value of $M_{jd}$ is 1 if the j-th feature, symbolized by $F_j$, belongs to class label $D_d$, and 0 if $F_j$ does not belong to class label $D_d$. The next subsection gives the proposed distance function to find the similarity between any two feature conditional probability vectors.
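The computation of the feature pattern vectors can be illustrated with a short Python sketch. Equations (4) and (5) are not reproduced in this text, so the sketch assumes the conditional-probability form used in the feature clustering work of [1]: the value of a feature summed over the observations of one class, divided by its value summed over all observations. The function name `feature_pattern_vectors` and the toy data are hypothetical and serve only as an illustration.

```python
def feature_pattern_vectors(observations, labels, classes):
    """Return one conditional-probability vector per feature.

    observations: list of rows, each a list of non-negative feature values
    labels:       class label of each row
    classes:      ordered list of the distinct class labels
    Assumed form: X[i][d] = sum of feature i over class-d rows
                            / sum of feature i over all rows.
    """
    n_features = len(observations[0])
    patterns = []
    for i in range(n_features):
        total = sum(row[i] for row in observations)
        vector = []
        for cls in classes:
            # Sum feature i only over the observations labelled with this class.
            in_class = sum(row[i] for row, lab in zip(observations, labels)
                           if lab == cls)
            # A feature that is zero everywhere gets an all-zero pattern.
            vector.append(in_class / total if total > 0 else 0.0)
        patterns.append(vector)
    return patterns


# Toy data: three observations, two features, two classes.
observations = [[1, 0], [3, 2], [0, 2]]
labels = ["normal", "attack", "attack"]
patterns = feature_pattern_vectors(observations, labels, ["normal", "attack"])
```

Under this assumption each pattern vector has one element per class, and its elements sum to 1 whenever the feature is non-zero in at least one observation.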

E. PROPOSED DISTANCE FUNCTION
This subsection gives the computation expression of the proposed feature distance function, which can be applied to determine the similarity between any two feature pattern vectors and input observations. The condition for considering two feature pattern vectors as similar is stated below.
Similarity Condition: Given two conditional probability vectors $\vec{X}_p$ and $\vec{X}_q$, the vectors are similar if and only if the distance obtained using the distance function satisfies $F_{dist}(\vec{X}_p, \vec{X}_q) \le \delta_U$.
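The incremental clustering steps of this section, combined with the similarity condition, can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the distance function is passed in as a parameter (`dist`), the population standard deviation is assumed in step 11, and the names `incremental_feature_clustering` and `mean_abs_diff` are hypothetical. The stand-in distance used in the usage example is not the proposed Gaussian distance.

```python
import math


def incremental_feature_clustering(patterns, dist, threshold, initial_dev):
    """Cluster feature pattern vectors one at a time.

    patterns:    list of equal-length vectors
    dist:        callable(vector, vector) -> distance
    threshold:   a pattern joins the nearest cluster if distance <= threshold
    initial_dev: deviation added to every cluster's computed deviation
    """
    clusters = []  # each cluster: {"members": [...], "mean": [...], "dev": [...]}
    for p in patterns:
        # Find the existing cluster whose mean is nearest to p.
        best, best_d = None, None
        for c in clusters:
            d = dist(p, c["mean"])
            if best_d is None or d < best_d:
                best, best_d = c, d
        if best is not None and best_d <= threshold:
            # Similarity condition holds: add p and refresh the mean.
            best["members"].append(p)
            m = len(best["members"])
            best["mean"] = [sum(col) / m for col in zip(*best["members"])]
        else:
            # Otherwise p starts a new cluster whose mean is p itself.
            clusters.append({"members": [p], "mean": list(p)})
    for c in clusters:
        m = len(c["members"])
        # Final deviation = initial deviation + per-dimension std of members
        # (population standard deviation, an assumption of this sketch).
        c["dev"] = [
            initial_dev + math.sqrt(sum((x - mu) ** 2 for x in col) / m)
            for col, mu in zip(zip(*c["members"]), c["mean"])
        ]
    return clusters


def mean_abs_diff(a, b):
    """Stand-in distance for the demo only (not the proposed function)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


demo = incremental_feature_clustering(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], mean_abs_diff, 0.2, 0.5
)
```

In the demo the first two patterns fall into one cluster and the third starts a new one, so the number of clusters, and hence the reduced dimensionality, is determined by the threshold rather than fixed in advance.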

1) PROPOSED DISTANCE FUNCTION
Suppose $\vec{X}_p$ and $\vec{X}_q$ are any two conditional probability vectors (i.e., feature pattern vectors), and let $\delta_U$ symbolize the distance threshold. Let m be the dimensionality of the probability vector. Then $\vec{X}_p$ and $\vec{X}_q$ can be represented as $\vec{X}_p = (X_{p1}, X_{p2}, X_{p3}, \ldots, X_{pm})$ and $\vec{X}_q = (X_{q1}, X_{q2}, X_{q3}, \ldots, X_{qm})$. Each element $X_{pi}$ or $X_{qi}$ of these vectors is a posterior probability value, so that $X_{pi}, X_{qi} \in [0, 1]$.
The distance between any two conditional probability vectors $\vec{X}_p$ and $\vec{X}_q$ can be obtained using the proposed distance function, given by equation (6) with $\alpha = 0.3679$. Eq. (7) represents the average fuzzy similarity value between $\vec{X}_p$ and $\vec{X}_q$. The parameter $\sigma_f$ used in Eq. (7) is the standard deviation value, which can be obtained by applying Eq. (8).
The expression for computing the deviation is given by Eq. (8), where $\delta_U$ is the allowable dissimilarity, chosen between 0 and 1, and $\alpha = 0.3679$.

2) EXPRESSION FOR GAUSSIAN DISTANCE THRESHOLD
We know that $\delta_U$ represents the distance threshold between the vectors $\vec{X}_p$ and $\vec{X}_q$ in Euclidean space. Our approach requires computing the new deviation value for the Gaussian space. The deviation for the new transformation space can be derived by considering single-dimension vectors. In this case, for any given dimension (say, the i-th dimension), the distance between $X_{pi}$ and $X_{qi}$ is given by Eq. (9). The distance between $\vec{X}_p$ and $\vec{X}_q$ using the proposed distance function is then given by Eq. (10). Using Eqs. (9) and (10), the distance threshold for the new transformation space is given by Eq. (11). In Eq. (10) and Eq. (11), $\alpha$ is 0.3679.

F. DERIVATION OF PROPOSED FEATURE PATTERN SIMILARITY FUNCTION
Consider the two conditional probability feature pattern vectors $\vec{X}_p = (X_{p1}, X_{p2}, X_{p3}, \ldots, X_{pm})$ and $\vec{X}_q = (X_{q1}, X_{q2}, X_{q3}, \ldots, X_{qm})$. Each element $X_{pi}$ or $X_{qi}$ of these vectors is a posterior probability value, so that $X_{pi}, X_{qi} \in [0, 1]$.
The membership value of $\vec{X}_p$ to $\vec{X}_q$ for the i-th feature dimension, i.e., for $\vec{X}_p = (X_{pi})$ and $\vec{X}_q = (X_{qi})$, can be obtained by applying the basic Gaussian membership function, as given by equation (12). The normalized membership value of the feature pattern vector $\vec{X}_p$ to $\vec{X}_q$, considering all m dimensions, may be obtained using equation (13).

Substituting the expression for $\mu_i(X_{pi}, X_{qi})$ given by Eq. (12) into the expression for the normalized membership value given by Eq. (13), we obtain the normalized membership value (also called the average membership value) given by Eq. (14). However, Eq. (14) cannot be taken as the similarity value, because it defines only the average membership value (or average similarity) between the pattern vectors. The similarity must therefore be defined by another function. To achieve this, we define the similarity function given by Eq. (15) to compute the similarity between $\vec{X}_p$ and $\vec{X}_q$, where $\alpha$ is a constant. The value of $\alpha$ can be obtained analytically by considering the lowest and highest possible similarity values. Two cases are considered to determine the value of $\alpha$: (i) the best case and (ii) the worst case.

1) BEST CASE
In the best case, the similarity between $X_{pi}$ and $X_{qi}$ is unity. So, the similarity between $\vec{X}_p$ and $\vec{X}_q$ is given by

2) WORST CASE
In the worst case, the similarity between $X_{pi}$ and $X_{qi}$ is exactly (or almost) equal to zero for all i from 1 to m. The distance is hence equal to unity (which is the maximum). Using Eq. (15), the similarity between $\vec{X}_p$ and $\vec{X}_q$ in the worst case is zero, which means that $0.3679 + \alpha = 0$. This gives $\alpha = -0.3679$. Now consider the expression for similarity given by Eq. (15). Substituting $\alpha = -0.3679$ into Eq. (15) gives Eq. (19). Rationalizing the numerator and denominator of Eq. (19), we finally get Eq. (20). Eq. (20) may be rewritten as Eq. (21), where $\alpha = 0.3679$.
Using this relation, Eq. (22) gives the computation expression for the distance between $\vec{X}_p$ and $\vec{X}_q$. Hence proved.
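The best-case and worst-case analysis can be checked numerically with a short sketch. The equations themselves are not reproduced in this text, so the functional forms below are assumptions chosen to agree with the stated boundary conditions: a per-dimension Gaussian membership $\exp(-((X_{pi} - X_{qi})/\sigma_f)^2)$, its average over the m dimensions, a similarity of the form $(\mu - \alpha)/(1 - \alpha)$, and a distance equal to one minus the similarity, with $\alpha = e^{-1} \approx 0.3679$. The function names are hypothetical.

```python
import math

ALPHA = math.exp(-1)  # approximately 0.3679, the constant quoted in the text


def average_membership(xp, xq, sigma):
    """Mean Gaussian membership over all dimensions (assumed form)."""
    return sum(math.exp(-((a - b) / sigma) ** 2) for a, b in zip(xp, xq)) / len(xp)


def gaussian_similarity(xp, xq, sigma):
    """Assumed similarity: 1 for identical vectors, 0 when the
    average membership drops to ALPHA."""
    return (average_membership(xp, xq, sigma) - ALPHA) / (1 - ALPHA)


def gaussian_distance(xp, xq, sigma):
    """Assumed distance: one minus the similarity above."""
    return (1 - average_membership(xp, xq, sigma)) / (1 - ALPHA)


# Best case: identical pattern vectors.
best = gaussian_distance([0.2, 0.8], [0.2, 0.8], 1.0)
# Worst case: every dimension differs by 1, with sigma = 1.
worst = gaussian_distance([1.0, 0.0], [0.0, 1.0], 1.0)
```

With `sigma = 1` and a per-dimension difference of 1, the average membership equals $e^{-1}$, which is where the value 0.3679 in the worst-case condition would come from under these assumptions. For other values of `sigma` the sketch does not clamp the distance to the unit interval.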

IV. EXPERIMENTAL EVALUATION
All the experiments discussed in this section are conducted on a DELL INSPIRON 15 5000 series machine with 32 GB RAM and an Intel CORE i5 7th generation CPU. For the experimental analysis of the proposed machine learning method, we consider two widely used benchmark datasets: (i) the KDD dataset, in versions with 41 and 19 attributes, and (ii) the NSL-KDD dataset with 41 attributes. Feature transformation is one of the most important preprocessing techniques for improving a classifier's overall performance [19], [20]. By feature transformation, we mean that the attributes (or features) of the input dataset are projected onto another dimensionality space. The proposed feature transformation approach is based on generating feature clusters by considering the belongingness of attributes to the various class labels of the input dataset [1], [19]. The number of clusters generated by the proposed feature transformation technique is the dimensionality of the transformed input dataset. In our approach, we first cluster the attributes of the dataset into a finite number of clusters. From these clusters, representative elements such as the mean and deviation are obtained. Using these representative elements, a matrix called the soft transformation matrix is obtained. The soft transformation matrix gives the similarity of each feature to each of these clusters. This soft matrix is used to obtain the dimensionality-reduced input dataset, a matrix that is an equivalent representation of the original matrix. Note that in the original form each observation is a function of attributes, whereas in the transformed representation each observation is expressed in terms of feature clusters.
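The projection described above can be sketched in a few lines of Python. The exact expressions for the soft transformation matrix are not given in this text, so the sketch assumes that the similarity of a feature to a cluster is the average Gaussian membership of the feature's pattern vector with respect to the cluster mean and deviation, and that the reduced dataset is the product of the observation matrix with the soft matrix. The names `soft_transformation_matrix` and `transform`, and the toy values, are hypothetical.

```python
import math


def soft_transformation_matrix(patterns, clusters):
    """T[i][c]: assumed similarity of feature i to cluster c.

    patterns: one pattern vector per original feature
    clusters: list of (mean_vector, deviation_vector) pairs
    """
    matrix = []
    for p in patterns:
        row = []
        for mean, dev in clusters:
            # Per-dimension Gaussian membership, averaged over dimensions.
            memberships = [math.exp(-((x - m) / s) ** 2)
                           for x, m, s in zip(p, mean, dev)]
            row.append(sum(memberships) / len(memberships))
        matrix.append(row)
    return matrix


def transform(observations, matrix):
    """Project n-feature observations onto g cluster dimensions (D' = D x T)."""
    n_clusters = len(matrix[0])
    return [
        [sum(row[i] * matrix[i][c] for i in range(len(row)))
         for c in range(n_clusters)]
        for row in observations
    ]


# Toy example: three features, two clusters, one observation.
patterns = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
clusters = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [0.5, 0.5])]
T = soft_transformation_matrix(patterns, clusters)
reduced = transform([[2.0, 3.0, 4.0]], T)
```

In this toy example an observation with three attributes is rewritten in terms of two cluster dimensions, which mirrors how the KDD observations are re-expressed in terms of feature clusters.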
The transformed dataset is then supplied as input to various classifiers, namely (i) the Naïve Bayes classifier, (ii) the BayesNet classifier, (iii) the SMO classifier, (iv) the J48 decision tree classifier and (v) the KNN (k-Nearest Neighbors) classifier, using the k-fold cross validation resampling technique to evaluate the performance of the machine learning model. The evaluation parameters considered are accuracy, precision and recall.
In this subsection, we discuss the experimental results obtained with the equivalent dimensionality-reduced input dataset that results from carrying out feature transformation on the KDD-Cup dataset, which has 494021 observation instances defined over a feature set of 41 attributes. For all experiments, the similarity threshold is set to 0.9995 and the initial deviation is set to 0.5. To evaluate the performance of the model, k-fold cross validation is used with k = 10. The result of feature transformation is 35 clusters, which means that each observation in the input dataset is now represented in terms of these 35 clusters. Experiments are conducted with the five classifiers listed above. Figure 1 shows the J48 classifier confusion matrix obtained on the dataset resulting from feature transformation. The classifier accuracies obtained are 99.97% for the Normal class and 99.99% for the U2R, DoS, R2L and Probe classes. The percentage of correctly classified instances with the J48 classifier is 99.967%.
A simple analysis of the confusion matrix shows that the precision values for the Normal class and the attack classes (U2R, DoS, R2L, Probe) are 99.89%, 68.42%, 99.997%, 98.27% and 99.44%, respectively. Similarly, the respective recall values of the Normal, U2R, DoS, R2L and Probe classes are 99.95%, 50%, 99.99%, 95.82% and 99.34%. The accuracy, precision and recall values obtained for the J48 classifier on the transformed dataset representation show the importance of the proposed approach. Figure 2 depicts the J48 classifier accuracies before and after the proposed feature transformation technique. It is visible from Figure 2 that the accuracies of the KDD classes (i.e., Normal, U2R, DoS, R2L, and Probe) obtained on the transformed dataset have improved compared to the accuracies obtained on the original dataset with 41 attributes (i.e., without feature transformation). After feature transformation, the dimensionality is reduced to 35 attributes. This shows the importance of the proposed feature transformation technique. Thus, it is inferred from this experiment that feature transformation improves the accuracy of each class of the KDD dataset.
Figure 3 depicts the J48 classifier precision values before and after the proposed feature transformation technique. It is visible from Figure 3 that the precision values of the Normal, DoS and Probe classes are the same before and after feature transformation. However, after feature transformation is applied, there is considerable improvement for the U2R and R2L attack classes: the precision for the U2R attack class improves from 58% to 68%, and that for the R2L attack class from 97% to 98%. This experiment once again indicates the importance of the proposed approach. Figure 4 depicts the J48 classifier recall values (in percent) before and after the proposed feature transformation technique. It is visible from Figure 4 that the recall values of the Normal, DoS and Probe classes are the same before and after feature transformation. However, after feature transformation is applied, there is a considerable improvement in four classes of the KDD dataset, the exception being the R2L attack class. The recall for the R2L attack class is 95.8% after feature transformation, whereas it is 96% before feature transformation is applied. For the U2R attack class, the recall increases from 36.53% to 50%. Hence, it is inferred that with the proposed approach the accuracy, precision and recall values are, overall, better than those obtained on the KDD dataset without feature transformation for the J48 classifier.
Figure 5 shows the KNN (K = 1) classifier confusion matrix obtained on the dataset resulting from feature transformation. The classifier accuracies obtained are 99.76% for the Normal class, 99.99% for U2R, 99.93% for DoS, R2L and 99.88% for the Probe attack class. The percentage of correctly classified instances is 99.75% for KNN with K = 1, which is slightly lower than the J48 result. Using the proposed method, the precision of U2R improves from 68.08% to 78.78%. However, the U2R class accuracy remains the same whether or not feature transformation is applied. For the other classes, there is not much difference in classifier accuracies. It is also observed that the recall values of the low-frequency classes U2R and R2L decrease slightly for KNN after feature transformation is applied. However, the overall performance of the classifier in terms of correctly classified instances is nearly the same. Figure 6 and Figure 7 show the BayesNet and Naïve Bayes confusion matrices, respectively, obtained on the resulting KDD dataset (i.e., 494021 instances with dimensionality 35) after feature transformation.
Figure 8 shows the accuracy (ACC), precision (PREC) and recall (RECALL) values recorded for the BayesNet classifier before and after applying the proposed feature transformation technique. It is observed that the accuracy, precision and recall values improve for the Normal and Probe classes when feature transformation is applied. Similarly, the accuracy and precision values improve for the U2R attack class and for the DoS class with the proposed approach. The precision of the BayesNet classifier also improves for the R2L class. In all other cases, the precision, recall and accuracy values do not improve but remain the same before and after feature transformation. Thus, it can be deduced that the BayesNet classifier improves in terms of overall classifier performance. In general, the experiments show that classifier performance achieved with the proposed feature transformation technique is better than the performance achieved without feature transformation.
This subsection discussed classifier performance before and after feature transformation on the KDD dataset with 494021 observations and 41 attributes. The next subsection compares the proposed approach to other recent approaches.
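The per-class accuracy, precision and recall values quoted in this subsection are read off confusion matrices. A minimal sketch of that calculation is given below, assuming the usual one-vs-rest definitions (whether the experiments used exactly these definitions is an assumption). The function name `per_class_metrics` is hypothetical, and the 2 x 2 matrix is made-up toy data, not a result from the paper.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest accuracy, precision and recall for each class.

    confusion[i][j] = number of class-i instances predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]                       # correctly predicted as k
        fn = sum(confusion[k]) - tp                # class k predicted as other
        fp = sum(row[k] for row in confusion) - tp  # other predicted as k
        tn = total - tp - fn - fp
        metrics[label] = {
            "accuracy": (tp + tn) / total,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics


# Made-up toy confusion matrix (rows: actual, columns: predicted).
toy = per_class_metrics([[8, 2], [1, 9]], ["normal", "attack"])
```

Under these definitions the accuracy of a rare class is dominated by its true negatives, which is consistent with a class such as U2R showing an accuracy near 99.99% while its recall is only 50%.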

B. COMPARISON WITH UTTAMA AND GARUDA
For all experiments discussed in this section, the similarity threshold is set to 0.9995, the initial deviation is set to 0.5, and 10-fold cross validation is used to evaluate model performance. Experiments are conducted to compare the performance of the proposed approach with UTTAMA [24] and GARUDA [19]. UTTAMA [24], proposed by Arun et al., is an evolutionary feature clustering approach for network intrusion anomaly detection which uses a fuzzy membership function for similarity computations. It is motivated by the contributions in [1], [19], [20]. The performance of the proposed approach is compared with UTTAMA by considering classifier evaluation parameters such as precision, recall, and correctly classified instances. Figure 9 shows the plot of the percentage of correctly classified instances with the proposed approach and UTTAMA for the KDD dataset with 494021 observations and 41 attributes. The overall accuracies of UTTAMA and the proposed approach are 99.982% and 99.99% respectively, while the percentages of correctly classified instances are 99.952% and 99.97% respectively. This indicates that the proposed method performs better than UTTAMA. Figure 10 depicts the plot of weighted and class-wise accuracies obtained using the proposed approach and UTTAMA [24] for both the normal and attack classes of the KDD dataset. It is observed that the accuracies of both the normal and attack classes using the proposed method are better than those of UTTAMA. The accuracies obtained using the proposed approach are 99.97% for Normal, 99.99% for U2R, 99.99% for DoS, 99.99% for R2L, and 99.99% for Probe, which are better than those of UTTAMA. It is visible that the low frequency attack classes U2R and R2L are efficiently identified using the proposed approach when compared to UTTAMA. Figure 11 gives the plot of accuracies obtained using the proposed approach and GARUDA for each class.
In [19], a feature clustering technique for reducing the dimensionality of the dataset is proposed, which uses the distance function GARUDA. Here, we use the proposed distance function for feature transformation instead of GARUDA. From the experiments conducted, a considerable improvement in overall accuracy is recorded for the Bayesnet, NaiveBayes and SMO classifiers. For the J48 classifier, the accuracy is 99.82% with GARUDA, whereas it is 99.97% with the proposed approach. For the KNN classifier, the accuracy is 99.89% with the proposed approach, whereas it is marginally higher (99.91%) with GARUDA. Overall, the proposed approach improves classifier accuracies when compared to the feature transformation technique based on the distance function proposed in [19].
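To make the role of the similarity threshold and the deviation more concrete, the following is an illustrative sketch of threshold-based feature grouping. It is not the distance function proposed in this paper, nor GARUDA: it substitutes the textbook Gaussian membership function exp(-((x - m) / sigma)^2), multiplied across dimensions, as a stand-in. The function names, the greedy single-pass grouping, the fixed deviation, the hard (sum-based) transformation and the example patterns are all assumptions made for illustration. Only the threshold 0.9995 and the deviation 0.5 come from the experiment settings.

```python
import math
from typing import List, Sequence


def gaussian_similarity(x: Sequence[float], mean: Sequence[float], sigma: float) -> float:
    """Textbook Gaussian membership, multiplied across dimensions (assumed form)."""
    return math.exp(-sum(((a - b) / sigma) ** 2 for a, b in zip(x, mean)))


def group_patterns(patterns: Sequence[Sequence[float]],
                   threshold: float = 0.9995,
                   sigma: float = 0.5) -> List[List[int]]:
    """Greedy single-pass grouping of feature patterns.

    A pattern joins the most similar existing cluster if the similarity to that
    cluster's mean reaches the threshold; otherwise it starts a new cluster.
    The deviation is kept fixed here, which is a simplification.
    """
    clusters: List[List[int]] = []   # indices of the features in each cluster
    means: List[List[float]] = []    # running mean pattern of each cluster
    for idx, p in enumerate(patterns):
        sims = [gaussian_similarity(p, m, sigma) for m in means]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            n = len(clusters[best])
            means[best] = [(m * n + v) / (n + 1) for m, v in zip(means[best], p)]
            clusters[best].append(idx)
        else:
            clusters.append([idx])
            means.append(list(p))
    return clusters


def transform(row: Sequence[float], clusters: List[List[int]]) -> List[float]:
    """Hard transformation (assumed): one new feature per cluster, summing its members."""
    return [sum(row[i] for i in members) for members in clusters]


# Hypothetical class-conditional patterns for three features (illustrative only).
patterns = [[0.90, 0.10], [0.905, 0.095], [0.20, 0.80]]
clusters = group_patterns(patterns)
reduced = transform([1.0, 2.0, 3.0], clusters)
print(clusters, reduced)
```

Under these assumptions a threshold this close to 1 only merges features whose patterns are almost identical, which is consistent with the modest reduction in dimensionality reported in the experiments.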
An interesting observation concerns the J48 classifier when the proposed distance function is used for feature clustering and feature transformation. The accuracies of the U2R and R2L attack classes are 99.99% and 99.98% for UTTAMA, and 99.99% and 99.99% for the proposed method. However, the precision values of the U2R and R2L attack classes are 78.94% and 96.43% for UTTAMA, whereas they are 68.42% and 98.27% for the proposed approach; that is, R2L precision is higher with the proposed approach while U2R precision is lower. From an overall perspective, the performance of the proposed approach appears better than that of UTTAMA.
Experiments are also conducted on the KDD dataset with 19 attributes [19], [24]. Figure 12 shows the plot of overall and class-wise accuracies for all five classes of the KDD dataset for UTTAMA and the proposed approach. It is observed that, for the KDD dataset [23] with 19 attributes, the accuracies of all classes and the overall classifier accuracy improve with the proposed approach. The overall accuracy of UTTAMA using the J48 classifier is 99.89%, whereas using the proposed approach it is 99.94%. Thus, the proposed approach also achieves better accuracies on the KDD dataset with 19 attributes.
Finally, the overall accuracies obtained using the proposed approach and CANN [23] on the KDD-19 dataset are compared. Using the CANN approach with K = 1 [23], the overall accuracy achieved is 99.46%, whereas the accuracy is 99.86% using our approach with K = 1. Thus, it can be deduced that the classifier accuracy of the proposed approach is improved when compared to CANN. These experiment results indicate that the proposed approach for network intrusion and anomaly detection performs better than the intrusion detection approaches GARUDA, CANN and UTTAMA.
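The 10-fold cross validation protocol used for the comparisons in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the folds are formed by a plain shuffled split (the paper does not say whether the folds are stratified), and the percentage of correctly classified instances is pooled over all ten test folds. The names k_fold_indices, pooled_accuracy and majority_baseline, as well as the toy labels, are hypothetical. A trivial majority-class baseline stands in for the actual classifiers.

```python
import random
from collections import Counter
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k disjoint folds covering all indices
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def pooled_accuracy(n: int,
                    count_correct: Callable[[Sequence[int], Sequence[int]], int],
                    k: int = 10) -> float:
    """Fraction of correctly classified instances, pooled over all test folds."""
    correct = sum(count_correct(train, test) for train, test in k_fold_indices(n, k))
    return correct / n


# Toy labels (illustrative only): 70 normal and 30 DoS instances.
labels = ["normal"] * 70 + ["dos"] * 30


def majority_baseline(train: Sequence[int], test: Sequence[int]) -> int:
    """Predict the most frequent training label; return the number of correct test predictions."""
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test)


acc = pooled_accuracy(len(labels), majority_baseline)
print(f"pooled accuracy of the baseline: {acc:.2f}")
```

Every instance is used for testing exactly once, so the pooled figure is comparable across approaches as long as the same folds are used for each of them.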

C. NSL-KDD DATASET WITH 41 ATTRIBUTES
Experiments are also conducted on the NSL-KDD dataset with 41 attributes. For the experiments discussed in this section, the similarity threshold is set to 0.9995, the initial deviation is set to 0.5, and 10-fold cross validation is used to evaluate model performance. The dimensionality of the dataset is 36 after feature transformation. Figure 14 and Figure 15 show the confusion matrices obtained for the J48 and KNN (K = 1) classifiers after performing feature transformation using the proposed approach. The class-wise accuracy for each class is also shown in the last column of each confusion matrix. From the confusion matrices of J48 and KNN shown in Figure 14 and Figure 15, it can be observed that the classifier accuracies for the U2R and R2L attack classes are high.
For instance, using the J48 classifier, the accuracies for the U2R and R2L classes are 99.97% and 99.92%, and the corresponding U2R and R2L accuracy values for the KNN classifier are 99.98% and 99.87% respectively. The precision and recall values for the J48 and KNN classifiers are depicted in Figure 16 and Figure 17 respectively.
The F-score values can be computed from the precision and recall values depicted in Figure 16 and Figure 17.
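As a brief illustration of that computation, the sketch below uses the standard F-beta definition, F = (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to the harmonic mean of precision and recall for beta = 1. The function name f_score and the input values are hypothetical; they are not read from Figure 16 or Figure 17.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention when the class has no true positives
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision and recall for one class (illustrative values only).
f1 = f_score(precision=0.80, recall=0.50)
print(f"F1 = {f1:.4f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a class with high precision but low recall still receives a modest F-score.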

V. CONCLUSION AND FUTURE WORK
In this paper, we have applied the proposed distance function to carry out feature clustering and to achieve feature transformation. Thus, dimensionality reduction is carried out via feature transformation. The distance function proposed in this work is designed by considering the basic Gaussian membership function. After achieving dimensionality reduction using the proposed feature extraction technique, we have applied classifier algorithms and evaluated their performance on the transformed datasets. Several experiments are conducted on the KDD dataset with 41 and 19 attributes and the performance of the classifiers is evaluated. The experiment analysis shows that the proposed approach achieves improved performance in terms of accuracy, precision and recall. One of the significant findings derived from the experiment results is that the accuracy and precision values of low frequency attack classes have substantially improved. This work is limited to proposing a new distance function and applying it to feature clustering and transformation, so as to demonstrate the importance of distance functions in machine learning models and to show how comparatively better classifier performance may be achieved if an appropriate distance function is employed. Experiments are performed on the KDD dataset with 41 and 19 attributes and on the NSL-KDD dataset with 41 attributes by considering several classifier algorithms. Classifier performance is evaluated in terms of accuracy, precision, recall and F-score. The experiment results and analysis indicate that our approach for anomaly detection using the proposed feature transformation technique performs better than the other detection methods from the literature that were considered.
As a future extension of the present work, we are currently studying the possibility of designing new decision-tree-based classifiers.