Design and Testing Novel One-Class Classifier Based on Polynomial Interpolation With Application to Networking Security

This work exploits the concept of one-class classifier applied to the problem of anomaly detection in communication networks. The article presents the design of an innovative anomaly detection algorithm based on polynomial interpolation and statistical analysis. The method is applied to datasets widely used in the scientific community for benchmarking, such as KDD99, UNSW-NB15 and CSE-CIC-IDS-2018, and is further evaluated on the recently released EDGE-IIOTSET 2022 dataset. The paper also reports experimental results showing that the proposed methodology outperforms classic one-class classifiers, such as Extreme Learning Machine and Support Vector Machine models, and rule-based intrusion detection systems such as SNORT. With respect to binary classifiers, this work has the advantage of not requiring any a priori knowledge about attacks, since it is based on the collection of normal data traffic only.


I. INTRODUCTION
A. MOTIVATIONS
There is a growing interest in the use of Machine Learning (ML) and Artificial Intelligence (AI) concepts for the development of Anomaly and Intrusion Detection Systems (ADS and IDS). Some recent works, such as [1]-[3], use ML and AI-based methodologies to explore various ways of detecting malicious attacks in computer networks. Other works in the cybercrime literature [4]-[7] have already demonstrated that ML and AI-based methodologies have the potential to outperform rule-based IDS tools, such as SNORT and WIRESHARK. This is mainly due to the flexibility of ML/AI models. In fact, rule-based algorithms need very deep knowledge of the attack mechanisms in order to elaborate a specific recognition path. This makes the design process inflexible, since a rule must be defined for each possible anomaly, which is unfeasible because the number of attack classes grows every day. Instead, ML/AI-based models do not require much knowledge about the attack and its mechanisms: based on the collected data, the attacks can be grouped, characterized and then recognized by an ML/AI algorithm. The process of learning from data is very flexible compared with the classic rule-based approach. Indeed, if the conditions change, the model can be re-trained on new data in order to recognize a new class of attack. A major limit of the ML/AI-based design process is the need for a priori knowledge about the relationship between the attack classes and the collected observations. In many applications this represents a practical limit, since it can be hard to collect anomalous data traffic when some attack classes cannot be replicated. This is often due to the system designer's lack of specific knowledge about the possible attacks.
The associate editor coordinating the review of this manuscript and approving it for publication was Firooz B. Saghezchi.
To overcome this limit, a new approach based on a one-class classifier is proposed in this paper. Indeed, a one-class classifier [8] does not require any knowledge about attacks and is based on the collection of only normal data traffic in the communication system of interest. The normal traffic is characterized and then processed by a threshold-based logic to determine whether subsequent observations are normal or anomalous.

B. RELATED WORKS
As anticipated, most works presenting an IDS based on ML/AI techniques exploit supervised learning paradigms, where a priori knowledge about the anomalous behaviour and the specific anomaly types is required. In addition, many works test their algorithms on a single dataset only, which limits the validity of the achieved results. For example, in [4] the authors mainly analysed the problem of binary classification on a single dataset. This obviously requires a priori knowledge of anomalous and normal traffic. This problem can be solved by means of one-class classification algorithms, which are also robust to new types of attacks. In [5]-[7] the use of ML models is presented and compared in terms of performance, but with reference only to the KDD99 dataset (or modified versions). However, these works do not contribute innovative techniques; they simply use existing models combined with known data manipulation/reduction techniques. The most critical points, however, are the failure to compare these models on different types of dataset, which is in fact the only way to verify the validity of the performance analyses, and the need to draw on an already labelled dataset to recognise the various types of attack. Similarly, [9]-[13] use supervised learning of models known in the literature, proposing minor modifications to the numerical optimisation algorithms and referring only to the UNSW-NB15 dataset for the evaluation of the obtained performance. Similar arguments apply to other works in the recent literature, such as [14]-[16], in which the authors present results on the performance of proposed methods or classical ML/AI models considering only one dataset, such as CSE-CIC-IDS-2018. The great limitation of these techniques lies in the fact that a priori knowledge of the types of threats that can affect the computer network is required.
This knowledge is not always easy to access. Furthermore, previous Anomaly and Intrusion Detection approaches [4]- [7], [9]- [16] can be easily bypassed by new attack techniques. Few works propose the use of one-class models, and even fewer propose a comparison across multiple datasets, as we propose in this paper. For example, in [17], [18] the use of Extreme Learning Machines (ELM) is proposed as an alternative to Auto-Encoders based on artificial neural networks, with the aim of decreasing learning times, memory requirements for saving weights in memory and computational complexity, in the sense of waiting times during the processing of new observations. The main problem with ELM models is that the transformation matrix is based on random processes that often do not apply, with a strong dependency on the analysed dataset. In literature there are also works [19], [20] proposing the use of Support Vector Machines (SVM) in a one-class version. However, these SVM-based works are characterized by higher waiting times and often modest performance in terms of False Positive Rate.

C. CONTRIBUTIONS
To overcome the limits of the state of the art, this paper proposes an anomaly detection technique based on pre-processing with feature reduction, polynomial interpolation and a one-class model that needs only normal behaviour data.
The contributions of this work are the following:
• design of an innovative one-class technique based on numerical computing algorithms combined with statistical analysis and machine learning.
• higher performance than other one-class techniques in the literature, tested on the three most commonly used datasets and on a very recent dataset representative of IIoT applications.
• accuracy essentially in line with results reported in the literature where binary classifiers are used, highlighting the most important strength of the proposed method: no a priori knowledge, in the form of collected observations of anomalous behaviour, is needed.
• exhaustive analysis of the algorithm's robustness and independence from the dataset to which it is applied, with tests on the KDD99, UNSW-NB15, CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022 datasets.

D. PAPER ORGANIZATION
The paper is organized as follows. Section II reviews the selected datasets (KDD99, UNSW-NB15, CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022) used for the design and verification phase and to assess the portability of the proposed technique to different scenarios. Section III describes the proposed algorithm, which includes multiple steps: preliminary dataset manipulation, application of feature reduction techniques, polynomial interpolation for normal behaviour recognition and the final one-class decision policy. Section IV presents the results achieved when applying the proposed method to the selected datasets. Section V discusses the obtained performance, proposing an interpretation based on quantitative and graphical results. Finally, Section VI reports the conclusions on the proposed approach and considerations on future work.

II. A BRIEF DESCRIPTION OF THE SELECTED DATASETS
In this section we report a brief description of the datasets to which the proposed method is applied. The choice of the selected datasets derives from a preliminary study of the state of the art on IDS issues, revealing that KDD99, UNSW-NB15 and CSE-CIC-IDS are the most important benchmarks for evaluating new algorithms [21]-[29]. We also apply the proposed methodology to a novel dataset generated and released in 2022, representative of IIoT applications.

VOLUME 10, 2022

A. KDD99
The Kaggle version of the KDD Cup 99 [30], [31]

B. UNSW-NB15
The UNSW-NB15 dataset [32] is a freely available online dataset, globally acknowledged as a valid benchmark for testing intrusion detection systems. It was created with the IXIA PerfectStorm tool in the Cyber Range Lab of Canberra University to generate a hybrid of real modern normal activities and synthetic contemporary attack behaviours. The tcpdump tool was used to capture 100 GB of raw traffic. The dataset contains 9 types of attacks. The Argus and Bro-IDS tools were used, and twelve algorithms were developed, to generate a total of 49 features. The total number of records in the dataset is 2540047, split into 4 .csv files. Besides the .csv files, the repository also provides the raw traffic records as .pcap files. The dataset size is 2540049 × 49, i.e. about 125 million elements.

C. CSE-CIC-IDS-2018
CSE-CIC-IDS-2018 is a dataset available online, created by the University of New Brunswick and widely used in the literature for testing and evaluating the performance of ADS/IDS. It includes seven different attack scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS (Distributed DoS), Web attacks, and infiltration of the network from inside [33], [34]. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments and includes 420 machines and 30 servers. The dataset includes the captured network traffic and system logs of each machine, along with several features extracted from the captured traffic using CICFlowMeter-V3. In the CSE-CIC-IDS-2018 dataset, the authors use the notion of profiles to generate datasets in a systematic manner, with detailed descriptions of the intrusions. The dataset size is 16233002 × 80, i.e. about 1.3 billion elements.

D. EDGE-IIOTSET 2022
EDGE-IIOTSET 2022 is a newly available online dataset [35]. In this work a new IoT and IIoT dataset collected from seven-layer tested including more than 10

III. POLYNOMIAL-BASED ONE-CLASS CLASSIFIER DESIGN
A. METHODOLOGY DESCRIPTION
Our approach is based on the idea of extracting polynomial features from the dataset of normal traffic observations. In this way, it is possible to characterize an area of normal behaviour of the system based on polynomial interpolation over the features used as a training dataset. The normality area is defined as the region bounded by the upper and lower extremes of the polynomials belonging to the training dataset. Once the normality area has been defined, it is reasonable to assume that the polynomials of normal traffic data not belonging to the training dataset are also contained within it. This means that most (if not all) of the points of the polynomial extracted from a new normal traffic observation belong to the area extracted in the training phase. Conversely, an observation of abnormal traffic (e.g. an attack) is expected to overlap only marginally with the area of normal observations: the number of points of the polynomial extracted from the attack that lie within the normality area will be noticeably smaller than for polynomials extracted from normal traffic. Thus, by defining an anomaly threshold, it is possible to distinguish whether a new observation is an attack or not. With this paradigm, it is not necessary to know all traffic types; it is enough to collect normal observations only to develop a one-class classifier. The steps of the algorithm are described in detail in the following sections.

B. PRELIMINARY DATASETS MANIPULATION
Regardless of the dataset to which our workflow is applied, some manipulation operations are performed on the feature values in order to make the subsequent steps easier. One of the first manipulations consists in assigning numerical values to features made up of symbolic values, such as the Timestamp or the number of the communication port. In the proposed method, a remapping of variables of the ''Label Encoding'' type [36] has been adopted. The reason for this choice is to keep the number of features in the datasets as low as possible at each operational step. Another necessary manipulation aims at reducing the likelihood of numerical errors. In particular, the ''min-max'' normalisation procedure is applied to the columns of the dataset, as shown in Eq. 1.
$$f_i \leftarrow \frac{f_i - \min(f)}{\max(f) - \min(f)} \qquad (1)$$
where $f$ denotes the starting column vector and $f_i$ its $i$-th component, whose value is reassigned according to the definition given. Following the normalisation of the feature values, observations containing NaNs, which carry no information content, are removed.
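The manipulations described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `label_encode`, `min_max_normalise` and `drop_nan_rows` are hypothetical, and mapping a constant column to a unit span (to avoid a division by zero) is an assumption that the text does not specify.

```python
import numpy as np

def label_encode(column):
    """Map each distinct symbolic value to an integer code (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(column)))}
    return np.array([mapping[v] for v in column], dtype=float)

def min_max_normalise(X):
    """Column-wise min-max scaling; also returns the scaling factors."""
    f_min = np.nanmin(X, axis=0)
    f_max = np.nanmax(X, axis=0)
    span = f_max - f_min
    span = np.where(span == 0, 1.0, span)  # assumption: guard constant columns
    return (X - f_min) / span, f_min, f_max

def drop_nan_rows(X):
    """Remove observations (rows) that contain at least one NaN."""
    return X[~np.isnan(X).any(axis=1)]
```

The scaling factors are returned because the same `min(f)` and `max(f)` values have to be reused when a new observation is processed.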

C. APPLYING FEATURES REDUCTION
In order to preliminarily reduce the number of features in the datasets, the PCA (Principal Component Analysis) [37] and MDS (Multi-Dimensional Scaling) [38] techniques are applied simultaneously during the dataset preparation and learning phases.

1) PRINCIPAL COMPONENT ANALYSIS
Let the original dataset be written in matrix form as $A = [A_1, A_2, \ldots, A_n]$, where $A_k$ is the $k$-th column vector of the matrix $A$. For each column of the dataset the mean value is computed and subtracted from $A_k$ itself.
The matrix $B$ is then defined as the matrix $A$ in which the mean value of each column has been subtracted element-wise from that column. This facilitates the computation of the covariance matrix.
The covariance matrix of the original dataset $A$ can be defined in terms of the columns of $B$: each of its components is the covariance between a pair of columns of $B$.
Once the covariance matrix has been computed, its eigenvalues and eigenvectors are derived in order to identify the new feature representation space and select the so-called principal components. The principal components are the minimum set of features that retains most of the information in the original dataset.
The new representation space is derived by multiplying the matrix $A$ by the eigenvector matrix.
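As an illustration, the PCA steps of this subsection can be written with NumPy as follows. This is a sketch under assumptions rather than the authors' code: the function name `pca_fit` and the parameter `keep` are hypothetical, the covariance is normalised by $N-1$ (the unbiased convention, which the text does not specify), and the default of 0.95 mirrors the 95% retention level quoted in the experimental section.

```python
import numpy as np

def pca_fit(A, keep=0.95):
    """Return column means, the retained eigenvectors and all sorted eigenvalues."""
    mean = A.mean(axis=0)
    B = A - mean                               # centred data matrix
    cov = (B.T @ B) / (A.shape[0] - 1)         # covariance matrix (assumed N-1 scaling)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum() # cumulative explained variance
    m = min(int(np.searchsorted(ratio, keep)) + 1, A.shape[1])
    return mean, eigvecs[:, :m], eigvals
```

Following the text, the reduced representation is then obtained as `A @ V`, where `V` is the second returned value.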
2) MULTI-DIMENSIONAL SCALING
Multi-Dimensional Scaling is an alternative feature transformation/reduction technique, based on different metrics with respect to PCA. The starting point is the same: the original dataset in matrix form $A$, from which the ''similarity'' matrix is defined.
A ''double centering'' matrix $C$ is also defined, which is used to build the matrix for the change of feature representation space.
The ''barycenter matrix'' $B$ is then obtained by applying a congruence linear transformation to the similarity matrix. The eigenvalues and eigenvectors of this matrix are computed in order to derive the matrix $T$ for the change of feature representation space.
In particular, the matrix $T$ collects the eigenvectors associated with the $m$ largest eigenvalues. The dataset represented in the new feature space can then be computed by means of $T$.
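For illustration only, the sequence of steps above matches classical (Torgerson) MDS, which can be sketched as follows. Several details are assumptions, since the corresponding expressions are not reproduced here: squared Euclidean distances between observations are used in place of the ''similarity'' matrix, the factor $-1/2$ and the scaling of the eigenvectors by the square root of the eigenvalues follow the classical MDS convention, and the function name `classical_mds` is hypothetical. The matrix is $N \times N$, so such a sketch is only practical on a sample of the observations.

```python
import numpy as np

def classical_mds(A, m):
    """Embed the rows of A in m dimensions with classical MDS."""
    N = A.shape[0]
    sq = np.sum(A**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)  # squared Euclidean distances
    C = np.eye(N) - np.ones((N, N)) / N               # double centering matrix
    B = -0.5 * C @ D2 @ C                             # congruence transformation
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:m]             # m largest eigenvalues
    T = eigvecs[:, order]
    lam = np.clip(eigvals[order], 0.0, None)          # guard tiny negative values
    return T * np.sqrt(lam), eigvals[order]
```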
D. POLYNOMIAL INTERPOLATION
Following the reduction of the number of features by one of the two methods described above, we propose the use of a polynomial interpolation technique as a further manipulation of the data and as the basis for a decision criterion. In particular, polynomial interpolation is used in our workflow as an additional step of feature reduction and transformation, so as to introduce a further degree of uniqueness for normal behaviour. The degree of the chosen polynomial is lower than the number of features that remain after the manipulation procedures (normalisation and elimination of observations with NaN) and the reduction by transformation of the representation space (via PCA or MDS). As polynomial interpolation technique we use the least squares criterion with equally spaced interpolation nodes [39]. Eq. 2 briefly reports the least squares method from which the polynomial coefficients are derived.
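A least squares fit on equally spaced nodes can be sketched as below. This is an assumption-laden illustration: the function name `fit_observation` is hypothetical, and the nodes are taken as the feature indices rescaled to $[0, 1]$, a choice made here only to keep the Vandermonde matrix well conditioned.

```python
import numpy as np

def fit_observation(features, degree):
    """Least squares polynomial coefficients a_0..a_degree for one observation."""
    features = np.asarray(features, dtype=float)
    nodes = np.linspace(0.0, 1.0, len(features))        # equally spaced nodes (assumed range)
    V = np.vander(nodes, degree + 1, increasing=True)   # columns: 1, x, x^2, ...
    coeffs, *_ = np.linalg.lstsq(V, features, rcond=None)
    return coeffs
```

Each observation, reduced to $m$ features by PCA or MDS, is thus summarised by `degree + 1` coefficients.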
For the $i$-th observation and for each coefficient, the confidence interval is calculated, as reported in the set of expressions in Eq. 3.
The coefficients of the upper and lower ''boundary'' polynomials can thus be defined as described in Eq. 4.
The upper and lower polynomial curves of the normality bound are then defined, as reported in Eq. 5.
To decide whether an observation is an anomaly in the polynomial representation domain, we evaluate the total number of points at which the polynomial curve of the current observation lies outside the bound, along the entire interval $f_1, \ldots, f_n$. To count the number of times the polynomial of the $i$-th observation lies outside the normality interval, it is sufficient to compare the polynomial with the normality bounds themselves over the definition interval.
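The construction of the boundary polynomials and the counting of out-of-bound points can be illustrated as follows. The exact form of the confidence interval is not reproduced here, so this sketch assumes a symmetric interval of `z` sample standard deviations around the mean of each coefficient; the names `boundary_coefficients` and `count_outside` are hypothetical. With non-negative nodes, the upper curve built from the upper coefficients never falls below the lower one.

```python
import numpy as np

def boundary_coefficients(coeffs, z=1.96):
    """coeffs: (N, d+1) coefficients of normal observations -> (lower, upper)."""
    mean = coeffs.mean(axis=0)
    std = coeffs.std(axis=0, ddof=1)
    return mean - z * std, mean + z * std   # assumed interval: mean +/- z * std

def count_outside(obs_coeffs, lower, upper, nodes):
    """Number of nodes at which the observation polynomial leaves the band."""
    V = np.vander(nodes, len(obs_coeffs), increasing=True)
    p_obs, p_low, p_up = V @ obs_coeffs, V @ lower, V @ upper
    return int(np.sum((p_obs < p_low) | (p_obs > p_up)))
```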

E. DECISION POLICY
The decision process we propose is based on the comparison between the interpolating polynomial $p_{obs}(x)$ associated with a new observation to be analysed (the ''on-line'' phase) and the normality limits $p_{upper}(x)$ and $p_{lower}(x)$. In particular, the number of times $p_{obs}(x)$ falls outside the normality limits is evaluated. The threshold for applying this decision criterion is evaluated in the offline phase, so as to obtain 100% performance on the portion of the dataset used for the construction of $p_{upper}(x)$ and $p_{lower}(x)$. Note that this threshold depends on the dataset under consideration. The values assumed by the polynomials are evaluated within the range of features derived from the PCA/MDS procedure, where the index of each feature is also an interpolation node.
In symbols, let $f_{new} = [v_1, v_2, \ldots, v_n]$ be the vector containing the values of the features in the initial representation base. This vector is processed by means of the PCA or MDS technique which, as described above, is basically equivalent to a multiplication by a matrix for the change of representation base. Given the transformation matrix $M_T \in \mathbb{R}^{n \times m}$, we obtain a transformed observation $f$ with $m \le n$ components. The components of the vector $f$ are used to derive the interpolating polynomial $p_{new\_obs}(f) = a_{0,new} + a_{1,new} f + \ldots + a_{m,new} f^m$, which is eventually processed by the function implementing the decision logic.
The decision-making policy can be summarised with the pseudo-code representation shown in List 3.
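A possible rendering of this decision logic in Python is given below. It is a sketch, not the listing referred to in the text: the names `is_anomaly` and `calibrate_threshold` are hypothetical, and the threshold calibration reflects one reading of the requirement of 100% performance on the training portion, namely the smallest threshold that accepts every normal training observation.

```python
import numpy as np

def is_anomaly(p_obs, p_lower, p_upper, threshold):
    """Flag an observation whose polynomial leaves the band at too many nodes."""
    outside = np.sum((p_obs < p_lower) | (p_obs > p_upper))
    return bool(outside > threshold)

def calibrate_threshold(P_train, p_lower, p_upper):
    """Smallest threshold for which all training (normal) rows are accepted."""
    outside = np.sum((P_train < p_lower) | (P_train > p_upper), axis=1)
    return int(outside.max())
```

Here `p_obs`, `p_lower` and `p_upper` are the values of the corresponding polynomials at the interpolation nodes, and `P_train` stacks the values for all training observations, one row each.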

F. PERFORMANCE INDEXES
For the evaluation of the detection performance of the models we use the classic indices of ROC (Receiver Operating Characteristic) analysis [40], as reported in the following.
In particular, Figure 1 represents the workflow of the construction phase of the boundary polynomials, $p_{upper}(f)$ and $p_{lower}(f)$, necessary for the decision process to be applied in the ''on-line'' phase. In the ''off-line'' phase, the observations of the portion of the dataset used for the construction of the limit polynomials are processed to derive the scaling factors, $\min(f)$ and $\max(f)$, for each of the columns. A coordinate transformation is then carried out by applying the PCA or MDS technique. The choice between PCA and MDS depends on the dataset; in our proposed flow, the choice falls on the technique that reduces the size of the representation space the most. As a final step, we apply polynomial interpolation, from which we derive the coefficients needed to define the boundary polynomials of normal behaviour. The degree of the polynomial is lower than or at most equal to the number of features downstream of PCA/MDS. The set of polynomial coefficients represents a new feature base. Being a non-linear transformation of the features derived from PCA/MDS, the coefficients also provide a further degree of uniqueness and hence of characterisation of normal versus abnormal behaviour.
As shown in Figure 2, the classification procedure (''on-line'' because it acts on a single new observation) inherits the parameters for the characterisation of normal traffic, in order to apply a decision criterion based on the polynomial description of each new observation. In particular, the new observation is scaled with the values $\min(f)$ and $\max(f)$ and subsequently transformed through the coordinate transformation matrices derived from the PCA/MDS reduction technique. An interpolating polynomial is then derived and finally compared with the boundary polynomials of normal behaviour. The decision-making policy is based on counting the points of the polynomial associated with the new observation that fall outside the previously calculated normality limits. Once a certain threshold, derived from preliminary statistical analysis, is exceeded, the observation is classified as an anomaly.
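The off-line and on-line workflows can be tied together in a compact sketch such as the one below. It is only an illustration under several assumptions: the class name `PolyOneClass` and its parameters are hypothetical, PCA is used as the reduction step, the band is taken as the coefficient mean plus or minus `z` standard deviations, nodes are rescaled to $[0, 1]$, and the degree is capped at the number of reduced features minus one.

```python
import numpy as np

class PolyOneClass:
    """Illustrative pipeline: scaling, PCA, polynomial fit, band, threshold."""

    def __init__(self, degree, keep=0.95, z=1.96):
        self.degree, self.keep, self.z = degree, keep, z

    def _coeffs(self, Z):
        # One least squares polynomial per row of Z, fitted on shared nodes.
        return np.linalg.lstsq(self.V, Z.T, rcond=None)[0].T

    def _count_outside(self, coeffs):
        P = coeffs @ self.V.T  # polynomial values at the nodes, one row each
        return np.sum((P < self.p_lower) | (P > self.p_upper), axis=1)

    def fit(self, X_normal):
        # Off-line phase: only normal traffic is used.
        self.f_min = X_normal.min(axis=0)
        span = X_normal.max(axis=0) - self.f_min
        self.span = np.where(span == 0, 1.0, span)  # assumption: constant columns
        Xs = (X_normal - self.f_min) / self.span
        B = Xs - Xs.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(B.T @ B / (len(Xs) - 1))
        order = np.argsort(eigvals)[::-1]
        ratio = np.cumsum(eigvals[order]) / eigvals.sum()
        m = min(int(np.searchsorted(ratio, self.keep)) + 1, Xs.shape[1])
        self.M = eigvecs[:, order[:m]]               # change of representation base
        nodes = np.linspace(0.0, 1.0, m)             # feature indices rescaled to [0, 1]
        d = min(self.degree, m - 1)                  # degree capped by feature count
        self.V = np.vander(nodes, d + 1, increasing=True)
        coeffs = self._coeffs(Xs @ self.M)
        mean, std = coeffs.mean(axis=0), coeffs.std(axis=0, ddof=1)
        self.p_lower = self.V @ (mean - self.z * std)
        self.p_upper = self.V @ (mean + self.z * std)
        self.threshold = int(self._count_outside(coeffs).max())
        return self

    def predict(self, X):
        # On-line phase: True marks an observation classified as anomalous.
        Z = ((X - self.f_min) / self.span) @ self.M
        return self._count_outside(self._coeffs(Z)) > self.threshold
```

A typical use would be `model = PolyOneClass(degree=10).fit(X_normal)` followed by `model.predict(X_new)`, where both arrays hold already encoded, NaN-free feature vectors.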

IV. EXPERIMENTAL RESULTS
This section shows the results obtained by applying the method proposed in Section III to the four datasets described in Section II. The first step is the application of the feature reduction techniques discussed in Section III.C. As shown in Fig. 4, Fig. 5, Fig. 6 and Fig. 7, the best choice between PCA and MDS for the preliminary reduction of the problem size depends on the dataset. In general, it is not possible to say a priori which of PCA and MDS reduces the number of features more.
Figure 4 shows that the PCA technique yields 18 features in the new representation base. In particular, we start from 42 features for the original KDD99, pass to about 70 features after encoding the features with qualitative values, and then return, as shown, to 18 features while maintaining at least 95% of the initial information content.
Figure 5 shows that the PCA technique yields 20 features in the new representation base. In particular, we start from 49 features for the original UNSW-NB15, pass to about 120 features after encoding the features with qualitative values, and then return, as shown, to 20 features while maintaining at least 95% of the initial information content.
Figure 6 shows that the MDS technique yields 8 features in the new representation base. In particular, we start from 80 features for the original CSE-CIC-IDS-2018, pass to about 150 features after encoding the features with qualitative values, and then return, as shown, to 8 features while maintaining at least 95% of the information content.
Figure 7 shows that the PCA technique yields 14 features in the new representation base. In particular, we start from 63 features for the original EDGE-IIOTSET 2022, pass to about 69 features after encoding the features with qualitative values, and then return, as shown, to 14 features while maintaining at least 95% of the initial information content.
Table 1 summarizes the number of new features obtained for each dataset after applying the PCA and MDS techniques. Note that PCA also achieves good performance in terms of feature reduction for CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022; hence PCA is a suitable technique to adopt if the same feature reduction method must be used for all the datasets.
The marked reduction in the number of features is plausibly due to the fact that the original datasets are based not only on the characterisation of the flow of data and packets but also on the topology of the sub-net in which the data traffic circulates (i.e. on the numbers of input/output ports, which are probably interpreted by PCA/MDS as being of little informative significance).
The next step is to construct the polynomials containing the normal behaviour, which describe the upper and lower limits over the entire interpolation interval.
Only the results obtained from the best preliminary feature reduction technique are considered when building the polynomial model for feature interpolation.
The quality of the result depends strongly on the degree of the chosen polynomial, which can be interpreted as a hyper-parameter of our method. This choice also depends on the dataset to which the classifier is applied. Figure 8 shows how the result varies with the degree of the interpolating polynomial when our method is applied to the KDD99 dataset. The figure shows how important the choice of this parameter is for the efficiency of the discrimination between anomalous and normal traffic. For example, in the particular case of KDD99, the degree of the polynomial chosen to interpolate the values of the 18 features resulting from the PCA procedure is 10 (top right in Figure 8).
Similarly, Figure 9 shows the graphical analysis of different choices of the degree of the interpolating polynomial for the construction of the normality limits. In the particular case of UNSW-NB15, the PCA procedure reduces the problem size to 20 features. As can be seen in Figure 9, the best choice for the degree of the interpolating polynomial is 10. The decision on the degree of the polynomials is of course made on the basis of the performance indices; the graphical analysis is provided for clarity of exposition.
As highlighted in Fig. 8 and Fig. 9, the quality of the interpolation depends on the choice of the degree of the polynomial. In fact, for degrees that are too high the typical Runge phenomenon [41] appears, which is also due to the choice of equally spaced interpolation nodes.
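The Runge phenomenon can be reproduced on the textbook function $1/(1+25x^2)$, which is unrelated to the datasets used here and serves only as an illustration. The sketch below considers the extreme case in which the degree equals the number of equally spaced nodes minus one (pure interpolation); the helper name `equispaced_fit_error` is hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def runge(x):
    return 1.0 / (1.0 + 25.0 * x**2)

def equispaced_fit_error(degree, func, n_eval=1001):
    """Max absolute error of the degree-`degree` interpolant on equispaced nodes."""
    nodes = np.linspace(-1.0, 1.0, degree + 1)
    coeffs = npoly.polyfit(nodes, func(nodes), degree)
    x = np.linspace(-1.0, 1.0, n_eval)
    return float(np.max(np.abs(npoly.polyval(x, coeffs) - func(x))))

# Error of the interpolant for increasing degree
errors = {d: equispaced_fit_error(d, runge) for d in (4, 8, 12, 16)}
```

In this example the maximum error grows with the degree instead of shrinking, because of oscillations near the ends of the interval, which is why the degree has to be selected against the performance indices rather than simply maximised.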
The situation shown in Figure 10 is quite analogous: the interpolation result is compared when varying the degree of the polynomial for the CSE-CIC-IDS-2018 dataset. In this case, the feature reduction via MDS brings the problem size in the new representation base down to 8, and the degree of the interpolating polynomial that returns the best result in terms of ROC performance indices is 6.
Figure 11 shows four polynomial interpolations for EDGE-IIOTSET 2022. As in the previous three cases, the interpolations are compared with each other in order to determine the polynomial degree that maximizes the ROC performance indices. In this case the best result is obtained with a polynomial degree equal to 8.
We would like to emphasise again that in the proposed workflow, the construction of the boundary polynomials for traffic normality is based only on data classified a priori as normal in the original datasets. Anomalous observations play no part in the construction of the polynomials and decision thresholds. Table 2 shows the results obtained by applying the technique proposed in this paper (Poly) against ELM and SVM used as one-class classifiers. We denote by ''Poly BR'' the case in which the ''best reduction'' technique is chosen, and by ''Poly PCA'' the case in which PCA is chosen a priori. Table 2 also reports the results obtained with non-optimal feature reduction. In particular, for ''Poly PCA'' on CSE-CIC-IDS-2018 there is a slight degradation in absolute performance, but ''Poly PCA'' still outperforms state-of-the-art methods such as ELM and SVM. This result suggests that it is possible to choose PCA a priori, which reduces the dependence on the dataset itself. Note also that the results in Table 2 refer to the best choice, among those shown, of the degree of the interpolating polynomial, and are for the ''off-line'' phase (i.e., the testing phase).
The comparison shows that our proposed method outperforms the other two models for all the considered datasets (KDD99, UNSW-NB15, CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022) and for all the metrics defined in Section III.F. The most interesting result is the reduced false positive rate (FPR) compared to ELM and SVM, which is in fact one of the crucial points in the development of new algorithms for anomaly detection.
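For reference, the usual confusion-matrix indices behind this kind of comparison can be computed as in the sketch below. Accuracy and FPR are the indices explicitly discussed in this paper; including the true positive rate and the precision is an assumption, and the function name `roc_indices` is hypothetical.

```python
def roc_indices(y_true, y_pred):
    """y_true, y_pred: sequences of booleans, True meaning anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,        # detection rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,        # false alarms on normal traffic
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }
```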
It can also be stated that our results are fully comparable with the state of the art on ADS/IDS systems based on binary classifiers, which need to be trained with inputs from both classes (normal and anomalous). In fact, [4]-[7] adopt KNN and ANN models and obtain an accuracy of 97% (only on the KDD99 dataset); in [9]-[13] and [42], the authors report the results of binary classifiers applied to UNSW-NB15, highlighting a mean accuracy level of around 95%, in some cases with a high FPR; [14]-[16] and [43] report a comparison of supervised machine learning (SVM, DT, DA) and deep learning (ANN, CNN, Autoencoder) models applied to CSE-CIC-IDS-2018, revealing an accuracy level close to 98% for binary classifiers; in [35] the authors report the results of binary classifiers applied to EDGE-IIOTSET 2022 using different types of machine learning (DT, RF, SVM) and deep learning (DNN) methods, which provide an accuracy level of 99%. In summary, our method reaches a very similar level of detection performance, with a lower FPR than some of the results in the literature, and with the advantage of requiring no a priori knowledge of anomalous behaviour.
Note that, in order to avoid dependency issues with respect to the computational platform, the SVM and ELM models have been re-implemented following the design specifications reported by the authors cited in Section I.B. The datasets were also processed through SNORT, configured with ''community rules'', obtaining much lower accuracy. For KDD99, SNORT achieves about 61% accuracy, much lower than the 97.83% of our method; for UNSW-NB15, SNORT achieves about 39% vs. 96.59% for our method; for CSE-CIC-IDS, about 44% vs. 95.5%; and for EDGE-IIOTSET 2022, about 50% vs. 97.27%. SNORT can certainly be configured with custom rules to obtain better results. However, this highlights how rule-based tools are limited and lack flexibility for a user without specific knowledge of attack mechanisms. In summary, the method we propose is much more flexible than rule-based tools like SNORT or classic methods based on supervised learning. Even in the case of a new (never seen) anomaly, the behaviour will tend to leave the confidence interval defined by $p_{upper}$ and $p_{lower}$. Therefore, even without knowing the specific mechanism of the new anomaly, it is possible to detect it.
A further analysis of the performance of the proposed method (Poly BR) is summarised in Table 3, which reports the average time needed by the algorithm to process each new observation, intended as a feature vector.
The processing time of the proposed method is compared to those of the SVM and ELM one-class classifiers used as benchmarks. It should be noted that this processing time depends on the dataset, since operating times depend on the problem size. The achieved results show that the processing time required by the proposed method (Poly BR) is lower than those of the SVM and ELM classifiers for UNSW-NB15, CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022. For the KDD99 dataset the processing time of the proposed method is comparable to that of the ELM technique and lower than that of the SVM classifier.
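An average per-observation processing time of this kind could be measured along the following lines. This is a generic sketch, not the measurement procedure used for Table 3: the function name `mean_time_per_observation` and the parameter `repeats` are hypothetical, and taking the best of several passes is an assumption made to reduce the influence of background load.

```python
import time
import numpy as np

def mean_time_per_observation(classify, X, repeats=3):
    """Average seconds spent by `classify` on one row of X (best of `repeats`)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for row in X:
            classify(row)          # one call per feature vector, as in the on-line phase
        best = min(best, time.perf_counter() - start)
    return best / len(X)
```

Here `classify` stands for any of the compared one-class classifiers wrapped as a function of a single feature vector.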
Computational times were also tested for a not optimal choice of the preliminary technique of feature reduction, highlighting that for KDD99, CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022 the differences are not appreciable,  while there is an increase in processing times in case of UNSW-NB15, of about +10%. Reasonably, this is due to the difference in the number of features obtained downstream of the PCA and MDS techniques for the specific dataset. Obviously the difference in choice between PCA and MDS does not affect linearly the processing time, so the deterioration remains limited.
Notice that to verify that the processing time comparison between Poly, SVM and ELM is platform-independent, the test in Table 3 has been applied on two different processors, and we achieved the same results. The testing platforms were an Intel Core i3-6300 CPU 3.80 GHz with two cores (the one used to achieve the results in Table 2) and an Intel Core i7-8550U CPU 1.80 GHz with four cores.
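As an illustration of how an average per-observation processing time such as the one in Table 3 could be measured, the following Python sketch times a generic classifier over a set of feature vectors. The function mean_time_per_observation and the placeholder dummy_predict are hypothetical and stand in for any of the compared models; the actual measurement procedure used for Table 3 is not specified here.

```python
import time

def mean_time_per_observation(predict, observations, repeats=5):
    """Average wall-clock seconds spent classifying one observation.

    The whole set is processed `repeats` times to smooth out timer noise.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        for obs in observations:
            predict(obs)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(observations))

def dummy_predict(obs):
    """Placeholder classifier (hypothetical): thresholds the feature sum."""
    return sum(obs) > 10.0

observations = [[0.1] * 20 for _ in range(1000)]
seconds = mean_time_per_observation(dummy_predict, observations)
print(f"{seconds * 1e6:.2f} microseconds per observation")
```

Running the same script on each platform, and comparing the ranking of the models rather than the absolute times, is one simple way to check that a comparison does not depend on the processor.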

V. RESULTS DISCUSSION
As discussed above, the first step of the proposed method requires evaluating the feature reduction techniques; in particular, we proposed PCA and MDS. Both techniques drastically reduce the number of features with respect to the original dataset and simplify the subsequent phase of polynomial interpolation design. As shown in the previous section, the outcome of the feature reduction process strongly depends on the dataset, with widely different results. As discussed, the best choice is the technique that reduces the original dataset the most while maintaining the same amount of information. Experimental results highlight that in KDD99 the two techniques yield a fairly similar reduction of the feature space (18 with PCA vs. 22 with MDS); in UNSW-NB15 there is a wide difference between the two techniques (20 with PCA vs. 36 with MDS); in CSE-CIC-IDS-2018 the two methods provide practically the same space reduction (10 with PCA vs. 8 with MDS); in EDGE-IIOTSET 2022 there is again a wide difference between the two methods (15 with PCA vs. 42 with MDS). If the user needs to choose a single reduction technique, PCA is the best choice for generalising the reduction phase. In this work, we propose polynomial interpolation as a further feature selection and reduction method, applied after PCA or MDS. The idea of applying polynomial interpolation is to further emphasize the differences between normal traffic behaviour and anomalies by introducing a non-linear transformation. The polynomial interpolation phase requires an assessment procedure for choosing the best polynomial degree, since classification results depend strongly on it. The best choice is based on the most efficient combination of performance indexes (from ROC analysis). In particular, for KDD99 and UNSW-NB15 the optimal polynomial degree is 10, while for CSE-CIC-IDS-2018 it is 6 and for EDGE-IIOTSET 2022 it is 8.
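The assessment of the polynomial degree can be sketched as a small search over candidate degrees, each scored with an index derived from ROC analysis. The Python example below is only a sketch under assumptions: it uses Youden's J (true positive rate minus false positive rate) as a stand-in for the combination of performance indexes, a band of plus or minus k standard deviations around the fitted polynomial, and synthetic data. The names inside_fraction, youden_j and select_degree are hypothetical.

```python
import numpy as np

def inside_fraction(train, data, degree, k=3.0):
    """Fraction of each row's points lying within the band fitted on `train`."""
    # A scaled abscissa keeps the polynomial fit well conditioned.
    x = np.linspace(-1.0, 1.0, train.shape[1])
    coeffs = np.polyfit(x, train.mean(axis=0), degree)
    center = np.polyval(coeffs, x)
    sigma = (train - center).std()
    return np.mean(np.abs(data - center) <= k * sigma, axis=1)

def youden_j(train, val_normal, val_attack, degree, min_inside=0.8):
    """TPR - FPR when rows with too few in-band points are flagged."""
    tpr = np.mean(inside_fraction(train, val_attack, degree) < min_inside)
    fpr = np.mean(inside_fraction(train, val_normal, degree) < min_inside)
    return float(tpr - fpr)

def select_degree(train, val_normal, val_attack, degrees):
    """Return the candidate degree with the highest score, plus all scores."""
    scores = {d: youden_j(train, val_normal, val_attack, d) for d in degrees}
    return max(scores, key=scores.get), scores

# Synthetic stand-in data (hypothetical, not one of the paper's datasets).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 20)
profile = np.sin(3.0 * x)
train = profile + rng.normal(0.0, 0.1, size=(400, 20))
val_normal = profile + rng.normal(0.0, 0.1, size=(200, 20))
val_attack = profile + 0.5 + rng.normal(0.0, 0.1, size=(200, 20))

best_degree, scores = select_degree(train, val_normal, val_attack, [1, 3, 5, 7, 9])
print(best_degree, scores)
```

In this toy setting a low degree underfits the normal profile, which widens the band and hides the attack, so the score improves as the degree grows until the fit is adequate. Labelled attack data is used here only to assess the degree, not to build the band.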
In terms of data interpretation, we can suppose that the complexity of the dataset increases as the number of features  This fact depends on the information content. Since each dataset was generated by a different type of network traffic analyzer, the information content is a priori different. The data in CSE-CIC-IDS-2018 are basically traffic statistics, and in EDGE-IIOTSET 2022 a large share of the records consist of zeros, whereas in KDD99 and UNSW-NB15 the data relate more to the interconnection of nodes within the same network and each row contains few zeros. From our point of view, KDD99 and UNSW-NB15 are more complex than CSE-CIC-IDS-2018 and EDGE-IIOTSET 2022 with regard to data interpretation. For further interpretation of the proposed workflow and results, several graphs on the feature reduction and polynomial choice analysis are shown. The analysis of the ROC indexes reveals that the proposed method, based on an innovative one-class classifier, achieves higher performance than the SVM and ELM models. Note that our method is based on statistical learning theory and numerical methods, which increases the interpretability of the entire workflow compared with a fully AI-based methodology. Moreover, the computational time analysis suggests that the proposed method requires less computational effort than SVM and ELM, making it appropriate even for embedded applications, unlike most of the work presented in the literature.

VI. CONCLUSION AND FUTURE WORKS
In summary, this paper reports the procedure for the design of a one-class classifier based on polynomial interpolation, a mathematical tool that introduces uniqueness in anomaly recognition and is not currently used in the literature for this purpose. The entire workflow was presented from both a formal and an operational point of view, highlighting the advantages over the one-class classifiers used in the literature, such as ELM and SVM. We have shown the experimental results obtained by applying the proposed method to four datasets. KDD99, UNSW-NB15 and CSE-CIC-IDS-2018 are among the most used in the literature for testing anomaly detection algorithms, while EDGE-IIOTSET 2022 is one of the newest datasets, created from traffic extracted from a real IIoT network. We have shown that the algorithm we developed achieves higher performance than the classical ELM and SVM, which represent the standard for one-class classifiers. We also studied the computational complexity of the algorithm in terms of processing time per observation, noting that in this respect too the results are better than those of SVM and ELM. The proposed technique was also compared, for all considered datasets, with a rule-based IDS such as SNORT, and the results show that our one-class classifier with polynomial interpolation achieves much better accuracy. The presented work leaves room for further extension and elaboration of the procedure. One of the goals is to define a version that is also suitable for multi-class problems, in order to compare our workflow with ML/AI models based on a supervised learning flow. Furthermore, the implementation on embedded platforms will be addressed in order to deal with safety problems in applications of a different nature as well, such as in-vehicle communication systems and mechatronic systems of industrial interest.
As discussed in detail in the previous sections, the paper focuses on the design of a one-class classifier, as it is more important to detect anomalies than to specify their type. As attack scenarios are constantly evolving, this approach is crucial in safety-critical application contexts such as defense. In any case, as a future development of the proposed method, it would be interesting to extend it to the multi-class case. In particular, we propose the conceptual architecture shown in Figure 12.
This architecture exploits the one-class method on several branches. In particular, assuming that the mechanisms of the anomalies are perfectly known and a sufficiently large dataset can be collected, it will be possible to extract specific features that can be associated with the different predicted anomaly classes. Each branch of the architecture will thus handle membership in one specific class of the classification problem. The output of each branch will be the number of points that fall within the confidence interval around the polynomial associated with that class, which measures the similarity to that polynomial. This information plays a role similar to the output of a softmax layer in a neural network, which instead provides an estimate of the probability of membership in each class of the problem. In the proposed architecture, the maximum among the values produced by the branches will be associated with the class that is closest in terms of polynomial representation.
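The branch-and-maximum logic of this conceptual architecture can be sketched in a few lines of Python. This is only an illustration under assumptions: each branch is represented by a pair of arrays (lower and upper band), the band is a fixed offset around a hand-picked polynomial, and the class names "normal", "dos" and "probe", the coefficients and the half-width are all hypothetical.

```python
import numpy as np

def branch_counts(obs, bands):
    """For each class branch, count the points of `obs` inside its band."""
    return {
        name: int(np.sum((obs >= lower) & (obs <= upper)))
        for name, (lower, upper) in bands.items()
    }

def classify(obs, bands):
    """Pick the class whose branch reports the largest in-band count."""
    counts = branch_counts(obs, bands)
    return max(counts, key=counts.get), counts

# Hypothetical per-class polynomials (highest-order coefficient first).
x = np.linspace(-1.0, 1.0, 10)
class_polynomials = {
    "normal": [0.0, 0.0, 0.0],
    "dos": [1.0, 0.0, 0.0],
    "probe": [0.0, 1.0, 0.0],
}
half_width = 0.2  # assumed half-width of every confidence band
bands = {
    name: (np.polyval(c, x) - half_width, np.polyval(c, x) + half_width)
    for name, c in class_polynomials.items()
}

observation = x ** 2 + 0.05  # lies close to the "dos" polynomial
label, counts = classify(observation, bands)
print(label, counts)
```

Unlike a softmax output, the counts are not normalised probabilities; dividing each count by the number of points would give a score between 0 and 1 per branch, but the scores would still not sum to one across classes.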

ACKNOWLEDGMENT
(Pierpaolo Dini and Sergio Saponara contributed equally to this work.)