Classification of Abnormal Signaling SIP Dialogs through Deep Learning

Due to the widespread use of the Session Initiation Protocol (SIP) in the signaling of cellular networks and voice over IP multimedia systems, avoiding security vulnerabilities in SIP systems is essential for operators to reach satisfactory service availability levels. This work focuses on the detection and prediction of abnormal signaling SIP dialogs as they evolve. Abnormal dialogs fall into two classes: those observed in the past, and thus already known and labeled as abnormal, and unknown ones, i.e., specific sequences of SIP messages that were never observed before. Taking advantage of recent advances in deep learning, we use Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) to detect and predict dialogs already observed. Additionally, based on the outputs of the LSTM neural network, we propose two different classifiers capable of identifying unknown SIP dialogs, given the high level of vulnerability they may represent for the SIP operation. The proposed approaches achieve higher SIP dialog detection scores in a shorter time when compared to a reference probabilistic approach. Moreover, the proposed detectors of unknown SIP dialogs achieve a detection probability above 94%, indicating their capability to detect a significant number of unknown SIP dialogs in a short amount of time.


I. INTRODUCTION
Currently, the Session Initiation Protocol (SIP) plays a fundamental role as a signaling protocol of IP Multimedia Subsystem (IMS) services [1] and Voice over IP (VoIP) services [2]. Apart from the classical vulnerabilities mainly associated with authentication schemes, the SIP protocol can also be exploited by malicious users to take advantage of the request/response interaction integrated into the sequential behavior of the protocol. The exploration of different signaling patterns can effectively expose vulnerabilities of the multiple SIP servers in the SIP path, which can then be used to perpetrate novel types of attacks [3]-[5] known as SIP signaling attacks. Consequently, there are SIP signaling attacks already known, which can be detected or predicted through the sequence of SIP signaling messages already exchanged between the peers of the established session. Additionally, new types of attacks can be perceived through unseen sequences of SIP signaling messages, highlighting the importance of detecting SIP sequences never observed before.
Motivated by the importance of predicting or detecting SIP signaling sequences established between SIP peers, in this work we study the performance of state-of-the-art deep learning techniques to detect SIP sequences already known as well as unknown SIP sequences. More specifically, Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are used to classify a sequence of observed SIP messages into a known SIP dialog. Additionally, the outputs of the LSTM neural network are used to detect whether the current observations are part of a known or unknown SIP dialog. When compared to the recent literature on the detection of abnormal SIP dialogs, the innovation of our work is mainly centered on the adoption of LSTM neural networks and the way they are applied to the SIP specifics, as well as the performance gains reported in this work. To the best of our knowledge, our work is the first to characterize the performance of LSTM deep learning structures for abnormal SIP dialog detection and prediction. The innovative aspects of the paper are listed as follows:
• We propose a deep learning scheme capable of recognizing the SIP signaling patterns of observed SIP sequences already known. The learning structure is formed by either one or two recurrent neural network layers to model the relation of the temporal events, and an output dense layer to identify the SIP signaling pattern;
• The outputs of the recurrent neural network are used as a set of features to identify the signaling patterns incorrectly predicted. Two classifiers are proposed to distinguish between trained dialogs and unknown dialogs, either through the skewness and kurtosis central moments or the maximum value of the neural network outputs;
• The proposed solutions to predict or detect the SIP dialogs are evaluated through different experiments and metrics to identify the appropriate neural network structure and its hyperparameters. The experiments assess the ability to predict SIP signaling sequences while the messages are received over time and the capacity to detect SIP dialogs already observed in the past. Additional experiments are conducted to identify the most accurate classifier, assessing the probability of correctly identifying a trained and an unknown dialog, along with other decision metrics;
• The results obtained with the deep learning approach are compared with a machine learning approach proposed in [6], which computes the most probable SIP signaling patterns through an n-gram Hidden Markov Model (HMM). The comparison presented in this paper includes the SIP dialog prediction and detection probabilities, the classification of unknown dialogs, and the computation time. The comparison shows that deep learning halves the computation time. In addition, the prediction probability of the deep learning approach improves on the HMM approach by 17.04%, and the proposed detectors achieve a detection probability of unknown SIP dialogs above 94%.

Regarding the structure of the paper, we start by introducing related work in Section II. Section III introduces the system model and describes the proposed LSTM models. Section IV presents the experimental results and the performance of the proposed solutions. Finally, Section V concludes the paper.

II. RELATED WORK

A. SIP SECURITY
The security of the SIP protocol [7] has been analyzed in several papers [5], [8], [9]. It is well known that the different types of attacks can lead to service interruption or destruction and to the undue consumption of SIP resources previously allocated for other purposes. A significant amount of work on SIP anomaly detection focuses on malformed SIP messages and their possible consequences in terms of effective SIP attacks. Several techniques have already been proposed to cope with malformed and malicious SIP messages, including the use of firewalls capable of detecting intrusions [10], specific learning schemes [11], and comparative approaches based on the statistics of different patterns [12]. Another source of SIP attacks is related to the SIP authentication schemes [13]. Several authentication schemes have been proposed for SIP, including multiple-factor authentication methodologies [14]. Flooding attacks also account for a representative number of service interruptions. Multiple solutions have been proposed to minimize the effects of flooding attacks, including threshold-based schemes that identify the attacks by comparing the SIP traffic patterns with previous traffic patterns observed during normal SIP operation [15]. SIP parser vulnerabilities are also an attractive target for attackers. In this case, the SIP messages can be modified to decrease the efficiency of the servers and/or even block the processing and memory resources, and the solutions to mitigate these types of attacks are usually based on a prior classification of the received SIP messages before they are parsed by the servers [16].
The kind of SIP vulnerabilities explored in this paper is related to the SIP signaling logic, where malicious users can explore the diversity of SIP systems to take advantage of defective implementations [5]. The SIP signaling vulnerabilities have been considered in [17], where the authors have proposed a debugger tool to analyze the flow of received SIP messages to be further categorized into groups of compliant dialogs and non-compliant ones. The scheme proposed in [3] to mitigate SIP signaling attacks is based on the contextual information of the SIP traffic, similar to the solution proposed in [4], where the interaction of the SIP peers and their specific timings are compared to prior data to identify significant deviations.
In our work, we are motivated by the latest advances in deep learning tools. Recently, the use of machine learning and deep learning techniques brought a plethora of unprecedented innovations. Learning was adopted in [18] to classify IP traffic based on the statistics of different flows. The work [19] has proposed an unsupervised detection scheme of spam over internet telephony. VoIP systems were also the main target of the works presented in [20], [21], where deep learning systems were proposed for the detection of possible steganography in VoIP streams. Deep learning was also used in [22] to detect VoIP traffic in tunneled and anonymous networks, in [23] to identify if voice calls were originated from VoIP systems or cellular/fixed voice networks, and in [24], [25] to assess the quality of VoIP calls.
When compared to the works in [3], [4], [17], our work considers neither a fixed probabilistic model of the SIP operation nor fixed rules that describe the SIP interaction. Our goal is to devise an automatic detection and prediction system based on deep learning that is capable of detecting known and unknown SIP dialogs. While known SIP dialogs are already labeled and their level of vulnerability can be computed based on prior knowledge, the detection of unknown SIP dialogs is of high importance to detect novel attacks.

B. SIP PROTOCOL
The SIP protocol was proposed for signaling multimedia sessions established by multiple peers. The signaling is implemented through the exchange of SIP messages. To initiate an interaction, a peer sends a SIP request message containing the indication of its type through the SIP method field in the SIP message header. The peer receiving the SIP request answers with a SIP response message that includes a reply code in the message header. A SIP request exchanged between SIP peers forms a SIP transaction, which includes the SIP request and any responses to it. A SIP dialog is formed by multiple SIP transactions and represents the sequence of SIP signaling operations exchanged between the SIP peers over time. Each SIP dialog is unequivocally identified through the SIP Call ID field in the message header. In this work, we assume that the peers and the SIP servers forming the path between the peers have access to the SIP messages exchanged by the peers and can read the headers of the SIP messages to identify the Call ID and the type of the SIP requests and responses capable of characterizing a specific dialog.
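As an illustration of how the fields above identify a dialog, the following sketch (our own, not part of the paper's implementation; the header names follow the SIP specification) extracts the Call ID and the message type from a raw SIP message:

```python
# Illustrative sketch: extract the dialog identifier (Call-ID) and the
# message type (request method or response code) from a raw SIP message.
def parse_sip_message(raw: str):
    """Return (call_id, msg_type) for a raw SIP message."""
    lines = raw.split("\r\n")
    start_line = lines[0]
    if start_line.startswith("SIP/2.0"):
        msg_type = start_line.split(" ")[1]   # response code, e.g. "200"
    else:
        msg_type = start_line.split(" ")[0]   # request method, e.g. "INVITE"
    call_id = None
    for line in lines[1:]:
        if line.lower().startswith("call-id:"):
            call_id = line.split(":", 1)[1].strip()
            break
    return call_id, msg_type

req = "INVITE sip:bob@example.com SIP/2.0\r\nCall-ID: a84b4c76e66710\r\n\r\n"
print(parse_sip_message(req))   # ('a84b4c76e66710', 'INVITE')
```

Messages sharing the same Call-ID would then be grouped into one dialog and fed to the detector as a sequence of message types.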

III. DEEP LEARNING MODEL FOR SIP SIGNALLING PATTERNS CLASSIFICATION
This section describes in Subsection III-A the recurrent neural network models adopted to predict and detect SIP dialogs. The detection of unknown dialogs is based on statistical classification models and it is described in Subsection III-B.
In the proposed approach we consider that the exchanged SIP messages, denoted by m k , are captured by a SIP server or SIP peer that runs the detector/estimator scheme, creating an observed sequence of SIP messages over time, denoted by n k ∈ X , which is used as the input of the learning model. Using the Call ID information contained in the header of each SIP packet to identify the SIP dialog, the model uses the observed sequence n k to predict or detect the most probable SIP dialog identifier y k ∈ Y, where Y denotes the output state space of the predictable SIP dialogs. The output y k is compared with statistical information previously collected to validate the proposed model. A table of symbols is given in Table 1 to introduce the notation adopted in this section.

A. LSTM RNN MODELS
Two LSTM RNN models were identified in an iterative fashion to predict and detect the most likely SIP dialog identifier. The first LSTM RNN model, illustrated in Figure 1(a), comprises one LSTM layer and a Dense layer. The LSTM model was chosen due to its ability to process temporal sequences. LSTM models are recurrent models, meaning that whenever a new element of the input sequence is processed the model always takes into account the previous elements of the sequence. After the LSTM layer, an output Dense layer is used to decode the LSTM output into the most probable SIP dialog. In the second model, represented in Figure 1(b), two LSTM layers are considered to increase the number of degrees of freedom to identify other existing relations between each sequence being trained. Furthermore, to prevent each model from becoming overfitted, we have used a dropout probability block and an early stop training condition.

TABLE 1: Table of symbols.
L o : Length of an observation o k .
L N : Length of a pad sequence n k .
L M : Length of the encoded SIP message m k .
n : Number of zeros added into the pad sequence.
H 1 : Hypothesis 1 (classifier detects an unknown dialog).
µ S : Mean of the skewness of the trained dialogs.
µ K : Mean of the kurtosis of the trained dialogs.
Before describing the structure of the LSTM RNN models, we introduce how the SIP protocol was modeled and the definitions required to describe the SIP dialog prediction and detection. Considering the characteristics of the SIP protocol and how the multimedia sessions are created, the proposed scheme for prediction/detection of the SIP dialogs can be applied by the SIP user agents or in the SIP servers traversed by the SIP messages.

Definition 1. A SIP message carried in a SIP packet and denoted by m k , k ∈ M = {1, 2, ..., M }, is a SIP request or SIP response of a specific type. We adopt the symbol M to represent the total number of SIP requests plus responses. Finally, M represents the set of the possible types of SIP requests and SIP responses.

A SIP message can be formed either by a numerical code representing the type of the SIP response or a text field indicating the type of the SIP request. To use the type of the SIP message as an input of the learning process, we encode each SIP message m k using the One Hot Encoder algorithm.

Definition 2. An encoded SIP message m i is represented by a Boolean vector that univocally identifies the type of the SIP message m i . The encoded message is obtained using a One Hot Encoder [26]. The Boolean vector has length L M .
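A minimal sketch of the one-hot encoding step (our own illustration; the message-type alphabet below is invented, while in the dataset the alphabet has L M symbols including the padding symbol):

```python
# Illustrative one-hot encoding of SIP message types. The alphabet here is
# a small example; the real alphabet holds every request/response type
# observed in the dataset plus the reserved padding symbol.
MESSAGE_TYPES = ["<pad>", "INVITE", "ACK", "BYE", "100", "180", "200"]

def one_hot(msg_type: str):
    """Return a Boolean vector that univocally identifies msg_type."""
    vec = [0] * len(MESSAGE_TYPES)
    vec[MESSAGE_TYPES.index(msg_type)] = 1
    return vec

print(one_hot("INVITE"))  # [0, 1, 0, 0, 0, 0, 0]
```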
The SIP messages are exchanged for a given purpose, originating different transactions. A SIP dialog is completed when the multimedia session created by the user agent peers is terminated.

Definition 3. A SIP dialog, denoted by d k , is the sequence of encoded SIP messages d k = < m (1) , m (2) , ..., m (L d ) >, where m (j) represents the j-th encoded message of the sequence. The length of the SIP dialog is represented by L d . The SIP messages forming the SIP dialog contain the same Call ID string as well as the sender and receiver addresses in the packet's header.

Although a SIP dialog is only defined when all SIP messages are exchanged, it is assumed that the model can estimate the dialog when only part of the SIP dialog's messages have been exchanged. Therefore, instead of considering only sequences with length L d , the model can process their subsequences, denoted by o k and referred to as observations. Each observation is padded into a fixed-length pad sequence n k . The length of the pad sequences is denoted by L N .
Until now, we considered that an observed sequence is formed only by SIP messages. However, with the transformation proposed in Definition 5, a padding symbol is added to each o k . Thus, besides the encoding of each SIP message, the padding symbols are also encoded according to Definition 2. Therefore, the length of the encoded SIP message m k is L M = M + 1 to account for every type of SIP message (M ) and the zero-padding symbol.
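The padding transformation can be sketched as follows (our own illustration; the value of L N and the padding symbol are placeholders):

```python
# Illustrative right-padding of an observation o_k into a fixed-length
# pad sequence n_k. L_N is a placeholder; in the paper it is a model
# hyperparameter tied to the longest dialog.
L_N = 8

def pad_observation(observation, pad_symbol="<pad>"):
    """Append n padding symbols so the sequence has length L_N."""
    n = L_N - len(observation)          # n: number of padding symbols added
    return observation + [pad_symbol] * n

o_k = ["INVITE", "100", "180", "200", "ACK"]
print(pad_observation(o_k))
# ['INVITE', '100', '180', '200', 'ACK', '<pad>', '<pad>', '<pad>']
```

Each symbol of the padded sequence, including the padding symbol, is then one-hot encoded, which is why L M = M + 1.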
Next, we describe the input and output state spaces used in the learning and prediction/detection of the SIP dialogs. The computation of the predicted SIP dialog is equivalent to the regression problem ŷ k = f (n k , β), where the estimate function f (.) is defined by iteratively computing the weights of the LSTM (β) during the training period. Once trained, the LSTM neural network identifies how close the observation n k is to each dialog in the output space. Depending on the observation length L o , the model is either predicting the SIP dialog (L o < L d ) or detecting it (L o = L d ). Finally, the training steps of each LSTM RNN model are described in Tables 2 and 3.

TABLE 2: Training steps of LSTM RNN model 1.
Step 1: An input sequence n k of length 1 × L N × L M is generated by the One Hot Encoder and the Pad Sequence.
Step 2: The LSTM layer processes each encoded SIP message m k of the padded sequence n k and returns a 1 × N sequence h 0 of real numbers in [−1, 1].
Step 3: The model discards the LSTM outputs with probability P .
Step 4: The Dense layer receives the outputs from the Dropout block and generates an output vector of length 1 × N of real numbers in [0, 1].

TABLE 3: Training steps of LSTM RNN model 2.
Step 1: Similar to Step 1 in Table 2.
Step 2: The first LSTM layer processes each encoded SIP message m k of the padded sequence n k and returns a 1 × L N × N sequence h 0 of real numbers in [−1, 1].
Step 4: Similar to Step 3 in Table 2.
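Assuming the layers follow the steps above, model 1 could be sketched in TensorFlow/Keras as below. The dropout probability and the softmax output activation are our assumptions, not the tuned values of the paper; L N , L M , and N reflect the dataset description (maximum dialog length 56, 17 message types plus padding, 1043 trained dialogs). Model 2 would add a second LSTM layer, with return_sequences=True on the first so it emits the full 1 × L N × N sequence.

```python
# Sketch of LSTM RNN model 1 (assumed hyperparameters, not the paper's
# tuned configuration): one LSTM layer, a dropout block, and a Dense
# output layer over the N known dialog identifiers.
import tensorflow as tf

L_N, L_M, N = 56, 18, 1043   # pad length, encoded message length, dialogs
P = 0.2                      # dropout probability (assumed value)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(L_N, L_M)),            # Step 1: padded one-hot n_k
    tf.keras.layers.LSTM(N),                     # Step 2: 1 x N outputs in [-1, 1]
    tf.keras.layers.Dropout(P),                  # Step 3: discard with prob. P
    tf.keras.layers.Dense(N, activation="softmax"),  # Step 4: scores in [0, 1]
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```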

B. UNKNOWN SIP DIALOGS DETECTOR
During the training stage, the LSTM RNN models acquire the ability to differentiate each SIP dialog, referred to as a trained dialog. However, when a SIP dialog that was never seen during the training stage, referred to as an unknown dialog, is presented to the inputs of the LSTM RNN model, the neural network also generates output values. The methodology to identify unknown dialogs is based on the statistical properties of the LSTM RNN output values, and the rationale behind the detection is the statistical dissimilarity of the outputs when the input is a known or an unknown SIP dialog.
The first classifier detects possible anomalies by looking at the maximum value of the LSTM RNN model outputs. Whenever a sequence is predicted, the maximum output value is compared with the average of the maximum values obtained for the trained/known dialogs. Depending on the maximum value of the outputs, the classifier decides if it is a trained/known dialog or an unknown dialog. In terms of detection, we define the hypotheses H 0 and H 1 . Considering that the hypothesis H 0 represents the detection of a known SIP dialog (previously trained) and the hypothesis H 1 represents the detection of an unknown SIP dialog, the classification of a predicted output sequence is stated as

H 0 : max(ŷ k ) ≥ λ M ,
H 1 : max(ŷ k ) < λ M ,

where λ M is the mean of the maximum value of the N LSTM outputs obtained for each dialog during the training stage and max(ŷ k ) represents the highest LSTM output value of the dialog to be classified.
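As a toy illustration of the maximum-value decision rule (the threshold value is the λ M reported in Section IV; the output vectors are invented):

```python
# Sketch of the first classifier: decide H0 (trained dialog) when the
# highest LSTM output reaches the threshold lambda_M, otherwise H1 (unknown).
def classify_max(outputs, lam_M=0.99985):
    """Return 'H0' for a trained/known dialog, 'H1' for an unknown one."""
    return "H0" if max(outputs) >= lam_M else "H1"

print(classify_max([0.0001, 0.99991, 0.0002]))  # confident peak -> H0
print(classify_max([0.40, 0.35, 0.25]))         # spread-out mass -> H1
```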
In the second classifier, all outputs are used as a source of information for the classification of unknown dialogs. The second classifier is based on statistical metrics computed from the outputs of the LSTM RNN neural network, particularly the skewness and kurtosis standardized central moments. Therefore, whenever a SIP dialog is detected, the skewness and the kurtosis of the LSTM RNN outputs are computed and compared with the thresholds given by λ S = µ S − σ 2 S and λ K = µ K − σ 2 K , respectively. The variables µ S and µ K represent the mean of the skewness and kurtosis of the LSTM RNN outputs obtained for the trained dataset, and σ 2 S and σ 2 K denote their respective variances. Thus, as in the previous classifier, whenever a sequence is classified two hypotheses are tested, representing a trained dialog (hypothesis H 0 ) or an unknown dialog (hypothesis H 1 ). The hypotheses are written as

H 0 : Skew(ŷ k ) ≥ λ S and Kurt(ŷ k ) ≥ λ K ,
H 1 : otherwise,

where Skew(ŷ k ) and Kurt(ŷ k ) represent the skewness and kurtosis of the LSTM RNN outputs of the SIP dialog to be classified.
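The second classifier can be sketched as below. The skewness and kurtosis are computed directly as standardized central moments; the threshold values passed in the example are illustrative only (in the paper λ S and λ K come from the mean and variance of the trained dialogs' moments):

```python
# Sketch of the second classifier: a trained dialog concentrates the output
# mass on one identifier, which yields high skewness and kurtosis.
import statistics

def moment(xs, p):
    """p-th standardized central moment of xs."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mu) / sd) ** p for x in xs) / len(xs)

def classify_moments(outputs, lam_S, lam_K):
    """Return 'H0' if both moments reach their thresholds, else 'H1'."""
    skew, kurt = moment(outputs, 3), moment(outputs, 4)
    return "H0" if skew >= lam_S and kurt >= lam_K else "H1"

peaked = [1.0] + [0.0] * 9                     # output mass on one dialog
flat = [0.12, 0.08, 0.11, 0.09, 0.10, 0.10, 0.13, 0.07, 0.10, 0.10]
print(classify_moments(peaked, 2.0, 6.0))      # H0 (illustrative thresholds)
print(classify_moments(flat, 2.0, 6.0))        # H1
```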

IV. PERFORMANCE EVALUATION
In the following subsections, we evaluate the performance of the proposed LSTM RNN models to predict or detect a SIP dialog formed by a sequence of observed SIP messages - objective (a). Furthermore, considering that unknown dialogs might be observed, we evaluate the performance of the detectors proposed in Subsection III-B to classify them - objective (b). Both objectives are important to detect signaling SIP attacks. With objective (a), the model is capable of classifying the dialogs it already knows, labeled as safe, anomalous, or according to different vulnerability ranks. Consequently, a model with higher detection and prediction performance can be leveraged to obtain more accurate results with regard to the classification of safe or harmful dialogs previously known. Additionally, with objective (b), the model gains the ability to recognize whether the observed SIP dialog was considered during the training stage or is unknown, the latter representing the case when it should be analyzed by a domain expert to assess its vulnerability level.
Regarding the organization of this section, Subsections IV-B and IV-C evaluate the objective (a) by characterizing the classification performance of the SIP dialogs already trained so far. Subsection IV-D addresses the objective (b) by evaluating the capability of detecting unknown SIP dialogs. The experimental methodology is presented in Subsection IV-A.

A. EXPERIMENTAL METHODOLOGY AND DATASETS
To evaluate the performance of each model in the following experiments, we adopted the SIP dataset created by Nassar et al. [27]. The dataset was selected to enable the comparison between the proposed LSTM RNN models and the performance obtained using the n-gram Hidden Markov Model (HMM) described in [6].
The SIP dataset is composed of two datasets: one for the non-anomalous dialogs and another for the anomalous dialogs. The non-anomalous dataset contains 18782 SIP dialogs created by 249 user agents. The 18782 dialogs correspond to a total of 1492 unique SIP dialogs, of which 66.23% only occur once. Furthermore, each dialog is formed by a combination of at most 17 types of unique SIP messages, and the length of the combination varies between 3 and 56. As in [6], we have considered the non-anomalous dataset for training and testing the LSTM RNN model prediction and detection performance. The non-anomalous dataset was divided into training and test datasets in a proportion of 80/20. The test dataset contains the last 20% of the dialogs exchanged by each user, and the training dataset contains the remaining 80%. Regarding the anomalous dataset, it contains 152 unique SIP dialogs representing possible attacks.
Some of the LSTM RNN topological parameters are based on the distribution of the dialogs of the training dataset. The number of unique SIP dialogs (N ) used in the training stage is 1043 (not the 1492 in the entire non-anomalous dataset, due to the 80/20 proportion). Therefore, some dialogs (more precisely, 449) are only contained in the test dataset and are not used during the training stage. Besides the number of unique SIP dialogs N , the remaining parameters adopted in the LSTM RNN models are described in Table 4. The LSTM RNN models were implemented in TensorFlow 2.0 running on a 64-bit Ubuntu 20.04 system over an Intel Core(TM) i5-5200U CPU @ 2.20GHz with 8 GB of RAM and a GeForce 840M GPU.

B. DETECTION PERFORMANCE
Table 5 presents the achieved detection probability. The results indicate that the LSTM RNN models achieve a similar detection probability. Although the proposed models 1 and 2 are capable of detecting all SIP dialogs of the training dataset, some of the SIP dialogs of the test dataset are not detected because they were not included in the training dataset. Regarding the detection probabilities, we observe that they are identical to the results achieved with the HMM approach with the MFdC criteria proposed in [6]. The computation time required to classify the SIP dialogs is of high importance, since it represents the time required to run the LSTM RNN model during the detection of SIP dialogs. To show the potential of the proposed solution, we measured the amount of time each LSTM RNN model needs to compute the output for a given observed sequence. The Cumulative Distribution Function (CDF) of the computation time required to detect each SIP dialog in the non-anomalous dataset is represented in Figure 2 for models 1 and 2, which are compared with the times obtained with the detection scheme proposed in [6] (HMM). The results indicate that the HMM model achieves lower computation times for approximately 50% of the dialogs in the dataset.
However, the times for the remaining dialogs are much higher than the times achieved with the LSTM RNN models. In the HMM approach, the detection of the SIP dialogs is based on the Viterbi algorithm, which computes the most probable SIP dialog according to the number of observed SIP messages. Afterward, a backward search is used to obtain the output sequence of the model. Thus, the complexity of the algorithm is O(γ M L o N 2 ), where γ denotes the length of the n-gram sequence presented to the HMM model. As described by the computational complexity expression, the computation time of the HMM is a function of the observation length: lower computation times are achieved for shorter observations and vice versa. Regarding the LSTM RNN models, the decision is made through a mapping function that does not depend on the observation length, i.e., O(1). Finally, the average computation time is 240 ms, 116 ms, and 217 ms for the HMM model, LSTM RNN model 1, and LSTM RNN model 2, respectively. The difference between the two LSTM RNN models is related to the number of parameters they have.
As the LSTM RNN model becomes more complex, the time needed to compute the output increases, as observed for the LSTM RNN model 2.

C. PREDICTION PERFORMANCE
This subsection evaluates the models' ability to estimate the most probable SIP dialog when the observed sequence is still being transmitted, i.e., as the SIP dialog evolves. To this end, the LSTM RNN models were retrained using a prediction dataset (during 120 epochs for model 1 and 57 epochs for model 2). The prediction dataset is based on the dataset used during the detection. However, instead of considering that there is only one observed sequence per SIP dialog, we have considered L d sequences per dialog, representing the subsequences from the instant the first SIP message is observed until the observed sequence reaches the L d SIP messages that form the SIP dialog. Therefore, each SIP dialog is decomposed into the observed sequences o 1 = < m (1) >, o 2 = < m (1) , m (2) >, ..., o L d = < m (1) , m (2) , ..., m (L d ) >. As a consequence, the 18782 dialogs of the non-anomalous dataset are decomposed into the sequences that create the prediction dataset. Thus, the train and test datasets are formed by 132855 and 43503 sequences, respectively.
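The decomposition of a dialog into its prefix observations can be sketched as follows (message names are illustrative):

```python
# Sketch of the prefix decomposition used to build the prediction dataset:
# a dialog of length L_d yields L_d observations o_1, ..., o_{L_d}.
def prefixes(dialog):
    """Return every prefix of the dialog, from length 1 to len(dialog)."""
    return [dialog[:i] for i in range(1, len(dialog) + 1)]

d_k = ["INVITE", "100", "180", "200", "ACK"]
for o in prefixes(d_k):
    print(o)
# ['INVITE']
# ['INVITE', '100']
# ... up to the full dialog
```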
As in the previous subsection, the performance of the LSTM RNN models is evaluated through the probability of the models' output ŷ k being identical to the correct SIP dialog identifier y k . However, unlike the previous subsection, the prediction probability (P E ) is computed for the observed sequences with L o < L d , while the detection probability considers only the observed sequences with L o = L d . Table 6 presents the prediction probability P E of each LSTM RNN model in the training and test datasets. The results indicate that the prediction performance of the LSTM RNN models is higher than the one obtained with the HMM approach. The different prediction probabilities between the LSTM RNN and the HMM models are related to how the observed sequences are transformed in these two approaches. In the case of the LSTM RNN model, each observed sequence o k is stuffed with zeros at the end to form a fixed-length pad sequence n k , where all the SIP messages are orthogonalized through the One Hot Encoder. In the HMM approach, the observed sequence is also stuffed with zeros. But in this approach, the zeros are placed at the beginning and at the end of the observation, and no guarantee of orthogonality with all the different sequences is given; thus, a lower performance is expected.
Regarding the prediction performance of each LSTM RNN model, we conclude that as the complexity of the model increases so does the prediction probability. The theoretical upper bound of the prediction probability is also indicated in Table 6. The theoretical value of P E is the result of the summation of the number of occurrences of the most frequent dialog for each observed sequence o k divided by the size of the dataset. Therefore, we conclude that the LSTM RNN results are closer to the theoretical upper bound than the ones obtained with the HMM approach.
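The theoretical upper bound described above can be sketched as follows (the observation/dialog pairs are invented; in the paper the bound is computed over the prediction dataset):

```python
# Sketch of the upper bound on P_E: an ambiguous observation can only be
# answered with the dialog it most frequently belongs to, so the bound sums
# the count of the most frequent dialog per observation over the dataset size.
from collections import Counter

def upper_bound(pairs):
    """pairs: iterable of (observation, dialog_id) tuples."""
    by_obs = {}
    for obs, dialog_id in pairs:
        by_obs.setdefault(obs, Counter())[dialog_id] += 1
    return sum(max(c.values()) for c in by_obs.values()) / len(pairs)

# 'INVITE' maps to two dialogs, so at most 3 of the 4 samples can be right.
data = [("INVITE", "d1"), ("INVITE", "d1"), ("INVITE", "d2"), ("INVITE 100", "d1")]
print(upper_bound(data))  # 0.75
```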
Given that the prediction probability is computed considering that each SIP dialog is formed by L d subsequences (o 1 , ..., o L d ), in Figure 3 we plot the prediction probability conditioned on the number of received SIP messages (L o < L d ). The results show that the LSTM RNN model 2 outperforms the HMM with the MFdC criteria when the number of observed SIP messages is between 1 and 11. Afterward, their probabilities are similar. Therefore, we conclude that the LSTM RNN model predicts more SIP dialogs for a lower number of received messages. Regarding the behavior of the prediction probability for the LSTM RNN model, its performance can be divided into two regions: L o < 15 and L o ≥ 15. In the first region, the prediction probability gradually increases as the length of the observed sequences increases and, consequently, as the number of likely SIP dialogs in the output space decreases. However, when the number of received SIP messages is 15, the prediction performance decreases. The justification for the lower probability is related to the higher occurrence of the dialogs with L d ≤ 15, in comparison with the dialogs with L d > 15. Besides that, there is a higher number of unique dialogs of length above 15. Nevertheless, as the length of the observation increases so does the prediction probability. Furthermore, when L o = 31 the LSTM RNN model 2 can predict all the SIP dialogs with L d ≥ 32. Figure 4 presents the prediction probability conditioned on the amount of information received so far (L o /L d ), i.e., the length of each observation normalized with respect to its dialog length. The results show the amount of information needed to predict each dialog. The curves were obtained by predicting every observed sequence from each SIP dialog d k .
The results show that the LSTM RNN model can predict each observed sequence with less available information than the HMM approach. This conclusion is supported by the prediction probability being much higher than the one obtained for the HMM model when L o /L d ≤ 42.86%; after that value, an identical performance is observed for both models. Additionally, the minimum amount of information needed to predict a SIP dialog is 3.571% for the LSTM RNN model 2 and 10.71% for the HMM with MFdC. Finally, to predict 50% of all the SIP dialogs, the LSTM RNN model 2 and the HMM with MFdC need approximately 64.29% and 67.86% of the available information, respectively.

D. DETECTION OF UNKNOWN SIP DIALOGS
Despite the advantages of the LSTM RNN models demonstrated so far in terms of computation time and prediction of SIP dialogs, the same cannot be concluded for the detection of unknown dialogs. An indication of the inability to detect unknown dialogs is shown in Table 5, where both LSTM RNN models assigned an incorrect SIP dialog identifier to 13.64% of the sequences from the test dataset. The misdetection of these SIP dialogs is due to the nonexistence of those dialogs in the input and output state spaces, leading the model to assign them the identifier of the most similar SIP dialog contained in Y. In the HMM approach, on the other hand, whenever an observed sequence different from the ones in the input state space is detected, no output is returned by the prediction algorithm.
Regarding the proposed classifiers, presented in Subsection III-B, we seek to characterize their ability to detect the sequences from the anomalous dataset and the unknown dialogs included in the test dataset. The classification of each SIP dialog into a trained/known dialog or an unknown dialog is based on the statistical features collected from the output of the LSTM RNN model. In the first classifier, the statistical information collected is the maximum value of the LSTM RNN model output, which is depicted in Figure 5. In the figure, the one-dimensional distribution is replicated on both axes and the data are differentiated according to their characteristics: anomalous dialogs (anomalous dataset), unknown dialogs (13.64% of the test dataset), and trained dialogs (training dataset and 86.36% of the test dataset). The results in the figure indicate that the trained dialogs have lower uncertainty and a higher maximum value in comparison with the other classes, because the LSTM RNN model was trained to detect those dialogs. Figure 6 illustrates the computed threshold (λ M = 0.99985) and the classifier performance.
Regarding the classification performance, there are four possible outcomes: the dialog is correctly classified as a trained dialog (true positive), incorrectly classified as a trained dialog (false positive), or classified as an unknown dialog (true negative or false negative). According to the results illustrated in Figure 6, the classifier cannot completely separate the two classes. The existence of false positives despite the high threshold value can be explained by the similarity between the unknown and the trained dialogs, which leads to higher LSTM RNN model output values.
In the second classifier, the classification of the SIP dialogs is based on the skewness and kurtosis, i.e., the standardized central moments of the LSTM RNN model output. The statistical information collected for this classifier is represented in Figure 7 through the normalized skewness and kurtosis. As in the previous classifier, the selection of the threshold values took into consideration the distribution of the trained dialogs, especially their average and variance. Figure 8 illustrates the classifier performance and the threshold values computed with the proposed detection model (λ_K = 0.99967 and λ_S = 0.99973). As with the previous classifier, the skewness and kurtosis threshold-based classifier cannot fully differentiate the two classes; the justification for the missed detections is identical to the one presented for the maximum value threshold-based classifier.
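The standardized moments used by the second classifier can be computed directly from the output probability vector, as in the sketch below. The normalization applied before thresholding in Figures 7 and 8 is not detailed here, so the example only shows the raw moments and the qualitative behavior; the probability vectors are hypothetical.

```python
import numpy as np

def standardized_moments(p):
    """Skewness and kurtosis (3rd and 4th standardized central moments)
    of an LSTM output probability vector."""
    p = np.asarray(p, dtype=float)
    mu = p.mean()
    sigma = p.std()
    skew = np.mean((p - mu) ** 3) / sigma ** 3
    kurt = np.mean((p - mu) ** 4) / sigma ** 4
    return skew, kurt

# A sharply peaked output (trained dialog) yields larger skewness and
# kurtosis than a flatter output (candidate unknown dialog).
peaked = [0.98, 0.01, 0.005, 0.005]
flat   = [0.40, 0.30, 0.20, 0.10]
s1, k1 = standardized_moments(peaked)
s2, k2 = standardized_moments(flat)
print(s1 > s2 and k1 > k2)  # True
```

This is why both moments discriminate between the classes: a dialog the network recognizes produces a heavily right-skewed, heavy-tailed output distribution, while an unknown dialog spreads the probability mass.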
Finally, Table 7 presents the performance metrics used to compare the proposed classifiers. The metrics are based on the four possible outcomes already presented (confusion matrix): true positive, true negative, false positive, and false negative. Considering the confusion matrix results only, we observe that the maximum value threshold-based classifier correctly classifies more trained dialogs, while the skewness and kurtosis threshold-based classifier distinguishes more unknown dialogs. These results are also validated by the sensitivity and specificity metrics, since the former quantifies the probability of correctly classifying a trained dialog among all the trained dialogs, while the latter represents the probability of correctly classifying an unknown dialog among all the unknown dialogs. Regarding precision and accuracy, the second classifier achieves higher precision, while the first one achieves higher accuracy. The F1-score, appropriate when the results are obtained from unbalanced data and the classifier's outcome is binary, is higher for Classifier 1. Thus, the performance achieved by the two classifiers is very close, and the superiority of either one effectively depends on the considered performance metric.
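For reference, the metrics compared above follow directly from the four confusion-matrix counts, as sketched below. The counts used in the example are hypothetical, not the values of Table 7; here "positive" denotes a trained dialog and "negative" an unknown dialog, matching the outcome definitions given earlier.

```python
def classifier_metrics(tp, tn, fp, fn):
    """Standard metrics from the confusion matrix, where 'positive'
    means a trained dialog and 'negative' an unknown dialog."""
    sensitivity = tp / (tp + fn)                  # trained dialogs recovered
    specificity = tn / (tn + fp)                  # unknown dialogs recovered
    precision   = tp / (tp + fp)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

# Hypothetical counts for illustration only.
m = classifier_metrics(tp=90, tn=45, fp=5, fn=10)
print(round(m["sensitivity"], 3), round(m["specificity"], 3))  # 0.9 0.9
```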

V. CONCLUSIONS
A deep learning approach is proposed in this work to detect and predict known and unknown SIP dialogs. The proposed solution is based on an LSTM neural network, which can predict and detect SIP dialogs already observed. Two detectors are also proposed to identify SIP dialogs never observed before. Adopting a publicly available SIP dataset, we have assessed the performance of the proposed classifier and detectors. Several performance metrics were evaluated, including the detection and prediction probabilities and the computation time. The experimental results were compared to a probabilistic-based solution, showing that the proposed methods achieve higher SIP dialog detection scores in a shorter time. Moreover, the detection probability of unknown SIP dialogs is above 94%, indicating the capability to detect a significant number of unknown SIP dialogs in a short amount of time.