Adversarial Data Augmentation for HMM-Based Anomaly Detection

In this work, we concentrate on the detection of anomalous behaviors in systems operating in the physical world and for which it is usually not possible to have a complete set of all possible anomalies in advance. We present a data augmentation and retraining approach based on adversarial learning for improving anomaly detection. In particular, we first define a method for generating adversarial examples for anomaly detectors based on Hidden Markov Models (HMMs). Then, we present a data augmentation and retraining technique that uses these adversarial examples to improve anomaly detection performance. Finally, we evaluate our adversarial data augmentation and retraining approach on four datasets showing that it achieves a statistically significant performance improvement and enhances the robustness to adversarial attacks. Key differences from the state-of-the-art on adversarial data augmentation are the focus on multivariate time series (as opposed to images), the context of one-class classification (in contrast to standard multi-class classification), and the use of HMMs (in contrast to neural networks).

robotic systems and Cyber-Physical Systems (CPSs) is that they are often related to (dynamical) behaviours of the whole system, rather than to specific components (e.g., a specific sensor or actuator).Hence, they can be detected from observations acquired by multiple sensors and spanning some time intervals, rather than from single instantaneous observations.This requires the analysis of multivariate time series to identify behavioural anomalies.The literature about detectors for anomalous (dynamical) behaviours is still in its early stages and further efforts are needed to meet the requirements of autonomous robotic systems and CPSs.
In this work we focus on Hidden Markov Models (HMMs) [2] as they represent a powerful and widely used mathematical model for learning robot behaviour [3] and for encoding noisy time series [4].Moreover, HMMs are known to be robust to spatio-temporal variations and can successfully model intelligent systems behaviours in several LTA contexts, where similar sequences of actions (tasks) are typically repeated multiple times [5], [6].In particular, in [7] an online approach has been proposed for detecting anomalous behaviours of robot systems involved in complex LTA scenarios.The methodology uses HMMs to model the nominal (expected) behaviour of a robot and the Hellinger distance [8] to evaluate the dissimilarity between the probability distribution of subsequences of observations (i.e., multivariate sensor time series) in a sliding window and the emission probability of the related HMM hidden states.The advantage of using such a distance measure instead of standard measures (e.g., the likelihood of observation subsequences) is twofold: first, the Hellinger distance is bounded and thus lends itself to simpler interpretation and thresholding; second, it is less noisy, hence more informative and discriminative [7].For simplicity, in the following we refer to the online algorithm proposed in [7] as HMM-Hellinger-based Anomaly Detector (HHAD).
A key issue for behavioural anomaly detection in robotics and CPSs is the lack (or paucity) of anomalous examples and the noise that characterizes data in these contexts.To address this issue, in this paper we propose an adversarial data augmentation and retraining approach for HHAD (called HHAD-AUG).Following the recent promising trends of adversarial example generation and adversarial attack generation for machine learning models [9], we base our data augmentation method on adversarial examples, namely, perturbed time series [10], [11], [12], [13], that have the advantage of not requiring any prior knowledge about the application domain and data conformation.In particular, we generate adversarial examples for nominal points, the only knowledge available at training time, using two algorithms.The first algorithm (called H-ADV) uses a loss function 1 based on the Hellinger distance between the observed and the expected data distributions (i.e., model parameters) and the second (called L-ADV) uses a loss function based on sample likelihood.We mathematically derive the gradients of both loss functions and present a procedure to use them to augment the original dataset.We then use adversarial data augmentation to improve the detection performance of HHAD (in terms of F1-score) when limited amounts of training data is available, but we show that the methodology we propose achieves performance improvements also when models are trained on large datasets.Notice that we focus on anomaly detection as a one-class classification problem, in which examples of the anomalous class are not considered in the learning phase, hence the only way to perform data augmentation is to generate new nominal examples.
In contrast to the literature on generating adversarial examples, we focus on time series rather than images and we consider HMMs rather than deep neural networks.Recently, an approach for performing adversarial attacks on (univariate) time series classifiers was proposed [12], but it is based on neural network classifiers, hence it requires large datasets.Our method employs the definition of adversarial attacks for time series used in [12] but it is oriented to HMM-based anomaly detectors, which achieve strong results on multivariate time series also on small datasets [7].Moreover, the proposed method focuses on anomaly detection as one-class classification, a key difference with respect to [12].
We evaluate our data augmentation and retraining approach on four public datasets, three known real-world benchmarks for anomaly detection in robotic systems and CPSs, i.e., Tennessee Eastman [14], SWaT [15], and ALFA [16], and one created by the authors using aquatic drones developed in a EU H2020 project, i.e., INTCATCH [17].The experimental evaluation of the proposed approach shows that (i) H-ADV and L-ADV can generate meaningful adversarial examples for HHAD; (ii) HHAD-AUG can employ these new examples to significantly improve the performance of HHAD; (iii) using examples from both H-ADV and L-ADV outperforms state-of-the-art augmentation methods; (iv) using examples generated by H-ADV is better than using examples generated by L-ADV (i.e., the former wins on three datasets and ties on one), hence we consider H-ADV as the best method to generate adversarial examples for data augmentation of HHAD; (vi) the low computational complexity of H-ADV and the high parallelizability of L-ADV allow for a fast data augmentation and retraining of HHAD.The generated examples are guaranteed to have small distance from the original examples, according to the definition of adversarial examples for time series [12].In summary, the main contributions of this work to the state-of-the-art are the following: r we propose an algorithm able to generate adversarial ex- amples for a one-class anomaly detector based on HMMs and working with multivariate time series (Section IV-A); 1 Throughout the manuscript we use term "loss" as a synonym of "anomaly score".r we propose an algorithm for data augmentation based on adversarial examples which improves the performance of the anomaly detector (Section IV-B); r we evaluate, with positive results, both adversarial gener- ation and data augmentation on four datasets of multivariate sensor signals acquired from autonomous robots and industrial CPSs (Section V).

II. RELATED WORK
Three research topics are mainly related to our work: data augmentation, adversarial example generation, and anomaly detection in autonomous robots and CPSs.
Data Augmentation: Data augmentation techniques are used to increase the amount of available data by adding slightly modified copies of already existing data with the aim of regularizing and reducing overfitting when training a machine learning model.In [18], data augmentation methods for images are surveyed.The main goal of data augmentation, both for images and for other types of data, is to prevent class imbalance and model overfitting due to data limitations, by adding synthetic examples to available datasets [19].
Time series data augmentation is not an established practice.The majority of the state-of-the-art approaches for time series do not use data augmentation, and the first surveys on these techniques have been presented only recently [20], [21].Methods related to adversarial training are still not considered in the literature related to time series data augmentation.Most data augmentation techniques for time series [20], [21], [22], [23], [24] are instead based on random transformations, such as, addition of random noise, slicing, cropping, scaling, random warping in the time dimension, and frequency warping.
Adversarial data augmentation, also called adversarial training [25], is the process of augmenting a dataset using adversarial examples (a.k.a.adversarial attacks) to achieve two main goals, namely, making classifiers more robust to adversarial attacks and reducing their test error on clean inputs, i.e., improving the accuracy of the classifier on the test set.This practice has been very recently applied to image classifiers [18], where adversarial data augmentation has been used to improve generalization to unseen domains.In [26], an iterative procedure is proposed to augment a dataset with examples from a fictitious target domain that is hard under the current model.The methodology is inspired by recent developments in distributionally robust optimization and adversarial training.For instance, in [27], a good performance under adversarial input perturbations is guaranteed by considering in the learning optimization problem a Lagrangian penalty for perturbing the data distribution in a Wasserstein (a distance over probability distribution) ball.In [28], it is proposed a novel regularization term for adversarial data augmentation in deep neural networks for image classification.The methodology extends [26] and reduces to it when the maximum-entropy term is discarded.These approaches show that adversarial training can effectively help when searching for augmentations [18].
Our method applies these principles to a specific one-class classification problem (i.e., anomaly detection), for time series data instead of images.We apply, in particular, adversarial data augmentation to the HMM-based detector presented in [7].The adversarial-based strategy that we use to generate synthetic examples makes our methodology deeply different from traditional time series augmentation methods, since we do not need prior knowledge about the application domain to generate data transformations.Furthermore, our strategy allows to improve detection performance on both the original test set and the adversarial attacks generated from the test set, hence it improves also the robustness to adversarial attacks.
Adversarial Example Generation: Adversarial examples for classification models are investigated in [29] and the analysis is specialized to neural networks for image classification in [30], where authors notice that "imperceptible non-random perturbations applied to a test image can change the network prediction".An optimization procedure called box-constrained Limitedmemory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) is proposed in [30] to compute adversarial perturbations of images given network parameters.To overcome some time-complexity issues of this method, another approach called Fast Gradient Sign Method (FGSM) has been proposed [9].It produces sub-optimal adversarial examples, in terms of distance from the original example, but being very fast it has quickly become popular and has inspired other approaches.Two methods that produce smaller perturbations than FGSM while ensuring good efficiency are Deepfool [31] and the Carlini-Wagner method [32], that use iterative procedures based on local linearization of the classifier function.A theoretical framework for analyzing the robustness of classifiers to adversarial perturbations is proposed in [33].Adversarial training is used as a regularization method for supervised and semi-supervised learning of neural networks in [34].A method for generating universal adversarial perturbations is presented in [35].Surveys on adversarial attack methods are proposed in [36], [37].
Methodologies for generating adversarial attacks on time series are proposed in [10], [11], [12], [13].A strategy based on Adversarial Transformation Networks (ATNs) is used, in particular, in [11], [12] to generate adversarial attacks on a target classifier of time series via a student model trained using standard model distillation techniques.The target classifier can be a fully convolutional neural network or a 1-nearest neighbor classifier with Dynamic Time Warping.The ATN takes in input a time series and its gradient with respect to the softmax-scaled logits of the target class predicted by the attacked classifier, and returns a perturbed time series that represents a possible adversarial example.If the classifier being attacked is unknown (i.e., black-box attack) or it is 1-nearest neighbor with Dynamic Time Warping (i.e., white-box attack on a non derivable classifier), then the gradient cannot be computed.In these cases, the attack is performed on the student model which is a neural network that imitates the classifier and it is derivable.In [13], ATNs are extended with autoencoders to attack multivariate time series classification models.
In our work, we consider the same definition of adversarial attacks used in [11], [12], [13] but the approach we propose has a different objective and uses a different methodology.We propose a data augmentation technique based on adversarial examples for improving HMM-based anomaly detectors that work on multivariate time series, while [11], [12], [13] propose new methodologies for generating adversarial attacks on time series.We derive the gradient of the specific loss function of the anomaly detector, hence our method generates adversarial examples directly on the detector, not on a neural network that approximates the detector.Also the target model is different: in our case it is an anomaly detector trained using only nominal examples, while in the literature examples of all classes are considered to be available both in the training phase and in the adversarial attack generation phase.The generation of adversarial attacks has been studied also in the context of Natural Language Processing [38], [39], where classification models are sometimes similar to those used for time series classification.However, to the best of our knowledge all the methodologies proposed so far work with deep neural network models.
Anomaly Detection in Autonomous Robots and CPS: Anomaly detection approaches for robotic systems and CPSs can be divided into three main categories: model-based, knowledgebased, and data-driven.Online data-driven methods are getting more and more popular for autonomous robots and modern industrial CPSs.Only few methods deal with anomalies in system behaviours.Chandola et al. [40] refer to this type of anomalies as contextual faults since they are originated by observations in specific contexts.In other words, an observation can be considered anomalous in a specific context but nominal in another context [41], [42].Recently, an approach has been proposed for detecting anomalous process behaviours generated by stealthy-attacks in CPSs and, in particular, in industrial control systems [43].Other approaches for the same application domain have been recently presented [44], [45], some of them based on deep learning [46], [47], [48], [49], [50], [51], [52].The anomaly detector that we aim to improve by means of data augmentation, called HHAD in this paper, considers contextual anomalies (namely, observations that are infrequent in specific contexts) but differs from other approaches in the literature because it represents contexts as maximally frequent HMM states in a time window instead of as sets of past observations [7].The approaches that most resemble HHAD are [53] and [54], in which HMMs are trained using multimodal sensory signals for detecting anomalies in assistive robots.At run time, the trained HMMs provide likelihood scores for data inside a window, which are compared to an adaptive detection threshold to identify putative anomalies.HHAD substitutes the likelihood estimation used in these approaches with the Hellinger distance which is a more informative and interpretable measure.While other probabilistic distances [55] have been recently proposed for anomaly and change detection, we focus on the Hellinger distance since it is part of the original HHAD approach [7] and, as discussed there, it is bounded and little noisy.

III. BACKGROUND AND NOTATION
In this section, we informally define the problem addressed in this paper, the HMMs and Hellinger distance, and we describe the HHAD algorithm.The main strategies for adversarial example generation and data augmentation are finally presented.

A. Problem Definition
Given the online one-class HMM-based anomaly detector HHAD, introduced in [7], and trained on a dataset of nominal multivariate time series, our goal is to improve its performance by augmenting the training set with adversarial examples.Moreover, we aim at improving the robustness of the detector to adversarial attacks.

B. Hidden Markov Models
We use HMMs [2] as a probabilistic model for the system that generated a given d-dimensional time series X = x 1 , . . ., x n where x t = [x 1 t , . . ., x d t ], t ∈ {1, . . ., n}, is the multivariate observation x t at time t.An HMM is a statistical model in which the system being modeled is assumed to be a Markov process with K hidden states.The mathematical notation λ = {π, A, B} is used to represent an HMM, where π is the set of state transition probabilities (i.e., a ij is the probability to move from state s i to state s j ), and B = {b i (x t )} K i=1 is the set of the probability distributions over observations in each state (emission probabilities).In our setting, we assume a multivariate Gaussian distribution for the emission probabilities, which means that , where μ i and Σ i are the mean and the covariance matrix for state s i , respectively.To find the parameters of the HMM λ that maximize the fit (likelihood) to an observed (sub)sequence X we use the Baum-Welch algorithm, while to compute the optimal HMM state sequence (known as Viterbi path) that best explains a given observed (sub)sequence X we use the Viterbi algorithm.The number of hidden states and the covariance type of an HMM can be found by minimizing the Bayesian Information Criterion (BIC), which finds a trade-off between maximizing the likelihood of the training data with respect to. the model and minimizing the number of parameters required (i.e., the number of hidden states) [56].

D. The Online Anomaly Detector HHAD
The nominal behaviour of the system whose anomalies should be detected is modeled as an HMM λ N that is trained using the Baum-Welch algorithm from a sequence X of observations collected during the nominal functioning of the system.The number Algorithm 1: Anomaly Detection (HHAD) [7].
of hidden states and the covariance type are selected by minimizing the BIC.Online anomaly detection at time step t is performed by means of a sliding window W t = x t−w+1 , . . ., x t of length w (columns) and with observations that contain d variables (rows), hence W t ∈ R d×w .For each example W t , an anomaly score is computed and, when the score exceeds a predefined threshold τ , the behaviour is considered anomalous.The score is the Hellinger distance between the estimated distribution of the observations corresponding to the state ŝt occurring most frequently in the Viterbi path S t = s t−w+1 , . . ., s t of window W t and the emission probability of state ŝt of λ N .See details in [7].
HHAD, formalized in Algorithm 1, receives the current window of multivariate data W t , the nominal HMM λ N with its number of hidden states K, the window size w and a threshold τ ∈ R + .The returned value is a label a t ∈ {0, 1} representing the class of W t , namely, a t = 0 for nominal examples or a t = 1 for anomalous examples.Algorithm 1 is run for each window W t that has to be classified.It first computes the Viterbi path of the multivariate time series in W t (line 1).For the state ŝt occurring most frequently in the Viterbi path (line 2) a multivariate Gaussian distribution N (μ, Σ) is fit through maximum likelihood to the corresponding data observed in the window (lines 3-5).Then the Hellinger distance is computed (using (1)) between N (μ, Σ) and the emission probability of state ŝt in λ N (line 6).If the distance is larger than τ , then an anomaly is reported (line 7).Otherwise, label 0 is returned (line 9).Usually this algorithm is run with a sliding window W t which shifts of one time instant at a time.

E. Adversarial Example Generation
A formal definition of (minimal) 2 adversarial perturbation for a q-dimensional object ξ is provided in [31] as the minimal perturbation η that is sufficient to change the label ŷ = f (ξ) estimated by a classifier f : R q → {1, . . ., k}, q ∈ N for example ξ.The L p -distance of such a minimal perturbation is therefore This definition requires a distance metric L p = • p to quantify the similarity of the adversarial example to the original one.The most used distances are L 0 , i.e., the number of coordinates changed by the perturbation, L 2 , the standard euclidean distance between the original and the perturbed example, and L ∞ , the maximum change among all coordinates [32].Different distances L p have different impacts on the classification of the perturbed example.The usage of a distance metric that reliably captures the ground truth similarity is fundamental to generate good adversarial examples 3 .In the context of time series data, the ground truth of an example cannot be provided by human perception, hence it must be known in advance to evaluate the perturbed examples.Adversarial examples are then defined as slight perturbations of original examples that produce a misclassification with respect to the ground truth.Since the ground truth is not available for all examples, we use the definition of [11], [12], [13], namely, the label predicted by the classifier is assumed to be the ground truth and adversarial examples are defined as examples whose predicted class label is different from the ground truth label.Fast Gradient Sign Method (FGSM) [9] is the foundation for the approach we propose in this paper.It is untargeted, since it does not allow to specify the target label, and optimized for the L ∞ distance metric.This method does not guarantee to return the closest adversarial examples but it is faster than L-BFGS [30].Given an example ξ, FGSM searches for an example ξ = ξ + • sign(∇ ξ loss f (ξ, y)), where ∇ ξ is the gradient function over variable ξ.Given the parameters θ of the classifier f , an example ξ, its original class y in the training set, and the cost function used to train the classifier loss f , an adversarial example is obtained by linearizing the loss function around the current value of θ and moving (in R q ) in the direction that maximises the loss of f .The time performance can be improved by efficiently computing the gradient using backpropagation.The adversarial example generation we propose in this paper (Section IV-A) starts from examples ξ that are time series X and generates new examples ξ that are new time series X , with the goal of having X classified differently from X by the HHAD classifier f .

F. Data Augmentation
The data augmentation problem consists in extending a dataset by adding new examples to improve the performance of a model trained with that dataset.Standard approaches for time series data augmentation have been discussed in Section II.Since usually new examples are generated from the original ones, the problem can be formalized as the generation, from original examples ξ, of perturbed examples ξ + η that maximize the performance of model f .In the supplementary material, available online (Section 3.1) we describe four methods for time series data augmentation [21] that we use as baselines to evaluate the performance of our method.They are, Random Data Augmentation (R-AUG), Drift Data Augmentation (D-AUG), Gaussian Data Augmentation (G-AUG), and Synthetic Minority Over-sampling Technique (SMOTE) [19] (S-AUG).We select these methods as baselines because they are approaches that do not require specific domain knowledge, similarly to our adversarial example generation method.

IV. PROPOSED METHODS
We present two approaches for adversarial example generation on HHAD.The first uses a loss function based on the Hellinger distance; the second a loss function based on the likelihood.Then, we introduce our adversarial data augmentation methodology.

A. Adversarial Example Generation: H-ADV and L-ADV
Our approach for adversarial example generation is outlined in Fig. 1 and formalized in Algorithm 2 (Hellinger-based loss) and Algorithm 3 (Loglikelihood-based loss).As shown in Fig. 1, the main idea is to take an example4 X, i.e., a slice of the multivariate time series X, and to pass it to the adversarial example generator (Algorithms 2 or 3, called H-ADV and L-ADV, respectively, in the following) which uses some elements of HHAD (i.e., λ N and τ ) to generate the perturbation X + η.This perturbation is called adversarial if it actually changes the class of X, as done in the literature [11], [12], [13].
H-ADV: Algorithm 2 describes H-ADV, which uses a loss function based on the Hellinger distance.It receives five inputs, namely, i) a nominal example 5 , i.e., a slice of a multivariate time series X ∈ R d×w , where w is the length of the considered window and x t a d-dimensional observation at time t, ii) the nominal HMM λ N used by HHAD, iii) a parameter ∈ R + representing the maximum perturbation size for each element of X (which should be kept as small as possible), iv) the number of steps c ∈ N in which the interval [0, ] is divided, v) the threshold τ for the Hellinger distance used by HHAD.The algorithm returns an example X = X + η ∈ R d×w close to X (i.e., inside the hypercube with side 2 centered in X) and perturbed in a direction which facilitates the change of class, i.e., f (X ) = f (X).Remember that HHAD is a function f : R d×w → {0, 1}, where 0 is the class for nominal behaviours and 1 that for anomalous behaviours (see Algorithm 1).We do not consider adversarial examples generated from anomalous points since our goal is to augment the training set, which contains only nominal data.
Given the example X, whose class is assumed to be 0 (i.e., nominal example), Algorithm 2 first computes the direction of the perturbation as the sign of the maximum loss increment sign(∇ X H 2 (b N ŝt , N (μ, Σ))) (lines 1-7), following the strategy of FGSM.The classifier f is HHAD, hence it combines the application of the Viterbi algorithm and the threshold on the Hellinger distance between data distribution in X and distribution of the HMM emission model, to determine the class of the example.The loss function has in general a complex form in this case, because it depends on the maximally frequent state in X.The main obstacle in the computation of the derivative of the loss function is related to the solution of the inverse Viterbi problem, which is NP-complete [57].Intuitively, it is very complex to identify the point X of maximum increase of the loss function in the -neighbourhood of X because the reference emission model b N ŝt in that neighbourhood can change if the most frequent hidden state ŝt changes.To overcome this issue, we compute the gradient in X and assume it does not change in a small c -neighbourhood of X.Hence, we move in this neighbourhood following the direction of the gradient in X and then we iterate this procedure from the new point X reached from X (lines 9-27).Namely, in X we re-compute the gradient of the loss in X and we move according to it.The algorithm in this way adapts the gradient of the loss function to the reference emission model that can change inside the -hypercube with side 2 centered in X.The gradient of this loss function can be expressed in closed form when the covariance matrix is diagonal (see details in Section 2.1 of the supplementary material, available online).The process is iterated until the border of the hypercube is reached, or the Hellinger distance of the perturbed example starts to decrease, or the Hellinger distance exceeds the threshold τ (i.e., is X classified as anomalous).Not all the perturbed examples change their class; only the perturbed examples that change their class 6 are adversarial examples.
Fig. 2 provides a graphical overview of the strategy implemented by Algorithm 2. The algorithm computes the final perturbed X by iteratively performing two macro-steps.First, it moves in the direction of the gradient of the loss function.Second, if the reference emission model changes in the path, then it recomputes the gradient based on the new emission

Algorithm 3: Adversarial Example Generation Based on Likelihood (L-ADV).
Input: X ∈ R d×w ← original nominal example λ N ← nominal HMM ← maximum perturbation size Output: X ∈ R d×w : perturbed example 1 g = sign(∇ X P (X, λ N )) // likelihood gradient in X 2 X = X − • g // X update 3 return X model.The point is perturbed until it reaches the border of the hypercube or the decision boundary of the anomaly detector (i.e., a point in which the example is classified as an anomaly).In the picture the most frequent state in example X is ŝt = 1, hence the first perturbation is computed according to gradient ∇ X H 2 (b N ŝt =1 , N (μ, Σ)), that uses the emission model of the first hidden state as a reference.Then, a change of the most frequent state to ŝt = 2 occurs in X (1) , hence, the gradient is there recomputed according to the parameters of the emission model of that state, namely, ∇ X H 2 (b N ŝt =2 , N (μ (1) , Σ (1) )) where μ (1) and Σ (1) are computed in X (1) , and that gradient is followed from X (1) until the most frequent state changes again in X (2) to ŝt = 3.Again, the gradient is recomputed according to the emission model of state ŝt = 3 and it is followed until the decision boundary is reached in X (3) .That point represents the final perturbation of X, which is an adversarial example since X is classified as anomalous.
L-ADV: The second algorithm for adversarial example generation that we propose is applied to the same detector HHAD but it generates adversarial examples following the gradient of the likelihood of the example X instead of the gradient of the Hellinger distance.This algorithm for adversarial example generation is called L-ADV and is formalized in Algorithm 3. It avoids the mathematical calculation of the gradient of the loss function needed for the case of the Hellinger distance.With the likelihood, in fact, the loss does not depend on the maximally frequent state in the window, but it can be computed using the standard forward algorithm [58].The gradient of the likelihood can be recursively calculated.Note that Algorithm 3 does not require the c steps of the case of the Hellinger distance, because it does not matter if the maximally frequent state in the window changes or not.Given an observation X, first it computes the sign of the gradient of the likelihood of the example given the nominal HMM (line 1), namely sign(∇ X P (X, λ N )) (see details in Section 2.2 of the supplementary material, available online).Then it moves the example in the direction of the gradient for a step in each dimension (line 2).The algorithm returns the perturbed example X .Notice that the perturbed example X is an adversarial example only if the Hellinger distance between the maximally frequent observations in X and the emission model of the maximally frequent state in X is larger than threshold τ , namely, only if HHAD classifies X as anomalous according to its parameters λ N and τ .
Complexity analysis: The time complexity of Algorithm 2 is O(c•(wK 2 + wd)), where c is the number of steps per state, O(wK 2 ) is the computational cost of the Viterbi algorithm, and O(wd) is the cost of performing a gradient step (cost O(1)) on each dimension and time step of the window.The time complexity of Algorithm 3 is O(K 2 w 2 d).Being the gradient of the HMM likelihood not computable in closed form but only recursively, its time complexity is O(wK 2 ), which is considerably higher than the complexity of the computation of the gradient of the Hellinger distance, which is O(1) when computable in closed form.As a consequence, the computational complexity of Algorithm 3 is quadratic in the window length, which results in a considerable increase of the running time (i.e., it is ≈ 400 times slower than Algorithm 2).Fortunately, Algorithm 3 is quite parallizable, in fact, after re-implementing it in Cython we managed to achieve a running time similar to that of Algorithm 2 (see Section V).The complexities of the proposed algorithms are summarized in the supplementary material (Section 1 and Table 1), available online.
Remark: As discussed later, H-ADV empirically shows the best performance but its gradients are not derivable in closed form for HMM with non-diagonal covariance matrices of emission distributions.L-ADV should be employed in these cases.

B. Data Augmentation and Retraining: HHAD-AUG
Adversarial generation procedures H-ADV and L-ADV are here integrated in a technique for data augmentation and retraining called HHAD-AUG.Since the two adversarial generation procedures are used in a mutually exclusive way we refer to the data augmentation using H-ADV as H-AUG and to the data augmentation using L-ADV as L-AUG.Algorithm 4 formalizes the proposed approach.It aims at improving the performance of the anomaly detector HHAD and its robustness to adversarial attacks.Inputs are the nominal time series X used to train HMM λ N , the original HMM λ N (having K hidden states), the window size w, the loss function loss f (i.e., based on Hellinger distance or on likelihood), the threshold τ for the Hellinger distance, the maximum perturbation size , the number of steps c in which is split during adversarial generation, and the number of times The augmented dataset Ŵ is first initialized to the set of examples in the training set generated by covering the complete time series X with a sliding window of length w (line 2).Similarly, the augmented nominal HMM is initialized to the original nominal HMM (line 3) and the augmented threshold to the original threshold (line 4).Then the augmentation loop is iterated m times.The steps of this loop are described in the following.For each training example W t = x t−w+1 , . . ., x t in the training set W (line 7) a perturbed example W t is generated using algorithm H-ADV or L-ADV (lines 9 and 11) depending on the loss function loss f chosen for adversarial generation.The perturbed example W t is added to the augmented set Ŵ as a nominal example only if it is classified as an anomaly by HHAD (lines 13-15).We use label 0 for nominal Complexity analysis: The computational complexity of Algorithm 4 is O(m • (|X| − w) • ADV ), where ADV is the complexity of one of the adversarial generation algorithms, i.e., H-ADV or L-ADV.The computational complexity is linear in the number of iterations, in the size of the initial training set, and in the complexity of the adversarial example generation algorithm.The empirical evaluation shows that the proposed approach can scale to realistic scenarios.

V. EXPERIMENTAL RESULTS
Results of application of our approach are presented for four application domains related to robotic and cyber-physical systems.We evaluate the performance improvement (measured as F1-score) of the augmented detectors on different training set sizes and provide insights about the mechanisms that generate the improvement.The adversarial examples generated by H-ADV and L-ADV are shown to be very similar to original examples (for images, we would say that they are indistinguishable) and to yield performance improvement if added to the training set.In fact, perturbations of the same intensity but performed using (non-adversarial) baseline augmentation methods are not able to achieve the same performance improvement.The Python code of the experiments is available 7 .Software and hardware details are provided in Section 4 of the supplementary material, available online.

A. Experimental Setting
Given a specific application domain and a related dataset D containing multiple time series for a process of interest (in which each time series has been standardized), our main results show that the proposed data augmentation techniques H-AUG and L-AUG outperform the baseline augmentation techniques R-AUG, D-AUG, G-AUG, and S-AUG.We also show that HHAD performs as or better than state-of-the-art detectors based on vanilla autoencoders (AEs), LSTM autoencorders (LSTM-AEs), and one-class support vector machines (OCSVM) (see details in Section 3.2 of the supplementary material, available online), hence the performance of our augmented detectors H-AUG and L-AUG exceeds not only that of HHAD but also the state-of-the-art.
We assume to have a nominal HMM λ N with K hidden states (chosen by optimizing the BIC) and trained on a training set of nominal time series X which is part of dataset D. Another part of D, called T (i.e., test set) in the following, is used to evaluate the performance of the anomaly detection and the data augmentation and retraining algorithms.We also assume a specific window size w and a threshold τ learned on X as described in [7].With these three elements, namely, λ N , w, and τ , a complete instance of the original anomaly detector HHAD is available.Notice that, when the number of variables in the dataset is high, feature selection or dimensionality reduction could be necessary to obtain good performance of the original anomaly detector.The main dimensions of analysis that we consider are the type of loss function used to generate adversarial examples (i.e., Hellinger distance or likelihood) and the size of the training set |X|.Table I summarizes the parameters of all the experiments described in the following subsections.The number of repetitions #rep of each test on different training sets of the same size is set to 30 in all domains.It means that, given a training set size |X| we recompute the performance improvement 30 times, and each time we train the HMM λ N and the threshold τ on different training sets of size |X|.

B. Performance Measures
Anomaly detection performance of algorithm HHAD is evaluated by F1-score [56] (Cohen's κ-score and recall are also reported in Section 3 of the supplementary material, available online) on a test set T which is kept separated from the training set used to learn λ N and τ .We consider positive the nominal examples and negative the anomalous examples.Hence, true positives are nominal examples correctly classified by HHAD, true negatives are anomalous examples correctly detected by HHAD, and so on.The value of the F1-score must be maximized.
Data augmentation and retraining algorithms H-AUG and L-AUG, using loss functions based on the Hellinger distance and the likelihood, respectively, are evaluated by two measures.First, we compute the improvement of F1-score on the test set induced on HHAD by data augmentation and retraining.This is the difference between the F1-score on the test set after and before data augmentation and retraining.To test the statistical significance of this difference we apply algorithm HHAD-AUG to several training sets containing different examples and with different size (see results in the next sections) and then use Student's t-test for testing the null hypothesis that the average performance on test sets are significantly improved.Then we evaluate the improvement of the robustness to adversarial attacks introduced by HHAD-AUG.We measure it as the difference in percent success rate SR% between original and augmented HHAD.The percent success rate (SR%) of a detector is the percentage of perturbed examples (on the test set) that are actually misclassified by the detector.A small SR% means that the detector is robust to adversarial attacks generated on the test set.Differences in percent success rate SR% between HHAD and HHAD-AUG are shown in Tables 3, 7, 11, and 15 of the supplementary material, available online.

C. Tennessee-Eastman Industrial Process (TE)
In this domain a synthetic model of a real industrial chemical process is used to evaluate process control strategies [14].
Domain and dataset description: This domain has become popular in the Industrial Control System (ICS) security community because it allows to test attack and defense approaches on a realistic (although simulated) environment.We used the dataset provided by [43] which contains integrity attacks on both sensors and actuators.We use data related to the stealth attack named SA1.They have 41 variables and 4801 observations.A label is available for each example, namely, 0 for nominal observations and 1 for anomalous observations.The training set is generated taking a slice of sequential nominal observations of length |X| ∈ {250, 500, 750, 1000, 1250, 1500} (see Table I Experimental parameters: For each training set, we reduce the dimensionality to the first 4 principal components (using PCA).The number of hidden states K of the nominal HMM λ N is then selected by BIC in the interval [2,15].Diagonal covariance matrix is used.The window length is w = 100 and the maximum perturbation size is = 0.05.The number of iterations of the data augmentation and retraining procedure is m = 3 (see Table I).
Results: Fig. 3 shows the main results.The x-axis represents the training set size |X| and the y-axis the F1-score on the test set.The blue solid line is the original detector HHAD with related 95%-confidence interval (shaded area).Notice that this detector is exactly the one that we use to generate adversarial examples in the initial iteration of data augmentation.Dashed lines with other colors represent different data augmentation strategies, namely, orange is H-AUG, green is L-AUG, red is R-AUG, purple is D-AUG, brown is G-AUG, and pink is S-AUG.The average performance improvement achieved by H-AUG is statistically significant for all training set sizes.In particular, the F1-score of the augmented anomaly detector (i.e., HHAD with λN and τ ) and show that it is higher than that of the original detector (i.e., HHAD with λ N and τ ).L-AUG provides a statistically significant improvement only for |X| ∈ {750, 1000, 15000}.A motivation for this is provided below.Interestingly, HHAD augmented by baseline methods do not achieve any significant performance improvement with respect to the original HHAD.Overall, these results show that the proposed adversarial data augmentation strategy is effective in this application domain.
Table II provides a quantitative evaluation of the performance improvement achieved by each data augmentation method with respect to the original detector HHAD.The first two rows show, respectively, the average F1-score μ F 1 and related standard deviation σ F 1 for HHAD.Both statistics are computed over 30 repeats for each training set size.Then, for each data augmentation algorithm we show in the white rows the average F1-score and standard deviation, and in the gray rows the difference of average F1-score Δ F 1 (i.e., augmented detector minus original detector) and the p-value of the Student's t-test (p-val) for the difference in the average F1-scores.We consider performance improvements statistically significant when the p-value is less than 0.05.These values are highlighted in bold in the table.The improvement of H-AUG is large with small training sets and it decreases as the training set size increases, as expected, hence our methodology could be effectively used in applications with small amounts of data to improve the detection performance.For instance, the average F1-score (μ Comparison with state-of-the-art anomaly detectors: Fig. 4 shows the average F1-score of the original anomaly detector HHAD, the augmented detectors H-AUG and L-AUG, and the three state-of-the-art (non-augmented) detectors AE, LSTM-AE, and OCSVM (see details in Section 3.2 of the supplementary material, available online), for different dataset sizes.Clearly, H-AUG outperforms all methods.L-AUG have slightly better performance than HHAD but worse than H-AUG.The detectors based on artificial neural networks have very bad performance in this case, because of the small size of the dataset and specific data structure.OCSVM starts to increase its performance from 1000 examples but it does not reach the F1-scores of HHAD.Further details are provided in Section 3.3 of the supplementary material, available online.

D. Secure Water Treatment Testbed (SWaT)
As a second test on an industrial CPS we use SWaT, a scaled-down version of a real-world industrial water treatment plant [15].We report the description of domain and experiments, and full results in the supplementary material (Section 3.4), available online.Results are successful also in this case.The average performance improvement achieved by both H-AUG and L-AUG with respect to HHAD are statistically significant for all training set sizes.Baseline augmented detectors R-AUG and D-AUG in this case manage to improve the average F1-score on the test set only for the smaller datasets but the improvement is smaller than that achieved by H-AUG.For larger training set sizes these two baseline methods achieve negative or null improvement, while the proposed methods always get significant improvement.G-AUG and S-AUG do not achieve significant improvement with respect to.HHAD.The improvement of H-AUG keeps almost constant with the increase of the training set size, showing that the F1-score can still increase also starting from relatively large datasets (i.e., 2500 examples).Also in this case the gain is relevant, since the F1-score obtained by the detector augmented with H-ADV on 5000 examples is larger than the F1-score obtained by the original detector HHAD using 20,000 examples, with a "saving" of about 15,000 examples.The average improvement of success rate Δ SR% of adversarial attacks on the test set is always negative, meaning that the data augmentation improves the robustness of the detector to adversarial attacks.

E. UAV Fault and Anomaly Detection (ALFA)
The third domain is a robotic one related to Unmanned Aerial Vehicles (UAV).The dataset 8 presents several fault types in control surfaces of a fixed-wing UAV for use in Fault Detection and Isolation (FDI) and Anomaly Detection (AD) research [16].It includes processed data for 47 autonomous flights with 23 sudden full engine failure scenarios and 24 scenarios for seven other types of sudden control surface (actuator) faults, with a total of 66 minutes of flight in normal conditions and 13 minutes of post-fault flight time.The platform used for collecting data is a custom modification of the Carbon Z T-28 model plane.The average performance improvement achieved by H-AUG with respect to HHAD on the ALFA domain is statistically significant for all training set sizes.Fig. 5 graphically shows these results.L-AUG has on average better F1-score than HHAD and the difference is statistically significant for all training set sizes except for |X| = 250, however the amount of the improvement is less than that achieved by H-AUG.Baseline methods R-AUG, D-AUG G-AUG, and S-AUG do not achieve any significant performance improvement except for a small improvement obtained by D-AUG on |X| = 1500.In this case the larger improvement is achieved by H-AUG on small and medium-size datasets, with a maximum improvement of the F1-score of 0.313 for |X| = 500.Out of the four, this is certainly the most difficult dataset containing complex behaviours of a real autonomous system, in fact, anomalies are often not recognizable at human inspection.Nevertheless, H-AUG manages to strongly improve the anomaly detection performance, reaching F1-score up to 0.909 with the larger training sets considered in our analysis.Section 3.5 of the supplementary material provides full details about the results of experiments performed on this domain, available online.

F. Water Monitoring With ASV (INTCATCH)
The fourth domain is again related to mobile robots but in this case we consider an Autonomous Surface Vessel (ASV).The dataset [17] has been generated by an ASV called Platypus which operates in the context of the INTCATCH Project9 , an EU H2020 project aiming at developing new paradigms for water monitoring in river and lakes.Also in this case the average performance improvement of both H-AUG and L-AUG with respect to HHAD and the baseline data augmentation techniques is statistically significant for small training set sizes while it becomes negligible for larger sizes.The robustness against adversarial attacks on the test set also improves.A detailed description of the domain, the experiments, and the results is reported in Section 3.6 of the supplementary material, available online.

VI. CONCLUSION
Detection of anomalous behaviours in intelligent systems that operate in the physical world, such as autonomous robots and CPSs, requires tools able to represent nominal behaviours and discover patterns that do not match with them in sensor traces.HMMs are a viable and established tool for modeling dynamical behaviours contained in multivariate time series and recent methods use them to detect anomalous behaviours in robotic systems.The approach of adversarial data augmentation presented in this work improves the detection performance of such tools without using any prior knowledge about the form of the nominal time series, as traditional data augmentation requires.
The adversarial examples we generate are multivariate time series very similar to the original but able to produce a significant performance improvement (up to 0.313 of F1 improvement in our empirical tests, see Δ F 1 of H-AUG on ALFA with training set of 500 samples) if used to augment the training set of our detector.The same examples improve also the robustness of the detector to adversarial attacks (with an increase up to 8.6% in our experiments, see Δ SR% of H-AUG on TE with training set of 1000 samples).
This paper paves the way for future work along several directions.We will concentrate on three main extensions.The first concerns the introduction of other time series distance measures in the adversarial example generation strategy.The second is related to the application of the proposed approach to other types of anomaly detectors, such as autoencoders or one-class Support Vector Machines.The third extension focuses on the application of adversarial data augmentation to active anomaly detection, which aims to make the system controller actively looking for possible anomalies to detect them more precisely and promptly.
m that adversarial examples are generated on the training set.Outputs are the augmented training set of nominal examples Ŵ, the augmented nominal HMM λN (trained on Ŵ), and the augmented threshold τ learned from λN and W (the original training set generated from X).

Algorithm 4 :
Adversarial Data Augmentation and Retraining (HHAD-AUG).and 1 for anomalous examples, hence the output of HHAD can be 0 or 1 and in line 14 y == 1 means that the window W t has been labeled as an anomalous.When all examples in the training set have been perturbed, the nominal HMM is retrained using the augmented dataset Ŵ (line 18) which contains original examples and adversarial examples.The retrain is performed from scratch to avoid any bias in the HMM parameters.The threshold is then updated, only if its value is increased, to the value of the maximum Hellinger distance computed using the augmented HMM on examples in the original dataset W (lines 19-22 in Algorithm 4).Again, we observe that adversarial examples are added to the augmented dataset of nominal behaviours only if they have been classified as anomalies by HHAD.They are added as nominal examples, hence the learning process remains one-class also after data augmentation.This is the main idea of our approach, and it is based on the intuition that the adversarial examples are very close to the original nominal examples, hence we consider them as misclassified by the original detector.Finally, in line 7 we consider only examples W t from the training set in all m iterations, i.e., adversarial examples are not considered as original examples to generate other examples.This guarantees that the decision boundary is not moved away from the training examples indefinitely.The generation of adversarial examples from each example in the training set is iterated m times (lines 5-23).Each time the HMM is retrained and the threshold τ updated.The algorithm augments the training set m times, iteratively, each time using adversarial examples based on the current version of the HMM.At the first iteration the HMM is the original one, trained using examples from the training set W. At the second iteration the HMM is retrained using both the examples in the original training set and the adversarial examples generated at the first iteration (all labeled as nominal).At the third iteration the HMM is retrained another time, considering the examples previously generated.Our empirical analysis shown that the main improvement in terms of F1-score of the augmented detector is achieved in the first three iterations, hence we set m = 3 in our tests.The updated HMM λN and threshold τ provide an actual performance improvement in terms of anomaly detection accuracy and other measures discussed in the next section.
).The test set is a sequence containing |T | = 3201 observations, of which 2400 are nominal and 801 anomalous.The training sets are selected in the interval between time steps 0 and 1599, and the test set in the interval between time steps 1600 and time step 4801.For each training set size, we generate 30 training sets (sampling the original dataset in different positions) and we compute mean and standard deviations of the performance in the test set.

Fig. 3 .
Fig. 3. Average F1-score (over 30 datasets) for the original detector HHAD and augmented detectors H-AUG, L-AUG, R-AUG, D-AUG, G-AUG, S-AUG on different training set sizes in the TE dataset.

F 1 )Fig. 4 .
Fig. 4. Average F1-score (over 10 datasets) for the original detector HHAD, augmented detectors H-AUG and L-AUG, and state-of-the-art (non augmented) anomaly detectors AE, LSTM-AE, and OCSVM on different training set sizes in the TE dataset.

Fig. 5 .
Fig. 5. Average F1-score (over 30 datasets) for the original detector HHAD and augmented detectors H-AUG, L-AUG, R-AUG, D-AUG, G-AUG, and S-AUG on different training set sizes in the ALFA dataset.

TABLE I SUMMARY
OF EXPERIMENTAL PARAMETERS USED IN ALL APPLICATION DOMAINS