A Generative Approach to Open Set Recognition Using Distance-Based Probabilistic Anomaly Augmentation

Machine learning (ML) algorithms that are used in decision support (DS) and autonomous systems commonly train on labeled categorical samples from a closed set. This, however, poses a problem for deployed DS and autonomous systems when they encounter an anomalous pattern that did not originate from the closed set distribution used for training. In this case, the ML algorithm that was trained only on closed set samples may erroneously identify an anomalous pattern as having originated from one of the categories in the closed set, sometimes with very high confidence. In this paper, we consider the problem of unknown pattern recognition from a generative perspective in which additional synthetic training samples that represent anomalies are added to the training data. These synthetic samples are generated to optimally balance the desire to place anomalies all along the boundary of the training set in feature space, while not adversely affecting core classification performance on the test set. We demonstrate the efficacy of the distance-based probabilistic anomaly augmentation (DPAA) proposed in this paper for a diverse set of applications such as character recognition and intrusion detection, and compare its combined classification and identification performance to both recent open set and more traditional novelty detection approaches.


I. INTRODUCTION
Novelty and outlier detection are popular approaches for recognizing anomalies and/or anomalous behavior, and are commonly used in decision support and autonomous applications such as medical diagnostics, fault detection in manufacturing processes, and fraud and intrusion detection [1]-[4]. The objective of novelty detection is to identify patterns that are not representative of the data used to train the detector, and approaches can broadly be characterized as either discriminative or generative in nature [5]. Discriminative approaches can further be categorized as statistical or distance-based [6]. Popular examples of distance-based approaches include the one-class support vector machine (OC-SVM) [7], isolation forests (IF) [8], and the deep neural network (DNN) autoencoder (AE) [9], each of which uses a distance measure from the training set boundary to identify anomalies. An OC-SVM identifies support vectors that encapsulate the training set, and rejects any new sample that falls on the other side of the boundary [10]. Autoencoders compress the data into a lower dimensional latent space, and measure the (typically 2-norm) reconstruction error after decompression to determine if the sample is an anomaly [11]. Isolation forest is a tree-ensemble method that measures anomalies by the depth of decisions before reaching a leaf node: the shallower the path and the fewer the cuts needed to reach a leaf node, the more likely it is that the sample is an anomaly [12]. Although OC-SVMs, AEs and isolation forests use different distance measures to identify anomalies, they represent some of the most mature and commonly employed discriminative approaches to novelty detection [13].
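As a concrete illustration of the distance-from-boundary idea these detectors share, the sketch below scores query points by their mean distance to the k nearest training samples and flags those above a threshold calibrated on held-out training data. This is a generic nearest-neighbor stand-in of our own, not an implementation of OC-SVM, IF or AE; all names and constants are illustrative.

```python
import numpy as np

def knn_novelty_scores(X_ref, X_query, k=5):
    """Mean distance from each query point to its k nearest reference
    samples; larger scores suggest the query is anomalous."""
    d = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    d.sort(axis=1)                       # ascending distances per query
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))            # closed-set training data
fit, cal = X[:400], X[400:]              # held-out split calibrates the threshold
tau = np.quantile(knn_novelty_scores(fit, cal), 0.99)

outliers = rng.normal(loc=8.0, size=(50, 2))   # far from the training cloud
flagged = knn_novelty_scores(fit, outliers) > tau
```

The threshold plays the same role as the OC-SVM margin or the AE reconstruction-error cutoff: it converts a continuous distance score into an accept/reject decision.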
Statistical approaches to anomaly detection, in contrast to distance-based methods, try to model the distribution of the training samples from the known categories and reject any sample that is statistically dissimilar by applying a threshold. The so-called reject option was defined in [14] to indicate when the posterior probability of a sample $x$ belonging to every class $\omega_i$ falls below a prescribed threshold $T$, i.e., $p(\omega_i | x) < T$ for all $i \in \{1, 2, \cdots, K\}$. This approach was found to be highly sensitive to the accuracy of the posterior estimates, and a per-class threshold may improve performance [15], [16]. A direct approach to modeling the multivariate distribution of the known data is to construct an empirical copula and then use this model to predict extreme events in subsequent data during operation or test to detect anomalies [17].
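The reject option reduces to a one-line rule on the posterior vector; the posteriors and threshold below are hypothetical numbers for illustration (a per-class variant would simply compare each $p(\omega_i|x)$ to its own $T_i$).

```python
import numpy as np

def classify_with_reject(posteriors, T):
    """Return the argmax class per row, or -1 (reject as anomalous)
    when no class posterior reaches the threshold T."""
    preds = posteriors.argmax(axis=1)
    preds[posteriors.max(axis=1) < T] = -1
    return preds

p = np.array([[0.70, 0.20, 0.10],
              [0.40, 0.35, 0.25],    # nearly flat posterior: rejected
              [0.05, 0.90, 0.05]])
labels = classify_with_reject(p, T=0.5)
```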
Although novelty detection is useful for accept/reject types of applications, it is not capable on its own of handling multi-category classification tasks, where the objective is to differentiate not only between known and unknown (anomalous) categories, but also between different classes within the known categories [18]. Multi-category classifiers in supervised learning applications are commonly trained on data from a closed set. So, when these classifiers are presented with data that originates from a category outside of the training set, the classifier will potentially identify this sample as originating from one of the existing categories, and not uncommonly with high confidence [19], [20]. This can have devastating consequences, such as the time in 2016 when a lack of diverse training data resulted in a series of catastrophic failures in a CNN-based vehicular autopilot [21]. In contrast to closed set classification and reject/accept type anomaly detectors, open set recognition is the process of both predicting classes from the closed set and identifying anomalies originating from the space of unknown categories at query time. In [22], open set risk minimization was formulated as

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \left\{ R_O(f) + \lambda_r R_\epsilon(f) \right\}, \quad (1)$$

which is a trade between empirical risk $R_\epsilon(f)$, or the risk of misclassification within the closed set, and open space risk $R_O(f)$, or the risk of assigning a label to the unknown space, with $\mathcal{H}$ representing the set of recognition functions and $\lambda_r$ a regularization constant. One approach to address the open set problem is the extreme value machine (EVM), which was derived using aspects of Extreme Value Theory (EVT) [23], [24] and has been shown to provide an abating bound on open set risk [25]. A core outcome of EVT is that the extreme values (tails) of a well-behaved continuous distribution can only assume a limited number of parametric forms, in particular, the Gumbel, Fréchet and reverse Weibull distributions [26].
The EVM algorithm identifies the minimum (or transformed maximum) pairwise distance of a query point to the closest sample in the closed set and uses EVT to show that this distance follows a Weibull distribution. This enables the construction of an inclusion function to determine whether the sample belongs to the class with the smallest pairwise distance, or to the unknown (anomalous) class. The parameterization of the Weibull distribution and the choice of a statistical threshold δ are obtained from the closed set samples by parametric fitting and cross validation, respectively [27]. EVM represents one of the highest performing statistical approaches to open set recognition that is capable of kernel-free, nonlinear, variable-bandwidth recognition in open set multi-category classification applications [28]. Like any approach that models a distribution, it needs a sufficient and representative set of data for parametric fitting using EVT [29]. EVM, OC-SVM, IF, copula outlier detection (COPOD) and AE represent discriminative approaches to either open set recognition or accept/reject detection. In this paper, we present a generative approach to open set recognition. The approach we take - distance-based probabilistic anomaly augmentation (DPAA) - directly addresses the open set recognition problem by reformulating (1) as a constrained optimization. The objective of DPAA is to minimize open space risk subject to an empirical risk constraint, and it can work with any classification algorithm. We comparatively demonstrate the open set recognition efficacy of DPAA on well-known multi-category data sets against state of the art discriminative and statistical approaches.
The rest of this paper is organized as follows. In Section II we review popular approaches to open set recognition. In Section III, we describe the DPAA algorithm and its formulation to address open set risk. In Section IV, we present a graphical assessment of how DPAA works, as well as a quantitative assessment of its performance, both on open set recognition tasks as compared to EVM using an $F_1$-measure, and in accept/reject applications against OC-SVM, IF, COPOD and AE. We then conclude with a brief summary.

II. RELATED WORK
The term open set recognition was popularly coined by Scheirer in [22] to refer to the scenario in machine vision applications in which not all classes present at query time are available during training. There has been a significant research thrust in the deep learning community to address this problem, and one early popular discriminative approach was OpenMax [30]. OpenMax calculates a 'mean activation' value derived from the penultimate layer of a deep learning network prior to the SoftMax output to generate an EVT-based weighting function. This weighting function modulates the SoftMax decision so that if an anomalous pattern were present, then the maximum categorical 'probability' would ideally be smaller than a threshold chosen to balance correct classification and open set rejection. An extension of OpenMax is Classification-Reconstruction learning for Open Set Recognition (CROSR) [31]. In CROSR, a deep hierarchical reconstruction net is formed in which intermediate layers of the network are compressed into latent space and reconstructed. CROSR uses this to construct a per-class distance measure that is the L2 norm difference between both the activation and latent vectors and the per-class means, using an OpenMax EVT-based framework to reject outliers. In the so-called Objectosphere approach [32], two modifications are made to the loss function during training. The first is to define an entropic loss which treats known and unknown classes during training separately. Although the known class losses remain unchanged, unknown classes have their score uniformly distributed across all the known classes, i.e., the maximum entropy response. The second modification is to create an Objectosphere loss in which known classes that have small feature vector magnitudes inside the Objectosphere boundary are penalized, as are unknown classes with large ones.
OpenMax, CROSR and Objectosphere each represent discriminative approaches to open set recognition in deep learning networks. Both OpenMax and CROSR leverage EVT, while Objectosphere modifies the loss function and requires that there be negative examples (outliers) during training.
A generative approach that builds on OpenMax is Generative OpenMax (G-OpenMax) for Multi-Class Open Set Classification [33]. Like OpenMax, G-OpenMax uses Weibull calibrated scores based on distances from the mean activation vectors in the penultimate layer of the network. However, in addition, G-OpenMax uses a conditional generative adversarial network (GAN) to create samples from the unknown category to generate well-calibrated probability scores for anomalies. Open generative adversarial networks (OpenGAN) [34] augments a classifier that already has access to open set samples with GAN generated data. Training is conducted in a manner that is similar to traditional GAN training. In the class conditioned auto-encoder (C2AE) [35], an encoder-classifier is trained in tandem and the weights are frozen. Then, the encoder (with weights frozen) and decoder are trained to generate images using a class conditional label that results in a large reconstruction error when the label does not match the class identity, and a small reconstruction error when it does. EVT is used to model the reconstruction errors with an associated threshold to identify outliers.
Each of the approaches to open set recognition described above is designed to work with deep learning networks that principally operate on images. These discriminative and generative techniques, however, were not designed to work with other high-performance machine learning algorithms, such as the Light Gradient Boosting Machine (LGBM) [36], in DS applications. Further, generative models that rely on GANs are subject to instability during training, which may require access to negative examples (outliers) during training for stabilization [34]. There are also traditional approaches to open set recognition that rely on distance-based discrimination. The Nearest Non-Outlier (NNO) algorithm [37] builds on the Nearest Class Mean (NCM) [38] classifier to identify both categorical samples and outliers based on their Euclidean distance. NNO uses the concept that non-negative combinations of abating functions (e.g., distances) can be thresholded to minimize open space risk. In [25], thresholded scores from a Weibull-calibrated SVM (W-SVM) are used to reject outliers. Traditional methods, however, may not represent the highest performing closed set classifier for the intended application.

III. TECHNICAL APPROACH
The DPAA algorithm generates synthetic anomalies at a statistically prescribed distance to the closed set boundary. This distance is balanced against a constraint that the empirical risk be no greater than a differential error rate $\epsilon$, that is, the difference between classification performance on the closed set when trained with and without anomalies. The distance used to generate and accept synthetic samples that represent anomalies is directly related to a divergence measure that quantifies the relative difference between the anomalous and closed set distributions. DPAA directly addresses (1) by bringing samples in as close as possible to the boundary (irrespective of the boundary shape) to minimize open space risk, while ensuring that the differential error rate, or empirical risk, is no greater than $\epsilon$.
To generate candidate synthetic anomalies, each of the data points in the closed set is used as a pivot, from which samples are randomly generated. The distance to the pivot from which samples are generated is directly related to the empirical distribution of distances within the closed set itself. A statistical measure of distances is used to quantify the (dis)similarity of the distribution of synthetic anomalies and closed set samples, and the choice of (dis)similarity measure is used in the process of minimizing the open set risk in (1). Because points in the 'interior' that are used as pivots will result in samples being generated inside the closed set, as opposed to on or outside its boundary, sample generation is a two-step process: 1) generate candidate synthetic anomalies and 2) accept only those candidates on the boundary. The following subsections describe both the mechanics and mathematics of each of the points above in detail, and the notation used throughout is defined in Table 1.

A. CANDIDATE SYNTHETIC ANOMALY GENERATION
To generate synthetic anomalies, we use a 1-nearest neighbor (1NN) distance measure of an anomaly with respect to the closed set, together with the distribution of 1NN distances within the closed set itself. To that end, consider the empirical cumulative distribution function (ECDF) $\hat{F}_1(d)$ of 1NN distances calculated from every training sample in the closed set to all other samples. Using these distances, a single pivot distance is selected which satisfies

$$d_p = \alpha_1 \hat{F}_1^{-1}(Q_1), \quad (2)$$

where $Q_1$ and $\alpha_1 \geq 1$ are hyperparameters that define the 1NN ECDF quantile and the offset from that quantile, respectively. In (2), $d_p$ is used as the pivot distance from which synthetic anomalies are generated as

$$\mathbf{a}_c = \mathbf{x}_i + d_p \, \boldsymbol{\eta} / \|\boldsymbol{\eta}\|_2, \quad (3)$$

where

$$\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_D) \quad (4)$$

is sampled from a D-dimensional isotropic Gaussian distribution. The intuition behind Eqs. (2), (3) and (4) is that a random D-dimensional sample is generated at a fixed pivot distance $d_p$ with respect to each of the samples in the closed set. This distance is greater than or equal to the distance of the $Q_1$th quantile (depending on the value of $\alpha_1$) of the closed set pivot and is instantiated at an angle $[\theta_1, \theta_2, \cdots, \theta_{D-1}]$ that is uniformly distributed over $(0, 2\pi]$ [39]. This process of generating anomaly candidates $\mathbf{a}_c$ is graphically illustrated in Fig. 1 for three cases of interest. The first case ($C_1$) has a synthetic sample generated at a distance $d_p$ from the pivot $\mathbf{x}_i$ that lands squarely within the closed set boundary. In cases $C_2$ and $C_3$, the synthetic samples are also generated at a distance $d_p$ from the pivot, but fall outside the closed set boundary with varying degrees of proximity. These three cases graphically illustrate the sample generation process to help visualize the intuition behind the sample anomaly generation and acceptance process, which is quantified in the next section. It is clear that the sample associated with $C_1$, landing within the boundary of the closed set, might be mistaken for a closed set sample, or vice versa, increasing empirical risk.
The anomalies $C_2$ and $C_3$ land at various distances outside the boundary, but are potentially too far away from the closed set boundary to mitigate open space risk. The question becomes: which synthetically generated anomalies should be accepted, and which should be rejected?
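A minimal NumPy sketch of the generation step described above, under our reading of (2)-(4): the pivot distance is $\alpha_1$ times the $Q_1$ quantile of the 1NN-distance ECDF, and a uniformly random direction is obtained by normalizing an isotropic Gaussian draw. Function and variable names are ours, not from the paper's code.

```python
import numpy as np

def generate_candidates(X, Q1=0.99, alpha1=1.0, seed=0):
    """One candidate anomaly per closed-set pivot, at a fixed pivot
    distance d_p derived from the 1NN-distance ECDF of X."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)            # exclude zero self-distances
    d1nn = D.min(axis=1)                   # 1NN distance of each sample
    d_p = alpha1 * np.quantile(d1nn, Q1)   # pivot distance, cf. (2)
    eta = rng.normal(size=X.shape)         # isotropic Gaussian, cf. (4)
    u = eta / np.linalg.norm(eta, axis=1, keepdims=True)
    return X + d_p * u, d_p                # candidates a_c, cf. (3)

X = np.random.default_rng(1).normal(size=(300, 2))
A_c, d_p = generate_candidates(X)
```

Every candidate lies exactly $d_p$ away from its pivot; whether it falls inside or outside the closed set boundary (cases $C_1$ versus $C_2$, $C_3$) is decided in the acceptance step.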

B. CANDIDATE SYNTHETIC ANOMALY ACCEPTANCE
The mechanics of candidate acceptance partially mirror those of candidate generation, i.e., we look for candidates whose kNN distance to the training set is sufficiently greater than the kNN distances within the training set. The mathematical justification for this observation [40] follows from the divergence between the distribution of closed set samples $f(X)$ and the distribution of synthetic anomalies $f(A)$,

$$\hat{D}\left(f(A) \,\|\, f(X)\right) = \frac{D}{S} \sum_{j=1}^{S} \log \frac{\nu_k(\mathbf{a}_j)}{\rho_k(\mathbf{a}_j)} + \log K, \quad (5)$$

where $\nu_k(\mathbf{a}_j)$ is the kNN distance from the synthetic anomaly $\mathbf{a}_j$ to the closed set, $\rho_k(\mathbf{a}_j)$ is its kNN distance within the set of anomalies, and the constant $K = N/(S-1)$. Eq. (5), which is an application of the general result in [41], states that a consistent estimate of the divergence between the training set multivariate distribution and the distribution of synthetically generated anomalies is directly related to kNN distances. Maximizing the numerator in (5), however, would push anomalies far away from the boundary of the training set, geometrically leaving a gap in open space into which anomalies could fall and remain undetected. To address this concern we consider the formulation of a constrained optimization to geometrically surround the (possibly nonconvex) training set with anomalies without suffering a significant loss in core classification performance on the test set after training. To this end, and in an analogous manner to (2), we first define a decision rule with

$$d_k = \min\{ d : \hat{F}_k(d) \geq Q_k \}, \quad (6)$$

which represents the minimum distance from the set of kNN distances that is greater than or equal to the $Q_k$th quantile, where synthetically generated candidates in $A_c$ are accepted according to the rule:

$$\text{accept } \mathbf{a}_c \iff \nu_k(\mathbf{a}_c) \geq d_k. \quad (7)$$

The problem then becomes one of deciding how to select an optimal $\alpha_1$, $Q_1$ and $d_p$ in (2) to generate synthetic candidates in (3), and $d_k$ and $Q_k$ to screen anomalous candidates in (7).
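Under the same reading, the acceptance rule (6)-(7) amounts to thresholding each candidate's kNN distance to the closed set at the $Q_k$ quantile of the within-set kNN distances. The schematic sketch below (our own naming and defaults) keeps only candidates outside that threshold.

```python
import numpy as np

def accept_candidates(X, A_c, k=4, Qk=0.99):
    """Accept candidates whose kNN distance to the closed set X is at
    least d_k, the Qk-quantile of the kNN distances within X."""
    Dxx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(Dxx, np.inf)      # exclude zero self-distances
    Dxx.sort(axis=1)
    d_k = np.quantile(Dxx[:, k - 1], Qk)   # threshold, cf. (6)
    Dax = np.linalg.norm(A_c[:, None, :] - X[None, :, :], axis=2)
    Dax.sort(axis=1)
    return A_c[Dax[:, k - 1] >= d_k]       # rule (7)

X = np.random.default_rng(2).normal(size=(300, 2))
near = np.zeros((10, 2))               # deep in the interior: rejected
far = np.full((10, 2), 10.0)           # well outside the boundary: accepted
A = accept_candidates(X, np.vstack([near, far]))
```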
To select these parameters, let $f_{X \cup A}(\cdot)$ and $f_X(\cdot)$ represent the classifiers trained with and without the synthetically generated anomalies, respectively, and $\mathbb{I}(\cdot)$ the indicator function defined as

$$\mathbb{I}(f(\mathbf{x}) = y) = \begin{cases} 1, & f(\mathbf{x}) = y \\ 0, & \text{otherwise,} \end{cases} \quad (8)$$

with

$$P_X(\tilde{X}) = \frac{1}{|\tilde{X}|} \sum_{(\mathbf{x}, y) \in \tilde{X}} \mathbb{I}(f_X(\mathbf{x}) = y), \qquad P_{X \cup A}(\tilde{X}) = \frac{1}{|\tilde{X}|} \sum_{(\mathbf{x}, y) \in \tilde{X}} \mathbb{I}(f_{X \cup A}(\mathbf{x}) = y), \quad (9)$$

where $|\cdot|$ represents the cardinality of the set, and $P_X(\tilde{X})$ and $P_{X \cup A}(\tilde{X})$ represent the classification accuracy on the test set $\tilde{X}$. Then, with $\mathbf{c} = [1, 1, 1]^T$ and $\boldsymbol{\theta} = [\alpha_1, Q_1, Q_k]^T$, the constrained optimization

$$\min_{\boldsymbol{\theta}} \ \mathbf{c}^T \boldsymbol{\theta} \quad \text{subject to} \quad P_X(\tilde{X}) - P_{X \cup A(\boldsymbol{\theta})}(\tilde{X}) \leq \epsilon \quad (10)$$

finds the tightest boundary around the training set on which to place anomalies while ensuring that the classification performance on the test set, when the classifier is trained with and without anomalies, suffers a differential error rate no greater than $\epsilon$. This effectively trades empirical risk (differential error rate) against open space risk (tightness of boundary) from (1) in the form of the constrained optimization in (10).
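The accuracy terms in (9) and the constraint in (10) are simple to compute once the two classifiers' predictions are in hand; the predictions below are hypothetical stand-ins for $f_X$ and $f_{X \cup A}$.

```python
import numpy as np

def accuracy(pred, truth):
    """P(.) as in (9): mean of the indicator I(pred == truth)."""
    return np.mean(pred == truth)

# Hypothetical predictions on a 10-sample test set from classifiers
# trained without (f_X) and with (f_XA) synthetic anomalies
truth = np.array([0, 1, 1, 0, 2, 2, 1, 0, 2, 1])
f_X   = np.array([0, 1, 1, 0, 2, 2, 1, 0, 2, 0])   # 9/10 correct
f_XA  = np.array([0, 1, 1, 0, 2, 1, 1, 0, 2, 0])   # 8/10 correct
diff_err = accuracy(f_X, truth) - accuracy(f_XA, truth)   # 0.1
eps = 0.15
feasible = diff_err <= eps    # the empirical-risk constraint of (10) holds
```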

C. A PRACTICAL PROCEDURE FOR CANDIDATE GENERATION AND ACCEPTANCE
Although (10) quantifies a mathematically precise way of optimizing the generation of synthetic anomalies, a closed-form solution may not be possible due to the fact that $P_{X \cup A(\boldsymbol{\theta})}(\tilde{X})$, which is calculated using a classifier trained with synthetically generated anomalies, is a highly nonlinear function of $\boldsymbol{\theta}$, cf. (2), (6), (7) and (9). To address this challenge, a fixed point iterative approach to approximately solve (10) is summarized in Algorithm 1. The parameters that control the generation and placement of synthetic anomalies are the same as those in (10), namely $\alpha_1$, $Q_1$ and $Q_k$. These parameters form the core search space over which Alg. 1 operates. The parameters $\alpha_1$ and $Q_1$ influence the generation of samples in (2), where increasing these parameters pushes samples in $A_c$ further from the boundary of the training set, making it more likely that the candidates are accepted into the set $A$. In concert with $\alpha_1$ and $Q_1$, $Q_k$ controls the likelihood of candidates from $A_c$ being accepted into $A$.
Consider first the adaptive selection of $Q_1$ and $\alpha_1$. A modified binary search over the values $Q_1$ and $\alpha_1$ is used to find the minimum number of samples that meet the candidate selection requirement as specified in (7). Once the candidates are selected, a test is used to determine if the differential error is within the range $[\epsilon - \Delta\epsilon, \epsilon]$. If the differential error is outside of this range, then $Q_k$ is adjusted. This is in contrast to (10), where only a single differential error is specified. The rationale is as follows. In the constrained optimization approach as specified in (10), minimizing $Q_1$, $\alpha_1$ and $Q_k$ brings samples in towards the boundary, and this is balanced against the constraint that the addition of synthetic samples to the training dataset preserve core classification performance, i.e., that the differential error is no greater than $\epsilon$. This is approximately captured in the upper and lower bounds on the differential error in Alg. 1. If the upper bound $\epsilon$ is violated, $Q_k$ is adaptively increased, which forces the selection of samples that are further away from the boundary of the training set. Conversely, if the minimum error $\epsilon - \Delta\epsilon$ is violated, then samples are selected that lie closer to the boundary by reducing $Q_k$. A binary search is used to adaptively adjust $Q_k$ so that samples fall within the differential error window $[\epsilon - \Delta\epsilon, \epsilon]$. As $\Delta\epsilon \to 0$, Alg. 1 more closely approximates (10), but this comes at the expense of an increase in search time. We have found that a $\Delta\epsilon$ that is between 25% and 35% of $\epsilon$ works well in practice, and this topic is expanded on further in Section IV where results are presented. The initialization parameters used to seed the search are listed in Table 2.
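The adaptive adjustment of $Q_k$ can be sketched as a bisection that stops once the differential error lands in $[\epsilon - \Delta\epsilon, \epsilon]$. In Alg. 1 the differential error comes from retraining the classifier at each step; here a toy monotone surrogate stands in so that only the control flow is shown, and all names and bounds are illustrative.

```python
def tune_Qk(diff_error, eps, delta, lo=0.9, hi=1.0, iters=40):
    """Bisect Q_k until diff_error(Q_k) lands in [eps - delta, eps].
    diff_error is assumed to decrease as Q_k grows (samples move
    further from the boundary); in Alg. 1 it would be measured by
    retraining the classifier, here it is any callable."""
    Qk = 0.5 * (lo + hi)
    for _ in range(iters):
        Qk = 0.5 * (lo + hi)
        e = diff_error(Qk)
        if e > eps:               # too aggressive: push samples outward
            lo = Qk
        elif e < eps - delta:     # too conservative: pull samples inward
            hi = Qk
        else:
            return Qk
    return Qk

# Toy surrogate: differential error falls linearly as Q_k -> 1
surrogate = lambda Qk: 0.5 * (1.0 - Qk)
Qk = tune_Qk(surrogate, eps=0.01, delta=0.0035)
```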

D. CATEGORICAL DATA AND NORMALIZATION
Data normalization is commonly employed in machine learning when the features used for training and classification are on different scales. In the examples that follow, we use standard normalization

$$\tilde{x}_i = \frac{x_i - \mu_i}{\sigma_i}, \quad (11)$$

where $\mu_i$ is the mean value of feature $i$ and $\sigma_i$ its standard deviation. Some data sets will also include categorical features, i.e., features that can only take on a finite and discrete set of values. In this case, the features generated using (3) are snapped back to a grid after acceptance using (7) such that

$$a_n \leftarrow \arg\min_{v \in V_n} |a_n - v|, \quad (12)$$

where $a_n$ is the $n$th feature of an accepted anomaly $\mathbf{a} \in A$ and $V_n$ is the range of possible values for the corresponding categorical, integer or discrete valued feature.
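The two preprocessing steps reduce to a few lines; `snap_to_grid` reflects our illustrative reading of the snapping rule, with the per-feature value set `allowed` an assumed example.

```python
import numpy as np

def standardize(X):
    """Standard normalization: (x - mu) / sigma, per feature column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def snap_to_grid(a, allowed):
    """Snap each entry of a generated feature vector to the nearest
    allowable (categorical/integer/discrete) value."""
    return np.array([allowed[np.argmin(np.abs(allowed - v))] for v in a])

allowed = np.array([0.0, 1.0, 2.0, 3.0])      # e.g. an integer-valued feature
snapped = snap_to_grid(np.array([0.2, 1.7, 2.9]), allowed)
```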

E. DISCUSSION
The question of how to generate and where to place anomalies is not fully answered by (5), but it does offer quantitative evidence for the idea that, from an information-theoretic perspective, kNN distance is a measure of statistical (dis)similarity. Indeed, maximizing (5) would result in samples with maximum dissimilarity, but this would leave a region in feature space where an anomaly encountered in the field would have a KL divergence measure 'closer' to the training set than that of the synthetically generated anomalies. Instead, we consider the question of how to minimize the numerator in (5) (given the denominator is fixed) while maximizing the likelihood that samples from the (closed) test set are correctly classified. In theory, it is possible to turn this into a constrained optimization, with the constraint that the differential error rate between a classifier trained with and without synthetic anomalies is less than a prescribed threshold $\epsilon$, cf. (10). This quantitatively addresses the question posed in Sec. III-B of how to find candidates with kNN distances sufficiently different from those in the training set, while also balancing the empirical risk. The choice to use accuracy in (9) could be replaced by any other measure, such as an $F_\beta$-measure (e.g., $F_1$) or AUC. Given the highly nonlinear nature of the constrained optimization, the pseudo-code for a computationally efficient approximation to (10) is presented in Alg. 1, and this is the algorithm that was used to obtain the results presented in Section IV.

IV. COMPARATIVE PERFORMANCE

A. VISUALIZING ANOMALY GENERATION
To get a visual sense of how anomalies are generated and the impact that the choice of fixed and searchable hyperparameters has on performance, we use the so-called Banana dataset from Google's standard classification library [42], which is plotted for reference in Fig. 2. There are two categories, 0 and 1, in the Banana dataset, with each category containing roughly 2500 samples for a combined dataset size of 5000 samples. A total of 5000 synthetic anomalies were generated to match the size of the Banana dataset. To derive the baseline performance $P_X(\tilde{X})$, an XGBoost classifier [43] was trained and tested on the two-class dataset, with a training set $X$ representing 80% of samples chosen at random, and a test set $\tilde{X}$ made up of the remaining 20% of the samples. The first set of synthetic samples $A$ derived from the Banana dataset is plotted in Fig. 3. The differential error $\epsilon = .01$ and offset $\Delta\epsilon = .0035$ were chosen, resulting in a differential error range of $[.0065, .01]$, and the directed optimization of Alg. 1 recovered the ECDF quantile parameters $Q_1 = 1$, $Q_k = 0.9995$ for $k = 4$ and an $\alpha_1 = 1$. The synthetic anomalies generated all landed on the boundary of the Banana dataset, but at a sufficient distance so that the classifier (in this case XGBoost) could differentiate them from the original two categories {0, 1} at a loss no greater than $\epsilon = .01$ in classification performance. The second set of synthetic samples derived from the Banana dataset is plotted in Fig. 4, but this time with a differential error range of $[.0325, .05]$, which resulted in the ECDF quantile parameters $Q_1 = 1$, $Q_k = 0.995$ and an $\alpha_1 = 1$. In this case, synthetic anomalies tightly hugged the boundary of the Banana dataset, resulting in more samples from the test set $\tilde{X}$ being mistaken for anomalies, but at a rate no greater than $\epsilon = 0.05$.

B. QUANTITATIVE ASSESSMENT -OPEN SET RECOGNITION
Although the Banana dataset was useful to help visualize how the DPAA algorithm works, it is not sufficient to measure DPAA performance. Therefore, to test the efficacy of DPAA, we compared its performance to that of EVM operating on high dimensional multi-category datasets using an $F_1$-measure as a function of precision and recall, defined as

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \quad (13)$$

where $TP$, $FP$ and $FN$ represent the true positive, false positive and false negative counts, respectively, and from which the harmonic mean, or $F_1$-score, is derived. In addition to the $F_1$-score, we also measure the ability of DPAA to both detect anomalies (probability of detection, or $P_D$) and to not falsely assign samples as anomalies (probability of false alarm, or $P_{FA}$). We consider $P_D$ and $P_{FA}$ separately from the overall $F_1$-measure for two reasons. First, there may be a significant difference in cost associated with how anomalies are processed, both in terms of their detection and misclassification. Second, $P_D$ and $P_{FA}$ are reflective of an accept/reject type of detector as discussed in Section I, and not multi-category classification, which includes an outlier category. To start, we used the OLETTER dataset developed in [25], which was used to demonstrate EVM's performance in open set recognition in [24]. The test we performed started with training both EVM and DPAA on 20 of the 26 letters selected at random, where the 6 letters that were held out from the training set were included in the testing set.

[Figure caption: Relative performance of open set recognition for both DPAA and EVM on the OLETTER data set. $F_1$ performance for DPAA is measured with respect to the top axis in terms of a differential error rate. For EVM, $F_1$ performance is measured with respect to the bottom axis in terms of a probability threshold $\delta$ as described in [24].]
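For reference, the $F_1$ computation from raw counts as defined in (13); the counts in the example are made up.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, cf. (13)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

score = f1_from_counts(80, 10, 20)   # precision 0.889, recall 0.8
```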
An 80/20 train/test split was used for the 20 letters prior to the addition of the 6 letters in the test set, and 10-fold cross validation was used to obtain average performance results. The EVM code used both for hyper-parameter optimization and to obtain the results that follow was obtained from the authors' GitHub repository [44]. As for the Banana dataset, DPAA used XGBoost for $f_X(\cdot)$ and $f_{X \cup A}(\cdot)$ in (9) to measure the differential error, $P_X(\tilde{X}) - P_{X \cup A(\boldsymbol{\theta})}(\tilde{X})$. The open set performance of DPAA and EVM operating on the OLETTER test set after training is plotted in Fig. 5.
To obtain results for DPAA, we initialized the parameters in Alg. 1 with those listed in Table 2 and varied the differential error $\epsilon$ over the values in Table 3 to obtain the results in Figs. 5-10. For comparison, we swept the EVM threshold $\delta$ in [24] over a range of values, which has the effect of trading off $P_D$ for $P_{FA}$ in an analogous manner to $\epsilon$ for DPAA. In general, both DPAA and EVM followed similar trajectories in terms of the $F_1$-score in the open set recognition task, with DPAA outperforming EVM in all cases.
Although the $F_1$-score is a comprehensive snapshot of open set recognition, it was also of interest to measure the performance of both algorithms in an accept/reject framework. Because of the wide availability of curated outlier detection algorithms that come bundled in the Python library PyOD [45], we chose to include a representative set of four of these PyOD routines for comparison: Isolation Forest (IF), AutoEncoder (AE), One Class Support Vector Machine (OC-SVM) and Copula Outlier Detection (COPOD). Each of these PyOD routines has a detection threshold that is controlled by a single parameter, contamination, which for our tests varied from 0 to 0.175 in steps of .025. The AE feedforward deep learning architecture was modified from the default to fit the dimensionality of the OLETTER dataset, such that the number of hidden neurons per layer was [15, 8, 4, 8, 15]. In all other cases, default parameter settings were used for IF, AE, OC-SVM and COPOD. The probability of detection versus false alarm for all routines is plotted in Fig. 6. It was interesting to note that the PyOD routines, whose only objective is accept/reject, performed quite poorly relative to both EVM and DPAA on this dataset. Similar to the open set recognition performance using the $F_1$-score, DPAA detection performance was greater than or equal to that of EVM for any chosen $P_{FA}$ in our tests.
It is interesting to take a deeper look at the open set recognition performance of DPAA. In Fig. 5, DPAA has its peak $F_1$ performance at nearly 90%, with a corresponding differential error rate of roughly 8%. One may ask: how is it that a relatively high differential error rate leads to the highest $F_1$-score? The reason is that the differential error rate is measured with respect to only the closed set samples in the test set, while the $F_1$-measure is computed with respect to both the closed set samples and the anomalies. This observation will hold in the results that follow.
In addition to the OLETTER dataset, we tested DPAA on the so-called multi-feature Fourier (mfeat-fourier) multi-category dataset, which is publicly available on the UCI curated multi-category classification website [46]. The mfeat-fourier dataset has 76 features describing the handwritten digits 0-9, and serves as an analogue to the alphabetic OLETTER dataset. Similarly to OLETTER open set testing, we randomly removed a single digit from the training set and used it in testing. Like OLETTER, an 80/20 train/test split was used for the 9 remaining handwritten digits prior to the addition of the 10th digit in the test set, and 10-fold cross validation was used to obtain average performance results. The open set performance of DPAA and EVM operating on the mfeat-fourier test set after training is plotted in Fig. 7. As was true for the OLETTER dataset, DPAA outperformed EVM in nearly all cases. For an accept/reject mode of operation, DPAA achieved a higher detection rate $P_D$ for a given false alarm rate $P_{FA}$ than both EVM and all of the routines from the PyOD library. In this case, however, OC-SVM performance was comparable to that of EVM and significantly outperformed the other outlier detection algorithms.

C. CATEGORICAL DATA
Finally, we consider the problem of intrusion detection from both an accept/reject and open set recognition perspective using the NSL-KDD dataset [47]. This dataset is distinctly different from either the OLETTER or multi-feature Fourier character-based datasets, given that all of the features are categorical, discrete or integer in nature, and we leverage (12) in the process of anomaly generation. For this dataset, in addition to 'Normal' network traffic, we chose to include Probe, Denial of Service (DoS) and Remote-to-Local types of attacks, with the attack types tabulated in Table 4. The DoS attack pod was held out from the training set and included in the test set, as it is one of the most challenging attacks to identify. We used an identical train/test split and cross validation procedure to the one used for both the OLETTER and multi-feature Fourier datasets, with the exception that pod was always held out during training but included during test. The $F_1$ performance of both DPAA and EVM is plotted in Fig. 9, with DPAA significantly outperforming EVM in recognizing both normal traffic and the 14 categories of attack, including pod, which as previously mentioned was held out from the training set but included in the test set.

[Figure caption: Relative performance of open set recognition for both DPAA and EVM on the multi-feature Fourier data set. $F_1$ performance for DPAA is measured with respect to the top axis in terms of a differential error rate. For EVM, $F_1$ performance is measured with respect to the bottom axis in terms of a probability threshold $\delta$ as described in [24].]
We also ran both DPAA and EVM in an accept/reject mode of operation and compared their performance with respect to the four other PyOD routines, with the results plotted in Fig. 10. Both DPAA and EVM had a significantly lower false alarm rate than OC-SVM, IF, AE and COPOD, making them both far more useful in scenarios where a relatively low false alarm rate is critical for successful system operation, so that a significant number of normal packets are not inadvertently blocked.

[Figure caption: Relative performance of open set recognition for both DPAA and EVM on the NSL-KDD intrusion detection data set. $F_1$ performance for DPAA is measured with respect to the top axis in terms of a differential error rate. For EVM, $F_1$ performance is measured with respect to the bottom axis in terms of a probability threshold $\delta$ as described in [24].]

V. SUMMARY
This paper presented a distance-based probabilistic anomaly augmentation (DPAA) approach to address the open set recognition problem. DPAA generates samples to encapsulate the (possibly non-convex) training set in feature space. This generative approach is directly formulated to minimize open space risk subject to an empirical risk constraint, and its optimization mechanics are grounded in an information theoretic and statistical measure of closeness. Using a representative ensemble of datasets, DPAA demonstrated superior performance in both novelty detection and open set recognition against some of the highest performing state-of-the-art algorithms, with the flexibility to work in concert with any classifier.