Negative Selection Algorithm Based on Antigen Density Clustering

The negative selection algorithm (NSA) is one of the basic algorithms of the artificial immune system. In the traditional negative selection algorithm, candidate detectors are randomly generated without considering the uneven distributions of self-antigens and nonself-antigens, which produces many redundant detectors and makes it difficult for these detectors to fully cover the area of nonself-antigens. To overcome the problem of low detector generation efficiency, a negative selection algorithm based on antigen density clustering (ADC-NSA) is proposed in this paper. The algorithm divides the process of detector generation into three steps: The first step is to calculate the density of the antigens via antigen density clustering and to select nonself-clusters. The second step is to preferentially use the abnormal points (nonself-antigens that are not clustered) as the centers of candidate detectors and to generate detectors via calculation. The third step is to generate detectors via the traditional algorithm. Generating detectors in these three steps reduces the randomness of detector generation in the traditional algorithm, thereby improving the efficiency of detector generation. The experimental results demonstrate that, on the BCW and KDD-Cup datasets, the negative selection algorithm based on antigen density clustering can effectively increase the detection rate while reducing the false-positive rate compared with the traditional negative selection algorithm (RNSA) and two improved algorithms at the same expected coverage.


I. INTRODUCTION
The artificial immune system (AIS) is a computational paradigm that is inspired by the biological immune system [1]. The artificial immune system has been widely used in computer security, anomaly detection and prediction [2], [3], [23], [24]. The negative selection algorithm is one of the basic algorithms of AIS. This algorithm was proposed by Forrest [4] in 1994 and has been applied to intrusion detection and data classification.
The negative selection algorithm has been widely used in the fields of network intrusion detection, spam detection, medical diagnosis, and fault detection [5]-[8]. However, the negative selection algorithm has the disadvantages of high detector repeat coverage and loopholes [9]-[11]. In response to these disadvantages, many scholars have proposed improved algorithms. For example, Gonzalez et al. [12] proposed a real-valued negative selection algorithm (RNSA) with an immutable radius. Antigens and antibodies belong to the [0, 1]^n value space, which maximizes the coverage of nonself-regions to improve the detection efficiency. Ji and Dasgupta [13] proposed a variable-radius negative selection algorithm (V-Detector). The main strategy is to randomly generate the detector center x, find the self-antigen nearest to x, calculate the distance r between them, and dynamically generate detectors with x as the center and r as the radius. Chen et al. [14] proposed a negative selection algorithm based on hierarchical clustering of self-sets (CB-RNSA). After the hierarchical clustering of self-sets, cluster centers are used to replace self-points, which effectively reduces the computational cost of distance calculations. Liu et al. [15] proposed SDS-RNSA, which uses a subspace density search algorithm to calculate the sample subspace region and directly generates detectors in that subspace to increase the detection rate.
Zhengjun et al. [16] proposed a negative selection algorithm based on soft subspace clustering of antigens (ASSC-NSA), which uses clustering to calculate the key features and weights of various types of antigens, thereby reducing the influence of redundant features on the detector and effectively guiding the generation of mature detectors. In 2018, Abid et al. [10] designed a layered real-valued NSA (LRNSA). Different layers are formed according to the distances of candidate detectors from the self-antigens, and the detectors belonging to the far-self layer are generated by a clustering optimization method. In addition, the algorithm keeps the detectors apart from one another to maximize the coverage and decrease the number of mature detectors. In 2019, Fan et al. [11] proposed ASTC-RNSA. The algorithm first uses the Delaunay triangulation method from the perspective of computational geometry to divide the space into simple units for determining the positions of the detectors. Then the overlap between the simple units and the self-antigens is removed to form a set of triangulation coverage areas, and finally detectors are generated within this area. This avoids the time-consuming self-tolerance process of traditional NSAs.
According to the discussion above, the focus of improving the NSA algorithm has been on the efficient generation of detectors. Many scholars have improved the traditional algorithm in response to the problems of high redundancy and loopholes that are caused by the random generation of candidate detectors via the traditional algorithm.
This paper proposes a negative selection algorithm that is based on antigen density clustering (ADC-NSA). Partial detectors are generated via density clustering to reduce the repeated coverage and loopholes that are caused by randomly generated detectors in the traditional algorithm. In this paper, the generation of a detector via antigen density clustering is divided into three steps: the first step is to cluster the antigens via the antigen density clustering algorithm and to select the clustered nonself-clusters as mature detectors; the second step is to use nonself-antigens that are not clustered as abnormal points to generate mature detectors via training; and the third step is to use the traditional algorithm to randomly generate candidate detectors and to train them into mature detectors. This process reduces the generation of redundant detectors and enables the algorithm to cover the nonself-area with as few detectors as possible, thereby effectively overcoming the problems of high detector repeat coverage and loopholes.

II. PROBLEM DESCRIPTION
Traditional negative selection algorithms generate mature detectors (antibodies) by judging whether a candidate detector matches the self-antigens. Then, the data to be detected are matched against the mature detectors. If the matching succeeds, the data are abnormal. The basic definitions and process of the algorithm are as follows:
Definition 1: Antigen set. The antigen set is A_g = {x_1, x_2, x_3, ..., x_n}, x_i ∈ [0, 1], where n represents the total number of sample points, x_i represents the normalized value of sample point i, and A_g represents the set of normalized values of all sample points.
Definition 2: Self-antigen and nonself-antigen. A self-antigen self ∈ A_g represents a positive sample, and a nonself-antigen nonself = A_g − self represents a negative sample. The area that is covered by the self-antigens within the value space is called the self-region, and the uncovered area is called the nonself-region.
Definition 3: Affinity. The Euclidean distance between two points represents the affinity between them:

dist(x_i, x_j) = sqrt( Σ_{d=1}^{D} (x_i^d − x_j^d)^2 ),

where x_i and x_j represent the i-th and j-th sample points, d indexes a feature dimension of the sample points, D represents the total number of feature dimensions, and x_i^d represents the d-th dimension feature of the i-th sample point.
Definition 4: Detectors. A detector is denoted by d e (z i , r i ), where z i represents the randomly generated candidate detector center, r i represents the distance from the center to its nearest self-cell, and the circle that is formed by z i and r i corresponds to the mature detector.
As illustrated in Figure 1, the traditional negative selection algorithm simulates the negative selection process in which the immune system recognizes self-cells and nonself-cells. The algorithm randomly generates candidate detectors and, by removing the detectors that match self-antigens, retains the detectors that can detect nonself-cells. Finally, a mature detector set is generated for data detection. The main advantages are that no prior knowledge is required and that an unlimited number of nonself-antigens can be detected with a limited number of self-antigens [17], [18]. The main disadvantage is that the traditional negative selection algorithm generates random detectors in the detector generation step [11], [19], which leads to problems such as high repeat coverage and loopholes. In Figure 2, the pentagram represents the self-set, and the open circles represent the mature detectors. All regions except the self-set are nonself-regions. The traditional negative selection algorithm expects the detectors to cover as many nonself-regions as possible. However, when the antigens are unevenly distributed in the sample space, the gaps between sample points are narrow where the antigens are densely distributed, which hinders the efficient generation of detectors. Where antigens are sparse, the randomly generated candidate detectors will inevitably be highly redundant, thereby resulting in high repeat coverage of the detectors, and loopholes will form in areas that are difficult to cover.
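The random generation loop described above can be sketched as follows. This is an illustrative V-Detector-style sketch, not the paper's code: the function name, the fixed `self_radius` self-tolerance threshold, and the seeded generator are our assumptions.

```python
import numpy as np

def generate_detectors_vdetector(self_points, n_detectors, self_radius=0.05, rng=None):
    """Traditional random detector generation (sketch): draw random centers
    in [0, 1]^dim, set each radius to the distance to the nearest self-antigen,
    and discard candidates whose radius is too small (i.e., inside the self-region)."""
    if rng is None:
        rng = np.random.default_rng(0)
    dim = self_points.shape[1]
    detectors = []
    while len(detectors) < n_detectors:
        z = rng.random(dim)  # random candidate center in [0, 1]^dim
        # radius = distance from the candidate center to the nearest self-antigen
        r = np.min(np.linalg.norm(self_points - z, axis=1))
        if r > self_radius:  # self-tolerance: keep only centers away from self
            detectors.append((z, r))
    return detectors
```

Because the centers are drawn uniformly at random, detectors pile up in empty areas and repeatedly cover each other, which is exactly the redundancy problem the paper targets.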

III. ADC-NSA ALGORITHM IMPLEMENTATION STRATEGY
The traditional negative selection algorithm does not consider the uneven distribution of antigens in the sample space [20]; as a result, detectors cover each other and cause substantial redundancy. To overcome this challenge, ADC-NSA is proposed in this paper. First, the clustering algorithm is used to identify high-density regions, and the clustered nonself-clusters are directly used as mature detectors. Second, in the low-density regions, the abnormal points (nonself-antigens that are not clustered) are preferentially used as the candidate detector centers to generate detectors after calculating the radius. Finally, candidate detectors are randomly generated via the traditional algorithm. Generating detectors in these three steps reduces the randomness of detector generation in the traditional algorithm, thereby effectively overcoming the problems of high detector repeat coverage and loopholes and improving the efficiency of detector generation.
A. ADC-NSA BASIC DEFINITIONS
Definition 1: The Euclidean distance d_ij between sample points x_i and x_j is:

d_ij = sqrt( Σ_{d=1}^{D} (x_i^d − x_j^d)^2 ).  (1)

The antigen set A_g = {x_1, x_2, x_3, ..., x_n} contains n sample points. Each sample point has D-dimensional feature attributes and is expressed as x_i = (x_i^1, x_i^2, ..., x_i^D), where x_i^d represents the d-th dimension feature of the i-th sample point. The distance between any two sample points x_i and x_j in the antigen set is calculated using the Euclidean distance, which is also called antigen affinity calculation.
Definition 2: The local density ρ_i of the sample point x_i is:

ρ_i = Σ_{j≠i} χ(d_ij − d_c),  (2)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise, and d_c represents the cutoff distance (clustering radius). In [25], [27], d_c was selected so that the average number of neighbors of each sample point was about 2% of the total number of sample points. This function is used to calculate the true density of x_i.
Definition 3: The sample point distance δ_i is defined as follows. If ρ_i is not maximal,

δ_i = min_{j: ρ_j > ρ_i} d_ij;  (3)

if ρ_i is maximal,

δ_i = max_j d_ij.  (4)

Definition 4: The cluster center c_i is determined by the size of the cluster center weight γ_i. Sort the γ_i in descending order and set the sample points that correspond to the first K values of γ_i as the cluster centers c_i [26].
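The local density ρ_i and the distance δ_i from Definitions 2 and 3 can be computed directly. The sketch below is our own reconstruction of the standard density-peaks statistics with the hard cutoff kernel; the function name is illustrative.

```python
import numpy as np

def density_peaks_stats(X, d_c):
    """Compute rho_i = #{j != i : d_ij < d_c} and
    delta_i = min distance to any point of higher density
    (or the maximum distance to any point, if rho_i is maximal)."""
    # pairwise Euclidean distances d_ij
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    rho = (D < d_c).sum(axis=1) - 1  # subtract 1 to exclude d_ii = 0
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]  # points with strictly larger density
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    return rho, delta
```

Points with large ρ and large δ are cluster-center candidates (large γ = ρδ), while points with small ρ and relatively large δ are the abnormal points of Definition 6.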
Definition 5: The cluster center weight γ_i is defined as:

γ_i = ρ_i δ_i.  (5)

Definition 6: The abnormal point a_i is determined according to δ_i and ρ_i. A point with small ρ_i and relatively large δ_i is called an abnormal point. In this paper, the nonself-antigens that satisfy these conditions and are not clustered are called abnormal points, which are preferred as the candidate detector centers.
Definition 7: The cluster discriminant F_i is defined by formula (6): F_i = 1 if the proportion of nonself-antigens among the members of cluster i is at least ε, and F_i = 0 otherwise. If F_i = 1, the cluster is a nonself-cluster; otherwise, it is a self-cluster. Here, ε is the category judgment threshold and is set to 0.99 in this experiment.
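A minimal sketch of the cluster discriminant follows; the exact algebraic form of F_i is paraphrased from the definition above rather than quoted from the paper, and the label strings are our assumption.

```python
def cluster_discriminant(labels_in_cluster, eps=0.99):
    """Definition 7 (sketch): return 1 (nonself-cluster) when the fraction of
    nonself-antigens among the cluster's members reaches the threshold eps,
    and 0 (self-cluster) otherwise."""
    nonself = sum(1 for lab in labels_in_cluster if lab == "nonself")
    return 1 if nonself / len(labels_in_cluster) >= eps else 0
```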
Definition 8: The expected coverage c_p is defined via the statistical hypothesis test of formula (7). The termination condition of the calculation is Flag = c_p. If Flag = −1, set t = m = 0 and start counting again. Z_a is a very small constant; in this experiment, Z_a is set to 0.001, and G > max(5/p, 5/(1 − p)).
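The coverage-termination test behind Definition 8 can be sketched as follows. Since formula (7) is not reproduced in the text, this is our reconstruction of the V-Detector-style one-sided proportion test [13]: after t random probes of which m were already covered, terminate when the estimated coverage significantly exceeds c_p. The function name, and the use of Z_a directly as the test threshold, are assumptions.

```python
import math

def coverage_reached(m, t, c_p, z_a=0.001):
    """Decide whether the detectors' estimated coverage m/t exceeds the
    expected coverage c_p, via a normal approximation to the proportion test."""
    if t == 0:
        return False
    p_hat = m / t  # fraction of probes that fell inside existing detectors
    se = math.sqrt(c_p * (1 - c_p) / t)  # standard error under H0: p = c_p
    z = (p_hat - c_p) / se
    return z > z_a  # one-sided: terminate only when coverage clearly exceeds c_p
```

The side condition G > max(5/p, 5/(1 − p)) in the definition guarantees enough samples for this normal approximation to be valid.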

B. ADC-NSA ALGORITHM
The basic process of the negative selection algorithm based on antigen density clustering is presented as Algorithm 1 and Figure 3:
<1>: Normalize the antigen set A_g.
<2>: Cluster the antigens via the antigen density clustering algorithm (Algorithm 2) to obtain the nonself-cluster centers c_i and the abnormal points a_i.
<3>: Add the nonself-clusters that are composed of c_i and d_c as the first type of mature detectors into the set Detectors.
<4>: Preferentially select each abnormal point a_i as the center of a candidate detector, calculate the distance between a_i and the nearest self-antigen, record it as R, and add the circle with a_i as the center and R as the radius as the second type of mature detector to Detectors.
<5>: Randomly generate candidate detectors within the value space, and add the third type of mature detectors, which are generated via the traditional detector generation algorithm, to Detectors.
<6>: When c_p is reached, terminate the algorithm; the generation of Detectors ends.

1) ANTIGEN DENSITY CLUSTERING ALGORITHM
The antigen density clustering algorithm clusters antigens based on the density of the antigen distribution, in preparation for the generation of the detectors. The algorithm is based on the following assumptions: (1) The density of the clustering center point is higher than that of the surrounding sample points. (2) The distance between the clustering center point and the higher density point is relatively large.
The process of the antigen density clustering algorithm is presented as Algorithm 2:

Algorithm 2 Antigen Density Clustering Algorithm
Input: antigen set A_g = {x_1, x_2, x_3, ..., x_n} and cutoff distance d_c
Output: nonself-clustering centers c_i and abnormal points a_i
<1>: Calculate the Euclidean distances d_ij according to formula (1).
<2>: Calculate the local density ρ_i of each sample point according to Definition 2.
<3>: Calculate the sample point distance δ_i according to Definition 3.
<4>: Select the nonself-clustering centers c_i according to Definitions 4, 5, and 7.
<5>: Select the abnormal points a_i that satisfy the condition in Definition 6.

2) DETECTOR GENERATION ALGORITHM
The generation of the detectors is divided into three main steps: (1) Use the nonself-clusters that are calculated via antigen density clustering as detectors. (2) Preferentially use the abnormal points a_i as candidate detector centers to generate detectors via calculation. (3) Use the traditional algorithm to generate the remaining detectors. The generation process of the detectors is illustrated in Figure 4, and the process is presented as Algorithm 3. If a detector that is composed of an abnormal point a_i and a radius r_i includes other abnormal points a_j, then a_j is removed. If a randomly generated candidate detector center falls inside the self-antigens or another mature detector, it is removed, and the number of repeated detectors becomes m + 1; otherwise, the number of mature detectors becomes t + 1.

Algorithm 3 Detector Generation Algorithm
Input: expected coverage c_p, self-antigens, nonself-cluster centers c_i, and abnormal points a_i
Output: Detectors
<1>: self ∈ A_g, Detectors = ∅.
<2>: Use the nonself-cluster center c_i as the center of the circle and d_c as the radius to form detectors, and add them into Detectors as the first category of mature detectors.
<3>: Calculate the distance r i between a i and the nearest self-antigen, generate detectors with point a i as the center and r i as the radius, and add them as the second category of mature detectors into Detectors.
<4>: Randomly generate a candidate detector center z i within the set. Calculate the minimum distance r i between z i and the self-antigen, and generate the detector with the point z i as the center of the circle and r i as the radius. Add it as the third category of mature detectors into Detectors.
<5>: Stop generating detectors if c p has been reached.
The termination condition Flag = c_p is calculated according to the statistical hypothesis testing method [13] of formula (7). In Figure 4, hollow dots represent self-antigens with a specified radius, solid dots represent nonself-antigens, large circles represent clustering results, and the regions without self-antigens are nonself-regions. The generated mature detectors should cover as many nonself-regions as possible. Four clusters, namely, nonself-clusters 1, 2, and 3 and self-cluster 1, are calculated via the antigen density clustering algorithm according to the distribution of the antigens. Nonself-clusters 1, 2, and 3 (each of which is defined by a cluster center c_i and the cutoff distance d_c) are selected as the first category of mature detectors according to the cluster discrimination formula (6). Each abnormal point a_i is preferentially used as the center of a candidate detector; the distance between a_i and the nearest self-antigen is then calculated and regarded as R_i. If the detector that is defined by the abnormal point a_i and the radius R_i contains other abnormal points a_j, then a_j is removed, and the detector is selected as the second category of mature detectors. If a randomly generated candidate detector point lies in the self-region or inside another mature detector, it is removed. As illustrated in the figure, a candidate detector center z_i is randomly generated, and the distance r_i to the closest self-antigen is selected as its radius; the resulting detector is used as the third category of mature detectors. Thus, three categories of mature detectors are generated.
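The first two detector categories above can be sketched in a few lines. This is an illustrative helper under our own naming, not the paper's implementation; as in the text, an abnormal point already covered by an earlier abnormal-point detector is dropped.

```python
import numpy as np

def generate_adc_detectors(self_pts, centers, d_c, abnormal_pts):
    """Categories 1 and 2 of ADC-NSA detector generation (sketch):
    each nonself-cluster center becomes a detector of radius d_c, and each
    surviving abnormal point becomes a detector whose radius is its distance
    to the nearest self-antigen."""
    detectors = [(c, d_c) for c in centers]  # category 1: nonself-clusters
    abnormal_dets = []
    for a in abnormal_pts:
        # drop a_j if it lies inside a detector built from an earlier a_i
        if any(np.linalg.norm(a - z) < r for z, r in abnormal_dets):
            continue
        r = np.min(np.linalg.norm(self_pts - a, axis=1))
        abnormal_dets.append((a, r))
    detectors.extend(abnormal_dets)  # category 2: abnormal points
    return detectors
```

Category 3 (random candidates tolerated against self and existing detectors) then fills whatever nonself-region remains until the expected coverage c_p is reached.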

IV. EXPERIMENTAL ANALYSIS
A. DATASETS AND EVALUATION INDICATORS
The datasets that are used in this paper are from the UCI database [21]. The classic BCW and KDD-Cup99 datasets, which are often used for anomaly detection and machine learning, are selected. The BCW dataset originates from breast cancer data provided by medical institutions. This dataset has 2 categories, 9 attributes, 241 abnormal records, and 458 normal records. This experiment standardizes and normalizes the data and divides the BCW dataset into training and testing sets by using a partition function, as shown in Table 1. The KDD-Cup99 dataset originates from 9 weeks of network connection records collected from a LAN. This dataset includes a training dataset and a test dataset. In this experiment, the experimental data are extracted at a ratio of 1:1. Each connection record in the training dataset contains 41 fixed feature attributes and a class identifier. The data features include basic features, network features, and content features.
The extracted data are summarized in Table 2. The evaluation indices are the detection rate (DR) and the false-positive rate (FPR), which are commonly used in the classic binary classification problem:

DR = TP / (TP + FN), FPR = FP / (FP + TN).
In these expressions, TN denotes the number of true negatives, which are correctly recognized as self-antigens; TP denotes the number of true positives, which are correctly recognized as nonself-antigens; FN denotes the number of false negatives, which are incorrectly recognized as self-antigens; and FP denotes the number of false positives, which are incorrectly recognized as nonself-antigens.
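The two indices follow directly from the confusion-matrix counts defined above:

```python
def detection_metrics(tp, fn, fp, tn):
    """DR = TP / (TP + FN): fraction of nonself-antigens correctly detected.
    FPR = FP / (FP + TN): fraction of self-antigens incorrectly flagged."""
    dr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return dr, fpr
```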

B. EXPERIMENTAL DATA PREPROCESSING
This experiment uses the most common z-score normalization (zero-mean normalization) method, also known as standard deviation standardization. This method uses the mean and standard deviation of the original data to standardize the data. The processed data conform to the standard normal distribution, that is, the mean is 0 and the standard deviation is 1, and the conversion function is x* = (x − µ)/σ, where µ is the mean of all sample data and σ is the standard deviation of all sample data. Data normalization has two benefits: (1) it improves the convergence speed of the model, and (2) it improves the accuracy of the model.
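The conversion function x* = (x − µ)/σ, applied per feature column, can be written as:

```python
import numpy as np

def z_score(X):
    """Zero-mean normalization: subtract each column's mean and divide by
    its standard deviation, so every feature has mean 0 and std 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```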
Because the KDD-Cup dataset is large, this paper uses principal component analysis (PCA) to reduce its dimensionality while retaining the features with the largest variance contributions. The main step is to recombine the features into uncorrelated principal components, which are used to represent the original information. This method can effectively reduce the dimensionality of the samples and improve the calculation accuracy [22].
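The dimensionality reduction can be reproduced with scikit-learn's `PCA`, whose parameters match those discussed below. The random matrix here is only a stand-in for the 41 KDD-Cup features; the paper's 99.9% variance setting corresponds to passing 0.999 as `n_components`.

```python
from sklearn.decomposition import PCA
import numpy as np

# Stand-in data: 200 samples with 41 features, like the KDD-Cup records.
rng = np.random.default_rng(0)
X = rng.random((200, 41))

# Keep the smallest number of components whose cumulative explained
# variance reaches 99.9%, copying the input and without whitening.
pca = PCA(n_components=0.999, copy=True, whiten=False)
X_reduced = pca.fit_transform(X)
```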
The main PCA parameters are as follows:
(1) n_components: if a string is assigned, such as n_components = 'mle', the number of features is selected automatically to meet the required percentage of variance; if no value is assigned, the default is None, and the number of features does not change (although the feature data are transformed). In this experiment, the percentage of variance is set to 99.9%; that is, the similarity with the feature data before dimensionality reduction is 99.9%.
(2) copy: True or False; the default is True. This parameter determines whether the original training data are copied.
(3) whiten: True or False; the default is False. This parameter determines whether to whiten the data so that each feature has the same variance.

C. PARAMETER ANALYSIS
The experimental parameters are the best parameters that were identified in the analysis of the experimental results. According to the figure, on the BCW dataset, when c_p is 99% and d_c is 0.35, the detection rate of this algorithm reaches its maximum and the false-positive rate is relatively low: the detection rate is 99.41%, and the false-positive rate is 3.54%.
In the KDD-Cup dataset, the algorithm performs best when c p is 99% and d c is 0.25. The detection rate reaches 99.24%, and the false-positive rate is 3.25%.
The final parameter settings are listed in Table 3:

D. COMPARATIVE ANALYSIS EXPERIMENT
To further evaluate the performance of the algorithm, this paper compares it with three algorithms, namely, RNSA [12], V-Detector [13], and ASSC-NSA [16], on the BCW and KDD-Cup datasets. The statistics are presented in Table 4, and the results of the comparative experiments are presented in Figures 9, 10, 11, and 12. These results are the averages of the detection rates and the false-positive rates that were obtained via multiple experiments with c_p equal to 99%. In this experiment, the detection rate is 99.41% and the false-positive rate is 3.54% on the BCW dataset. On the KDD-Cup dataset, after PCA dimensionality reduction, the detection rate is 99.24% and the false-positive rate is 3.25%. It is concluded that ADC-NSA, which is proposed in this paper, realizes a higher detection rate and a lower false-positive rate than the three compared algorithms.

E. TIME COMPLEXITY ANALYSIS
Based on this experiment, the following assumptions are made: N_s is the number of self-antigens, N_n is the number of nonself-antigens, d is the dimension, M is the number of nonself-clusters, and P is the number of abnormal points. By analyzing definitions 1 through 7, we find that the time complexity of the antigen density clustering algorithm is O((N_s + N_n)^2). Combined with the detector generation algorithm, the overall time complexity of this experiment is O((N_s + N_n)^2 · d). The time complexities of the compared algorithms [11] are shown in Table 5.
As shown in Table 5, the time complexity of ADC-NSA is much lower than the exponential complexity of the traditional RNSA and V-Detector [14]. Under the same conditions, the time complexity of this algorithm is better than that of the improved ASSC-NSA, and as the dimension d increases, it also becomes better than that of the improved ASTC-RNSA. Overall, the time complexity of the proposed algorithm is comparatively low.

V. CONCLUSION
The traditional negative selection algorithm ignores the influence of the antigen distribution on the generation of detectors during the detector generation stage, thereby resulting in low efficiency of detector generation. Therefore, this paper proposes a negative selection algorithm that is based on antigen density clustering (ADC-NSA): First, the clustering algorithm is used to identify high-density regions, and the clustered nonself-clusters are directly used as mature detectors. Second, the abnormal points are selected as the centers of candidate detectors preferentially in low-density regions, and detectors are generated via training. Finally, detectors are generated via the traditional algorithm. The algorithm can effectively generate detectors in regions with various densities; hence, the algorithm has a higher detection rate and a lower false-positive rate.
At present, the algorithm still has two problems: 1. In the clustering step, the selection of the cutoff distance d_c still depends on artificial experience. 2. During detection, how to handle the points that fall into coverage loopholes remains unclear. Future work will study how to select d_c adaptively and how to determine the classification of data points that fall into loopholes.
BING-QIU CHEN received the B.S. degree in information and computing science from Huanggang Normal University, in 2017. He is currently pursuing the M.E. degree in computer technology with Hubei University, Hubei, China. His research interests include artificial immune systems and machine learning.
HAI-YANG WEN received the B.E. degree in software engineering from the Hubei University of Economics, in 2018. He is currently pursuing the M.E. degree in computer technology with Hubei University, Hubei, China. His research interests include artificial immune systems and machine learning.