Online Clustering of Evolving Data Streams Using a Density Grid-Based Method

In recent years, a significant boost in data availability for persistent data streams has been observed. These data streams are continually evolving, with the clusters frequently forming arbitrary shapes instead of regular shapes in the data space. This characteristic leads to an exponential increase in the processing time of traditional clustering algorithms for data streams. In this study, we propose a new online method, which is a density grid-based method for data stream clustering. The primary objectives of the density grid-based method are to reduce the number of distance function calls and to improve the cluster quality. The method is conducted entirely online and consists of two main phases. The first phase generates the Core Micro-Clusters (CMCs), and the second phase combines the CMCs into macro clusters. The grid-based method is utilized as an outlier buffer in order to handle multi-density data and noise. The method was tested on real and synthetic data streams employing different quality metrics and was compared with the popular method of clustering evolving data streams into arbitrary shapes. The proposed method was demonstrated to be an effective solution for reducing the number of calls to the distance function and improving the cluster quality.


I. INTRODUCTION
A prime application of big data is the Internet of Things (IoT) and its emergence is primarily due to the increase in the number of devices connected to the Internet. All these devices are typically outfitted with various sensors that can accumulate large amounts of data in real-time or several times per minute [1]- [6]. The IoT creates enormous possibilities in several industries, such as health, resource consumption and transportation [2], [7]- [11]. In the realm of IoT, data streams are common in many applications, such as for comprehensive web searching, the real-time detection of anomalies within network traffic, social networks, environmental monitoring, cyber-physical systems and sensor networks. In these applications, data evolve significantly over time and continuously arrive [12]- [16].
In fact, large data are produced continually as data streams from many applications [17]-[19]. The considerable amount of data generated by healthcare, social media, devices, sensors and software applications takes three forms, namely, structured, unstructured and semi-structured data [20], [21]. The diagnostics necessary for these forms of applications are frequently real-time; therefore, the procedures employed must be able to deliver real-time results [22], [23]. Moreover, data-mining procedures are exceptionally useful for this type of diagnostic [24]-[26].
The associate editor coordinating the review of this manuscript and approving it for publication was Amir Masoud Rahmani.
Data stream mining is a comparatively innovative approach in the domain of data mining [27]- [32]. The monitoring of environmental sensors, investigations of social network issues and the real-time identification of irregularities in computer network transmissions and web searches are among the many areas where this procedure is used [12], [33], [34].
Evolving clustering is an essential data analysis topic for a wide range of applications, such as the following: evolving clusters can be generated in forecasting weather conditions [4], [35]- [37], in earthquake forecasting software based on the analysis of different sources of data from the Earth [38], [39], in intelligent transportation systems for traffic congestion prediction in smart cities [40], in chemistry for forecasting the results of molecular interactions [41], in network intrusion detections (NIDS) [15], [42], [43], in various software performs, and in predicting stock increases or decreases based on their relations with different time series factors [44].
Clustering plays a significant role in the data-stream mining process [42], [45]-[51]. In recent years, many researchers have proposed density-based data stream clustering algorithms. However, several issues related to these clustering algorithms must be considered [13], [52]-[59]: most are not entirely online methods, are unable to handle evolving data streams, are unable to manage the noisy characteristics of data streams, or suffer from high memory requirements, low processing rates, or the ''curse of dimensionality'' [45], [60]-[63]. Moreover, the existing density-based clustering algorithms have high computational times and low cluster quality for clustering data streams. In particular, these algorithms require a large number of distance function calls in order to calculate the distance between data points and micro-clusters.
In this study, our proposed scheme consists of two fundamental steps. In the initial step, the density grid-based procedure is adopted to generate CMCs, upon the arrival of new data points, in regions of the data space that are unclustered. The preset radius $r_0$ of the CMCs must be suitably sized to support operational objectives. In this method, a simple linear aging process is used to reduce the life of the CMCs and allows unused CMCs to be removed altogether. Each time new data are received, the CMC's life is renewed. In the absence of incoming data, the CMC loses a quantity of energy at each step; if no data are received for a period of time, the CMC's energy reaches zero and the CMC is discarded.
The subsequent step entails the integration of any overlapping CMCs into global clusters. CMCs consist of a kernel region and a shell region. The edge CMCs can be discerned by linking the CMCs whose shell regions overlap the kernel regions of other CMCs. CMCs that do not have at least the local density specified by the user (the minimum number of samples within a radius) remain as separate outlier microclusters. Each macro cluster consists of the graph of intersecting CMCs where the adjacency relations for each CMC are stored as a property of that CMC. For convenience, we call the CMCs in adjacency relations (i.e., intersecting CMCs) edges. Using this graph structure reduces the calculations needed to separate clusters if a cluster dies and breaks a chain graph, resulting in two groups of CMCs being no longer connected.
Motivated by this observation, in this paper, we propose the 'Clustering of Evolving Data streams via a density Grid-based Method' (CEDGM). To the best of our knowledge, this is the first article to present a clustering approach for the evolving nature of clusters that incorporates grid granularity as a data reduction stage, which simplifies the calculation and eliminates the effect of fine-grained data that play no role in the clustering result. Different from [47], our approach performs the clustering operations on the grid-based mapping of the data with adjustable granularity, which allows more efficient operations and avoids outlier effects. Additionally, unlike [48], our approach generates CMCs in an online model, which means that there is no need to store any data before generating the structure of the CMCs; moreover, cluster creation is invoked in the same iteration that updates the CMCs. We rely on various sample speeds and times to analyze the efficiency of the CEDGM. The results demonstrate that the proposed algorithm significantly improves the clustering results compared to Clustering of Evolving Data-streams into Arbitrary Shapes (CEDAS) [47] and Cauchy [64]. Our algorithm also has better clustering quality, scalability, and efficiency than existing methods.
The remainder of this paper is organized as follows. Section 2 presents a review of the previous studies. Section 3 defines the principles and the methodology associated with the CEDGM. Section 4 describes the use of datasets for evaluating the efficiency of the recommended algorithm, and Section 5 provides the conclusions drawn from this study.

II. LITERATURE REVIEW
Online or data stream clustering has attracted the attention of numerous researchers and analysts. In clustering data streams, an important issue is how to process this infinite data that are evolving over time or how to maintain the vast amount of data for later processing [24], [47], [48], [65], [66]. The literature has provided numerous methods that include data stream clustering.
In the field of density-based data stream clustering, DBSCAN [67] is considered a primitive algorithm that generates arbitrarily shaped clusters incrementally and is unsuitable for high-dimensional datasets because it suffers from the curse of dimensionality [68]. DenStream [69] and CluStream [6] are two other clustering algorithms that summarize the data stream information by storing the temporal locality of data in what is called a micro-cluster. Both algorithms are actively applied to evolving data streams. However, DenStream suffers from increased time consumption due to pruning the outlier micro-clusters, whereas CluStream is limited to generating spherical clusters only.
C_DenStream [70], rDenStream [71], SDStream [72], HDDStream [73], and VDStream [74] are all DenStream improvements and are capable of generating arbitrarily shaped clusters. C_DenStream is a semi-supervised algorithm in which an expert in the application defines the constraints, although it is unable to handle limited memory issues and cannot handle high-dimensional data streams. rDenStream is appropriate for applications where many outlier micro-clusters are produced and it improves the clustering accuracy; however, it has high memory requirements and a high processing time due to processing and saving the historical outlier buffer through a relearning step. The sliding window-based SDStream algorithm can handle evolving data streams and noisy data appropriately. However, the maximum number of micro-clusters is predefined, and SDStream is incapable of handling high-dimensional data streams because, like DBSCAN, it suffers from the curse of dimensionality in its offline stage.
Another clustering algorithm, HDDStream, can generate high-quality clusters and handle high-dimensional data streams. HDDStream's efficiency is further enhanced by adapting the fading function [69] in PreDeConStream [73] to detect the evolving data streams. However, PreDeConStream is time-consuming, and although VDStream's clustering accuracy is high, searching for density-reachable micro-clusters requires a high processing time. SOStream [75] is another density-based clustering algorithm that has gained popularity. This algorithm works by adapting the threshold for density-based clustering to sense the structure associated with the evolution of data streams. Moreover, its online phase allows it to dynamically create, remove and merge clusters. SOStream adopts self-organizing maps, a competitive learning technique, and it achieves high-quality clustering while occupying less memory. However, this algorithm suffers from a major drawback, i.e., increased time consumption, thereby making this method unfit for data stream clustering.
D-Stream, a density-based clustering framework first proposed in [76], has been applied to cluster data streams in real time. D-Stream involves online and offline phases: in the online phase, each new data point is read and mapped into a grid, and the grid's characteristic vector is then updated; in the offline phase, the clusters are adjusted at each time interval gap. D-Stream thus has the ability to cluster data streams in real time based on the density and the grid. Handling the outliers is a key motivation for the grid density in D-Stream, which regards the outliers as sporadic grids. A sporadic grid is a sparse grid that possesses few data and cannot be transformed into a dense grid.
However, high-dimensional data cannot be handled by D-Stream since it considers most of the grids to be empty in a high-dimensional setting. Other current clustering methods, such as the active grid density stream (AGD-Stream) [77] and CLIQUE [78], use density grid decaying technology that identifies active grids, thereby creating clusters through active grids. Here, the data space of the AGD-Stream algorithm is segmented into small cube grids, after which the data object is mapped to this structure. AGD-Stream is time efficient and improves as the stream length increases while achieving high-quality clustering. However, high computational times and large amounts of memory are required for AGD-Stream.
For IoT streams, a density-based clustering algorithm that can be used is hybrid density-based clustering for data streams (HDC-Stream) [17]. This algorithm achieves a quicker processing time compared with its predecessors, thus making it appropriate for a real-time application related to IoT devices. The benefits of micro-clustering and density grid-based methods are incorporated into this method, which can handle outliers and detect arbitrary-shaped clusters. Studies have shown that high-quality clustering results, along with a low processing time, are achieved when working with a data stream. However, the clustering of multi-density data is not feasible and is associated with low memory efficiency [17]. Recently, HDC-Stream was improved by a multi-density data stream (MuDi-Stream) method, which was introduced by [48] in order to resolve the issue of the dramatic decrease in clustering quality when dense data exist. This method is regarded as an online-offline algorithm that incorporates four main components. In the online phase, an information summary of the evolving multi-density data stream is stored in the form of core mini-clusters. The final clusters are generated via the offline phase by applying an adapted density-based clustering algorithm. A hybrid method that integrates the micro-clustering and grid is then applied to store information on the data points. A grid-based method is used to map outliers and form new mini-clusters with various radii.
Four key components are included in MuDi-Stream, which are the components for forming the core mini-clusters, pruning the grids and core mini-clusters, merging or mapping, and forming the final clusters. In the online phase, the first three components are applied, and the last component is associated with the offline phase. M-DBSCAN is another density-based clustering algorithm, which was also suggested for the offline phase, and it forms final clusters with different densities for synopsis data. MuDi-Stream achieves high-quality clustering and occupies less memory. However, the abovementioned streams are inappropriate for high-dimensional data because the number of empty grids will increase and thus will result in a slow processing time.
The algorithms, as mentioned earlier, are either hybrid online/offline or incremental clustering processes. Baruah and Angelov developed two online evolving clustering algorithms for data streams called ELM [79] and DEC [4]. These algorithms require low processing time and provide high cluster purity, but they cannot generate arbitrarily shaped clusters. However, the problem associated with the ELM clustering algorithm lies in its inability to identify a sample's neighborhood given the previously discarded samples. Even though DEC can form hyper ellipsoidal cluster shapes, this technique cannot be adapted to data streams but it can be applied to selecting the optimal radius.
A novel framework for clustering the evolving data stream is known as Cauchy [64]. It is an online learning algorithm that can adapt to the classifier online. In addition, the authors present the idea of designing methodologies for the creation of a cyber-attack detection system. To solve the expensive and time-consuming problems of traditional/offline methods, the method was tested on a 1999 KDD intrusion detection database. The results showed a reduction in the cost of labeling the learning data. Since this approach is online, it is much easier and simpler to include definitions for new attacks than with batch learning algorithms. However, similar to other density-based clustering algorithms, the Cauchy method suffers from a major drawback, i.e., it does not implement the cluster merging mechanism that could decrease the number of created clusters and it is inappropriate for high-dimensional data.
The previous approaches have their strengths and weaknesses; this is true for almost any clustering approach. Hence, the concept of ensemble clustering emerged such as with RCESCC [80], WOCCE [81], RCEIFBC [82]. The ensemble clustering approach calls for using more than one clustering approach at the same time and it fuses or aggregates their results in order to achieve more robust performance [80]- [82]. Such a category of approaches can combine our evolving algorithm with other clustering algorithms as the aggregated approach.
Another online clustering algorithm called clustering online data streams into arbitrary shapes (CODAS) [18] was proposed to allow the formation of arbitrary-shaped clusters via the online clustering of data streams. CODAS is a data-driven algorithm that generates micro-clusters in order to summarize data points and creates a high-quality cluster that can be scaled to multidimensional data streams. However, in CODAS, the generated clusters do not evolve. Recently, CODAS was improved by CEDAS [47] by introducing a simple linear aging process to handle the characteristics of evolving data streams. This algorithm is the first fully online clustering algorithm for evolving data streams. CEDAS includes two main stages. In the first stage, micro-clusters are produced, or data are added to the current micro-clusters, which is followed by information adjustment. The second stage handles intersecting micro-clusters, in which the micro-clusters are divided into kernel and shell regions. In this technique, each macro-cluster includes a graph of its intersecting micro-clusters, and the adjacency relations are stored as a property of each micro-cluster. CEDAS immediately provides high-quality clustering results and can also handle the properties of an evolving data stream and noise. However, similar to other density-based clustering algorithms, CEDAS consumes considerable computational time. In the next section, we present our developed methodology.

III. METHODOLOGY
In this study, the CEDGM algorithm is proposed to provide high-quality clusters, detect noise, and determine the characteristics of the data points in an evolving data stream. The proposed algorithm uses the data point information to formulate the CMCs. Moreover, this clustering algorithm is entirely online and it uses a density grid-based method to reduce the number of distance function calls.
The present study conducts clustering based on the density grid. This mechanism forms grids by splitting the data space into small segments. A neighbor search is then performed on the grids to group them into cluster grids. The density grids and data distribution are displayed in Fig. 1. Compared with other clustering algorithms, grid-based clustering provides a fast processing time since it does not depend on the number of data objects but rather on the number of cells in each dimension. This method is very successful for high-density datasets and is robust against noise, with almost linear time complexity and the ability to identify distinct, arbitrarily shaped clusters.
In the CEDGM, each CMC of radius $r_0$ contains a kernel region, $r \le r_0/2$, and a shell region that extends from the kernel out to $r_0$. Macro-clusters are formed where the shell region of one CMC intersects the kernel region of another CMC. The CMCs with a density that exceeds the minimum threshold but with no intersections are also considered macro-clusters. A new data point from the data stream falls into one of three regions. First, if the data point falls in the empty space of a grid granularity, it creates a new outlier. Second, if the data point falls in the shell region of a CMC, it is assigned to the cluster, and the CMC center and cluster count are recursively updated. Third, if the data point falls in a kernel region, it is allocated to the CMC and the cluster count is updated. The created or modified CMC is examined to determine if the cluster density is greater than the minimum threshold. This CMC is then examined for new intersections with other CMCs. When new intersections are created, these CMCs are linked and assigned to the same macro-cluster. All connected CMCs must have the same macro-cluster and create an arbitrarily shaped cluster in an online manner.
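The three cases above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MicroCluster` record, the `assign` function, and the running-mean form of the recursive center update are assumptions made for illustration, and the grid lookup and the outlier/CMC distinction are omitted for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    # Hypothetical record: a center and the number of points absorbed so far.
    center: list
    count: int = 1


def assign(point, clusters, r0):
    """Place `point` and report which region it fell into.

    Returns 'kernel', 'shell', or 'new'. A linear scan stands in for the
    grid-based lookup described in the text.
    """
    nearest, best = None, math.inf
    for c in clusters:
        d = math.dist(point, c.center)
        if d < best:
            nearest, best = c, d

    if nearest is None or best > r0:
        # Empty space: start a new unit (an outlier until it is dense enough).
        clusters.append(MicroCluster(center=list(point)))
        return "new"

    nearest.count += 1
    if best <= r0 / 2:
        # Kernel region: only the count changes.
        return "kernel"

    # Shell region: move the center with a running-mean style update
    # weighted by the current count (assumed form of the recursive update).
    n = nearest.count
    nearest.center = [m + (x - m) / n for m, x in zip(nearest.center, point)]
    return "shell"
```

Keeping the center fixed for kernel hits means that only points near the boundary can pull a micro-cluster toward new data, which is one plausible reading of the kernel/shell distinction.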

A. PROBLEM FORMULATION
We assume that we have time-series data $(x_t, y_t)$, where $t = 1, 2, \ldots, T$, $x_t = (x_{t1}, x_{t2}, \ldots, x_{tm}) \in \mathbb{R}^m$, $y_t \in \{1, 2, \ldots, Nc_t\}$, and $Nc_t$ denotes the number of clusters at moment $t$. The clusters $C_i$ are defined as $C_i = \{C_i^t\}$, which means that each cluster is defined by the points that belong to it at each time unit; in other words, the shape of the cluster changes with time. The problem seeks to partition the data into its corresponding clusters so as to obtain the least difference between the predicted clusters and the actual ones. That is, it seeks to find the predicted $y_t^p$ for each $x_t$ such that $y_t^p = y_t$ for most samples. The evolving clustering problem differs from the classical one in its dynamics: in classical clustering, a cluster $C_i$ does not depend on time, whereas in the evolving problem it does, $C_i = \{C_i^t\}$.

B. PRELIMINARIES
In this section, we introduce the CEDGM. The following terminologies are used in the CEDGM algorithm.

1) GRAPH OF CLUSTERS
This structure describes how the macro-clusters are built from intersecting CMCs. An edge records the intersection between two CMCs, together with the corresponding macro-cluster assignment in 'Macro'. Two CMCs are connected by an edge if the kernel region of one CMC intersects the shell region of the other. Mathematically, two CMCs with radii $R_1$ and $R_2$ are considered intersected if the distance $d$ between their centers is less than or equal to the intersecting distance, i.e., $d \le R_1 + R_2/2$. An example of the relationship between the graph structure and CMCs is illustrated in Fig. 2(a).
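As a rough illustration of the intersection rule, the following Python sketch tests two CMCs against the intersecting distance $R_1 + R_2/2$ and builds an adjacency structure for a set of equal-radius CMCs. The function names and the dictionary-of-sets representation are assumptions, not part of the CEDGM description.

```python
import math


def intersects(c1, r1, c2, r2):
    """True if the shell of one CMC reaches the kernel of the other,
    i.e. the center distance is at most r1 + r2 / 2."""
    return math.dist(c1, c2) <= r1 + r2 / 2


def build_edges(centers, radius):
    """Adjacency sets for CMCs that all share the same radius.

    Returns a dict mapping each CMC index to the set of indices it
    intersects (a hypothetical stand-in for the per-CMC edge list).
    """
    edges = {i: set() for i in range(len(centers))}
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if intersects(centers[i], radius, centers[j], radius):
                edges[i].add(j)
                edges[j].add(i)
    return edges
```

With a common radius of 1, two centers 1.2 apart are linked, while a center 5 away from both stays isolated.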

2) CORE MICRO-CLUSTERS (CMCS)
CMCs are the primary entities that construct a cluster. A CMC is a data structure with two properties: the center and the density. All CMCs have the same radius according to the minimum allowable density. At time $t$, a CMC is described as a set of close data points $X_1, X_2, X_3, \ldots, X_{N_t}$ in a high-density area wherein the local density $N_t$ is equal to or exceeds the threshold ($N_t \ge Th_{density}$). An example of the data points and CMCs is depicted in Fig. 2(b).

3) OUTLIERS
Outliers are defined as a group of one or more data points $X_1, X_2, X_3, \ldots, X_{N_t}$ in a low-density area at time $t$, where the local density $N_t$ is less than a predefined threshold ($N_t < Th_{density}$).

4) SAMPLE
A sample is a streaming data point in a $d$-dimensional space.

5) GRID DIMENSION
The grid dimension is the number of sub-segments considered in a certain dimension to define the coordinates of this dimension. Assume that in dimension $d$, the minimum value is $V_{min}$, and the maximum value is $V_{max}$. Then, the resolution is defined as given in Eq. (1).
Graph structure of the CEDGM algorithm and the core micro-clusters. The graph structure with subgraph nodes is demonstrated in Fig. 1(a). The data with core micro-clusters are exhibited in Fig. 1(b).
The resolution is calculated using Eq. (1). To apply it, we first set the value of the grid granularity, which is selected through a tuning process. A smaller grid granularity implies more calculations and sensitivity to low-frequency noise, while a higher grid granularity implies less sensitivity to changes in the data before clustering. Hence, we set a suitable value depending on the data. In the experiment, we selected a grid granularity of 30.
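To make the role of the grid granularity concrete, the sketch below maps a data point to integer grid coordinates. It assumes, purely for illustration, that each dimension is split into equal-width sub-segments between $V_{min}$ and $V_{max}$; the function name `grid_cell` and this uniform-width reading of the resolution are assumptions rather than the paper's definition.

```python
def grid_cell(point, v_min, v_max, granularity):
    """Map a point to integer grid coordinates.

    Assumes each dimension is divided into `granularity` equal-width
    sub-segments between its minimum and maximum value (an illustrative
    assumption). Values on or beyond the upper bound are clamped into
    the last cell.
    """
    cell = []
    for x, lo, hi in zip(point, v_min, v_max):
        idx = int((x - lo) * granularity / (hi - lo))
        cell.append(min(max(idx, 0), granularity - 1))
    return tuple(cell)
```

For example, with a granularity of 30 on the unit square, the point (0.5, 0.0) would land in cell (15, 0), and the corner (1.0, 1.0) would be clamped into cell (29, 29).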

C. DESCRIPTION OF THE PROPOSED CEDGM ALGORITHM
Before the proposed CEDGM algorithm is run, a few application-dependent parameters must be specified based on expertise in the application, as in other density-based clustering algorithms, such as CluStream, DenStream, Cauchy, CODAS, DEC, MuDi-Stream and CEDAS. The CEDGM algorithm requires several parameters in order to run.
The parametric values depend on the applications as follows:
1. Decay: This parameter denotes how many of the most recent samples we consider for processing at the current time $t$. If the decay is set to $N$ and we are at moment $t$, then we consider samples $t, t-1, t-2, \ldots, t-N+1$.
2. Fade: It indicates the time after which a CMC is removed if no new point or sample has been added to it. It is calculated as the inverse of the decay, as shown in Eq. (2):
$$FD = \frac{1}{DC} \qquad (2)$$
where FD denotes the fade, and DC denotes the decay.
3. Radius: This parameter is the maximum allowable distance between a sample and the center of a CMC for the sample to still belong to that CMC. Otherwise, the sample either belongs to an existing outlier or creates a new outlier.
4. Minimum Threshold: This parameter is the minimum number of data points that are required to form a CMC or convert an existing outlier to a CMC.
5. Grid Granularity: It is an important parameter for the computational time of the CEDGM algorithm, and it represents the difference in the counting structure nodes used for grids.
The CEDGM is a new algorithm for discovering the clusters of evolving data streams in multi-density environments. This algorithm maintains a summary of information on evolving data streams in the form of CMCs. A grid-based method is used as an outlier buffer to handle noise and multi-density data and to reduce the number of distance function calls. After setting the application parameters, the proposed CEDGM algorithm is executed on the data stream $\{X_1, X_2, X_3, \ldots, X_m\}$ using the following steps:
• Assign the core micro-clusters,
• Kill the weak core micro-clusters, and
• Update the cluster graph.
The characterization of each step is provided next. For each data sample, the algorithm executes the three steps sequentially.
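The parameters listed above can be gathered into a small configuration object, as in the following Python sketch. The class name `Params`, the field names, and the helper `active_window` are hypothetical; the default values are the ones reported for the experiments in Section IV, and the fade is derived as the inverse of the decay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    # Defaults mirror the experimental settings reported in Section IV.
    decay: int = 1000          # number of most recent samples considered
    radius: float = 0.05       # CMC radius
    min_threshold: int = 4     # points needed to form a CMC
    grid_granularity: int = 30

    @property
    def fade(self):
        """Fade as the inverse of the decay."""
        return 1.0 / self.decay


def active_window(t, decay):
    """Sample indices t, t-1, ..., t-decay+1, clipped so they start at 1."""
    return range(max(1, t - decay + 1), t + 1)
```

With a decay of 1,000 samples the fade evaluates to 0.001, so under a linear aging scheme an untouched CMC would lose one thousandth of its initial life per sample.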

1) ASSIGN THE CORE MICRO-CLUSTERS
This part of the algorithm uses a density grid-based method in which the data space is partitioned into small segments called grids. The segments and intersection points in the standard grid are called cells and nodes, respectively. Each data point $X_i$ in the data stream is mapped into a grid, and the grids are clustered based on their density. When a new data point arrives, the algorithm determines the CMCs and outliers in the grid and its neighbors.
Step 1 shows the process for assigning the core micro-clusters.
Next, the algorithm checks whether the incoming data point belongs to any existing CMC or outlier. If the distance $d$ between the point and the nearest outlier or CMC is less than the radius, as expressed in Eq. 3, then the CMC or outlier is updated and the grid coordinates are determined; otherwise, a new outlier is created. Further verification is conducted to determine whether the updated unit is an outlier and whether its count is larger than the minimum threshold, in which case the algorithm promotes the outlier to a CMC. The new CMC must be assigned to the cluster of the nearest CMC in the shell by determining the edges, after which the cluster graph is updated (step 3). Fig. 3 shows the flowchart for assigning the core micro-clusters.
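The assignment step can be sketched in Python as follows. This is a simplified illustration rather than the CEDGM implementation: the `GridIndex` class, its dictionary of buckets (used here in place of the tree mentioned later in the text), and the requirement that the cell width be at least the radius are all assumptions. The sketch counts distance function calls to show how restricting the search to a cell and its neighbors avoids comparisons with distant units; enumerating all neighboring cells is only practical in low dimensions.

```python
import math
from itertools import product


class GridIndex:
    """Hypothetical outlier/CMC buffer bucketed by grid cell.

    Assumes cell_width >= radius, so any unit within `radius` of a point
    lies in the point's cell or one of its immediate neighbors.
    """

    def __init__(self, cell_width, radius, min_threshold):
        self.cell_width = cell_width
        self.radius = radius
        self.min_threshold = min_threshold
        self.buckets = {}          # cell -> list of units (dicts)
        self.distance_calls = 0    # bookkeeping for the comparison metric

    def cell_of(self, point):
        return tuple(math.floor(x / self.cell_width) for x in point)

    def insert(self, point):
        cell = self.cell_of(point)
        best, best_d = None, math.inf
        # Search only the point's cell and its neighboring cells.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for unit in self.buckets.get(neighbor, []):
                self.distance_calls += 1
                d = math.dist(point, unit["center"])
                if d < best_d:
                    best, best_d = unit, d

        if best is not None and best_d < self.radius:
            best["count"] += 1
            # Promote an outlier once its count exceeds the threshold.
            if not best["is_cmc"] and best["count"] > self.min_threshold:
                best["is_cmc"] = True
            return best

        # No unit close enough: create a new outlier in this cell.
        unit = {"center": tuple(point), "count": 1, "is_cmc": False}
        self.buckets.setdefault(cell, []).append(unit)
        return unit
```

A point that arrives far from every stored unit triggers no distance calls at all in this sketch, because none of its neighboring cells contains a unit.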

2) KILL THE WEAK CORE MICRO-CLUSTERS
This part of the algorithm reduces the life of the CMCs and removes them when their life falls below zero. The life of the CMCs is reduced using the fade. When a CMC is removed, all edges that refer to it are also removed, and the total number of CMCs is decreased.
Step 2 shows the process for killing the weak core micro-clusters. Fig. 4 shows the flowchart for killing the core micro-clusters.
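A minimal Python sketch of this aging step is given below. The function name, the dictionary-based storage, and the choice of 1.0 as the renewed life value are assumptions made for illustration; the text only specifies that life is reduced by the fade and that CMCs with life below zero are removed together with their edges.

```python
def age_and_prune(life, edges, fade, touched=None):
    """Apply one step of linear aging and remove dead CMCs.

    life    : dict mapping CMC id -> remaining life
    edges   : dict mapping CMC id -> set of neighboring CMC ids
    fade    : amount of life lost per step by CMCs that received no data
    touched : id of the CMC that absorbed the current sample, if any;
              its life is renewed to 1.0 (an assumed initial value)

    Returns the list of removed CMC ids.
    """
    for cid in list(life):
        if cid == touched:
            life[cid] = 1.0
        else:
            life[cid] -= fade

    dead = [cid for cid, value in life.items() if value < 0]
    for cid in dead:
        del life[cid]
        # Drop the dead CMC and every edge that refers to it.
        for neighbor in edges.pop(cid, set()):
            edges.get(neighbor, set()).discard(cid)
    return dead
```

Because removing a CMC also clears the edges that point to it, the surviving adjacency sets can be passed directly to the graph update step.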

3) UPDATE THE CLUSTER GRAPH
Like other density-based algorithms, such as CEDAS and CODAS, a clustering graph is maintained in order to create online macro-clusters. The clustering graph changes if either of the following occurs.
• A new link has been created (when a new CMC arrives or when old CMCs move). For each new link connecting two CMCs from different clusters, the link ID is added to the cluster table.
• An old link has been removed (when a CMC is removed or moved).
Step 3 shows the process for updating the cluster graph. Changes are made for any CMC that has been modified, either because its center location has moved or because it has newly reached the threshold. In this case, the graph edges may have changed. If the edge list has changed, then the new graph has its macro-cluster number set to a new value. Fig. 5 shows the flowchart for updating the cluster graph.
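The effect of adding or removing links on the macro-clusters can be illustrated with a connected-components pass over the CMC graph, as in the Python sketch below. This is a simplification: it relabels the whole graph on every call, whereas the text indicates that only modified CMCs are re-examined. The function name and the adjacency-set input format are assumptions.

```python
def label_macro_clusters(edges):
    """Assign a macro-cluster label to every CMC.

    edges : dict mapping CMC id -> set of neighboring CMC ids
            (every neighbor must itself be a key of `edges`)

    CMCs connected through a chain of edges share a label; labels are
    consecutive integers starting at 0, assigned in sorted id order.
    """
    labels = {}
    next_label = 0
    for start in sorted(edges):
        if start in labels:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for neighbor in edges[node]:
                if neighbor not in labels:
                    labels[neighbor] = next_label
                    stack.append(neighbor)
        next_label += 1
    return labels
```

For a chain of three CMCs, all three share one label; if the middle CMC dies and its edges are removed, the two remaining CMCs receive different labels, which corresponds to a macro-cluster splitting in two.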

D. COMPUTATIONAL COMPLEXITY
Computational time mainly involves two sub-processes, i.e., cluster assignment and CMC update. Cluster assignment is performed based on the Euclidean distance between the arriving data point and the current CMC center. Its time complexity is O(ND), where N is the total number of data points clustered and D is the number of dimensions.
This algorithm checks whether the arriving data point belongs to any existing CMC or outlier. Otherwise, the data point is mapped to the grid and a new outlier is created. In the CEDGM algorithm, the grid is implemented as a tree that allows fast lookup, update, and deletion. The key feature

IV. RESULTS AND DISCUSSION
This section analyses and compares the performance of the CEDGM algorithm with those of CEDAS [47], Cauchy [64], MR-Stream [60], WOCCE [81], RCESCC [80], RCEIFBC [82], DBSCAN [67], and CLIQUE [78]. These algorithms are implemented in MATLAB R17a, and their performances are evaluated on a PC with an Intel Core i5 processor @ 2.66 GHz and 16.0 GB of RAM. The parameters of the proposed and other algorithms are Decay = 1000, Minimum Threshold = 4 data points, Radius = 0.05, and Grid Granularity = 30. This experiment aims to verify the average number of distance function calls, average purity and average accuracy across sample speeds and times. For logical samples of numbers, the decay is set to ensure appropriately sized macro-clusters for demonstrating the efficiency of the technique. The minimum threshold of CMCs is set to 4. The radius is kept small to ensure multiple CMCs. The grid granularity is set to 30; a higher grid granularity causes a higher number of children for each node in the tree, which leads to better results. We introduce the datasets that are used to determine the efficiency of the suggested algorithm in handling evolving data streams. Both real and synthetic datasets are used to assess the proposed algorithm.
A network intrusion detection dataset (KDDCUP'99) is a real dataset for testing the performance of evolving clustering algorithms and contains TCP connection logs from 2 weeks of LAN traffic. The dataset comes from the 1998 DARPA Intrusion Detection dataset. It includes training data consisting of 7 weeks of network-based intrusions inserted in the normal data and 2 weeks of network-based intrusions and normal data for 4,999,000 connection records described by 42 characteristics. Each record corresponds to either a normal connection or an attack. All 34 continuous attributes of the KDD CUP '99 are used, as in [46]- [48], [69], [75], [83]. A conversion of the dataset into data streams is conducted by taking the data input order as a streaming order. In addition, three datasets were used from the UCI machine learning repository [84] i.e., Half-Ring has 373 data points, Iris has 150 data points, and Galaxy has 323 data points. Fig. 6 plots the synthetic datasets used called DS1 and Spiral. DS1 has 9,199 data points [85]- [87], and Spiral has 6,012 data points. The class labels are known for all the datasets used in our experiments. Therefore, the quality of the clustering obtained is assessed by considering outstanding external criteria, namely, the average number of distance function calls, the average purity and the average accuracy.
1. Purity: For each cluster, purity is the number of data points of the most frequent class divided by the total number of data points in that cluster [47], [88], [89]. Purity has been used for the clustering analysis of data streams in various studies, and the average purity is defined as:

$$\text{Purity} = \frac{1}{N} \sum_{i=1}^{N} \frac{n_i^d}{n_i} \times 100\%$$

where $n_i^d$ is the dominant class sample, $n_i$ is the number of samples that a cluster contains, and $N$ is the number of clusters.
2. Accuracy: Accuracy is used to evaluate the data stream clustering and can be defined as the number of samples in a cluster that belong to that cluster and do not belong to any other cluster [47], [88], [89]. Its definition uses the same quantities: $n_i^d$ is the dominant class sample, $n_i$ is the number of samples that a cluster contains, and $N$ is the number of clusters.

3. Normalized Mutual Information (NMI): NMI is derived from entropy in information theory and measures the mutual dependence between $P^*$ and $L^t$ [80]-[82]. The NMI is defined as follows:

$$\text{NMI}(P^*, L^t) = \frac{\sum_{i}\sum_{j} n_{ij} \log\left(\frac{n \, n_{ij}}{n_i^* \, n_j^t}\right)}{\sqrt{\left(\sum_{i} n_i^* \log\frac{n_i^*}{n}\right)\left(\sum_{j} n_j^t \log\frac{n_j^t}{n}\right)}}$$

where $P^*$ is the consensus partition, $L^t$ is the ground truth of the dataset, $n$ is the total number of data points in the given dataset $X$, $n_{ij}$ is the number of data points in the intersection of the $i$th cluster of $P^*$ and the $j$th cluster of $L^t$, and $n_i^*$ and $n_j^t$ are the number of data points in the $i$th cluster of $P^*$ and the number of data points in the $j$th cluster of $L^t$, respectively.
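As a minimal sketch of how the purity and NMI measures described above can be computed from predicted cluster assignments and known class labels, consider the following Python code. It is not the authors' MATLAB implementation. The function names `average_purity` and `nmi` are hypothetical, and the NMI is assumed to use the common square-root normalisation over the two partitions.

```python
import math
from collections import Counter


def average_purity(clusters, labels):
    """Mean over clusters of (dominant-class count / cluster size), in percent."""
    members = {}
    for c, y in zip(clusters, labels):
        members.setdefault(c, []).append(y)
    ratios = []
    for ys in members.values():
        n_i = len(ys)                              # samples in this cluster
        n_i_d = Counter(ys).most_common(1)[0][1]   # samples of the dominant class
        ratios.append(n_i_d / n_i)
    return 100.0 * sum(ratios) / len(ratios)


def nmi(clusters, labels):
    """NMI between two partitions (assumed square-root normalisation)."""
    n = len(labels)
    n_ij = Counter(zip(clusters, labels))   # intersection counts
    n_i = Counter(clusters)                 # cluster sizes in the first partition
    n_j = Counter(labels)                   # cluster sizes in the second partition
    mutual = sum(c * math.log(n * c / (n_i[i] * n_j[j]))
                 for (i, j), c in n_ij.items())
    h_i = sum(c * math.log(c / n) for c in n_i.values())
    h_j = sum(c * math.log(c / n) for c in n_j.values())
    if h_i == 0 or h_j == 0:
        return 0.0  # convention chosen here for a single-cluster partition
    return mutual / math.sqrt(h_i * h_j)


# Toy illustration only; these labels are not taken from the paper's datasets.
pred = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
toy_purity = average_purity(pred, truth)
toy_nmi = nmi(pred, truth)
```

In the toy example, the first cluster holds two samples of class "a" and one of class "b", and the second cluster is pure, so the average purity is the mean of 2/3 and 1, expressed as a percentage.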

A. NETWORK INTRUSION DETECTION
The comparisons between the CEDGM, CEDAS, Cauchy, and MR-Stream algorithms on the network intrusion detection data stream are displayed in Fig. 7. The outcomes are calculated at different sample speeds and times, where the parameters of the proposed and other algorithms are Decay = 1,000 samples, Radius = 0.05, Minimum Threshold = 4 and Grid Granularity = 30. The average number of distance function calls is compared at different sample speeds and times on the high-dimensional KDDCUP'99 dataset, which is precisely the same test dataset used in [47]. When the sample speed varies from 5 points per second (PPS) to 25 PPS, the average number of distance function calls increases, as illustrated in Fig. 7(a). The CEDGM algorithm consistently has a lower average number of distance function calls than the CEDAS algorithm. At sample speeds of 15 and 25 PPS, the CEDGM requires on average 1,030 and 1,472 distance function calls, compared with 1,201 and 1,665 for the CEDAS algorithm, respectively.
Similarly, when the time varies from 100 s to 500 s, the average number of distance function calls increases, as illustrated in Fig. 7(d), and that of the CEDGM remains lower than that of the CEDAS. For example, at times of 100 s and 500 s, the average numbers of distance function calls for the CEDGM are 6,396 and 13,448, compared with 8,444 and 17,108 for the CEDAS algorithm, respectively. The CEDAS algorithm can detect outliers and arbitrarily shaped clusters; however, when checking the membership of an arriving data point against the current CMCs, CEDAS calculates the distance between the new data point and all other visibility agents, so many Euclidean distance calculations are required. By contrast, the CEDGM algorithm has a lower average number of distance function calls because grid-based density algorithms not only locate outliers and arbitrarily shaped clusters but also achieve fast processing times: their cost depends not on the number of data objects but on the number of cells in the quantized space in each dimension.
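The saving in distance function calls described above can be illustrated with a small Python sketch. It is only an illustration of the general grid idea under stated assumptions, not the CEDGM procedure itself: the class `GridIndex` and its methods are hypothetical, the data are assumed to be scaled to [0, 1] in two dimensions, and micro-cluster centres are simply bucketed by grid cell so that an arriving point is compared only with centres in cells that can lie within the radius.

```python
import math
from itertools import product


class GridIndex:
    """Hypothetical helper: micro-cluster centres bucketed by grid cell."""

    def __init__(self, granularity, radius):
        self.width = 1.0 / granularity               # data assumed scaled to [0, 1]
        self.radius = radius
        self.reach = math.ceil(radius / self.width)  # cells to scan per axis
        self.cells = {}
        self.distance_calls = 0

    def _cell(self, point):
        return tuple(math.floor(x / self.width) for x in point)

    def add(self, centre):
        self.cells.setdefault(self._cell(centre), []).append(centre)

    def _dist(self, a, b):
        self.distance_calls += 1                     # count every distance call
        return math.dist(a, b)

    def find_grid(self, point):
        """Compare only with centres in cells that can lie within the radius."""
        home = self._cell(point)
        offsets = range(-self.reach, self.reach + 1)
        for delta in product(offsets, repeat=len(point)):
            cell = tuple(h + d for h, d in zip(home, delta))
            for centre in self.cells.get(cell, []):
                if self._dist(point, centre) <= self.radius:
                    return centre
        return None

    def find_brute(self, point):
        """Baseline: compare with every stored centre."""
        for centres in self.cells.values():
            for centre in centres:
                if self._dist(point, centre) <= self.radius:
                    return centre
        return None


# Toy setup: 100 centres on a regular lattice (illustrative values only).
index = GridIndex(granularity=30, radius=0.05)
for i in range(10):
    for j in range(10):
        index.add((i / 10 + 0.05, j / 10 + 0.05))

query = (0.93, 0.93)
hit_grid = index.find_grid(query)
grid_calls = index.distance_calls
index.distance_calls = 0
hit_brute = index.find_brute(query)
brute_calls = index.distance_calls
```

Both searches return the same centre, but the grid search touches only the centres stored in the few cells around the query point. Note that this sketch scans a neighbourhood whose size grows with the number of dimensions, so it is meant for low-dimensional illustration only.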
The comparison between the CEDGM, CEDAS and Cauchy algorithms in terms of the average clustering purity is depicted in Fig. 7(b). The average purity, as defined in Eq. (4), is determined by the sample speed. Our proposed algorithm has a higher purity than the CEDAS and Cauchy algorithms. For example, one cluster appears in one window at speeds of 5 and 25 PPS, and the CEDGM algorithm achieves average purity values of 99.88% and 99.83%, respectively. This result indicates that nearly all samples are correctly assigned to the predominant clusters. The average purity values of the CEDAS algorithm at sample speeds of 5 and 25 PPS are 99.63% and 99.47%, respectively, and those of Cauchy are 98.97% and 99.05%, respectively, owing to a few misallocated samples in clusters with few members. Fig. 7(e) presents the results of the mean purity analysis over time, using the 25 s period. We use this evaluation, which is favored by Hyde, and observe that the average purity of the CEDGM surpasses those of the CEDAS, MR-Stream and Cauchy. At the two periods of 50 s and 150 s, the average purity values of the CEDGM are 100% and 99.29%, compared with 98.29% and 96% for the CEDAS, 90% and 82.05% for MR-Stream, and 98.66% and 98.75% for Cauchy, respectively. We then examine the results at periods of 475 s and 500 s and find that the average purity of the CEDGM is 100%.
The following reasons contribute to the improved performance of the CEDGM algorithm. 1. A new CMC that can capture the characteristics of a mixed data object and is accurately distributed is introduced. This feature makes the cluster purity of the CEDGM algorithm more accurate. 2. The CMC maintenance mechanism of the online CEDGM algorithm can remove the outliers in time and cluster the potential CMCs, thereby enhancing the clustering purity.
The experimental results of the average clustering accuracy of the CEDGM, CEDAS and Cauchy algorithms are exhibited in Fig. 7(c). The proposed algorithm outperforms the CEDAS and Cauchy algorithms in terms of the average clustering accuracy. When the sample speed varies from 5 to 25 PPS, the average accuracy of the CEDGM is constantly higher than 91%, whereas those of the CEDAS and Cauchy algorithms are less than 89%. When the sample speed is 25 PPS, the average accuracy of the CEDGM algorithm is 92.02%, whereas that of the CEDAS algorithm is 86.34% and that of Cauchy is 73.08%. The results of the 25 s period, as displayed in Fig. 7(f), are also used as a reference. For the period of 25 s to 125 s, the average accuracy of the CEDGM algorithm is 100%, compared with less than 78% for CEDAS and Cauchy. The CEDGM algorithm is based on the density grid of the enhanced clustering algorithm. When calculating the density of grid cells, the impact of the boundary data points on the grid is accounted for by calculating the impact coefficient of the added data points on the adjacent grid cells. Thus, these boundary data points are not treated as noise points, and this characteristic improves the clustering accuracy. Table 1 includes the performances of different state-of-the-art methods compared with the proposed method in terms of the sample speed using the "KDD CUP'99" dataset as validated by the paired t-test. Meanwhile, Table 2 includes the performances of different state-of-the-art methods compared with the proposed method in terms of time using the "KDD CUP'99" dataset as validated by the paired t-test.

B. SPIRAL DATASET
When the sample speed varies from 15 to 100 PPS, as presented in Fig. 8(a), and the 25 s period is used, as illustrated in Fig. 8(d), the CEDGM algorithm outperforms the CEDAS algorithm regarding the mean number of distance function calls. At sample speeds of 15 PPS and 100 PPS, the average numbers of distance function calls of the proposed algorithm are 306.06 and 315.13, respectively, compared with 377.39 and 370.22 for CEDAS, respectively. At a period of 175 s, the CEDGM reaches 389.72, whereas the CEDAS reaches 503.18. We conclude that the CEDGM algorithm yields a lower average number of distance function calls than the CEDAS algorithm for the same reason that reduces the average number of distance function calls in Section 4.1.
The experimental results of the CEDGM, CEDAS and Cauchy algorithms in terms of the average clustering purity are depicted in Figs. 8(b) and (e). We test them on the same 'spiral' evolving data stream. The average purity, as defined in Eq. (1), is determined by sample speed and time. Our proposed algorithm has a higher purity than the CEDAS and Cauchy algorithms. At a speed of 15 PPS, the CEDGM algorithm achieves an average purity of 83.47%, whereas CEDAS reaches 71.89% and Cauchy reaches 75.05%. Regarding the average purity at periods from 25 s to 225 s, both algorithms achieve 100% from 25 s to 100 s; meanwhile, at a period of 225 s, the CEDGM reaches 61.15%, whereas the CEDAS reaches 50.02% and Cauchy reaches 59.52%. Thus, nearly every sample is appropriately allocated to the predominant clusters. We conclude that the CEDGM algorithm obtains better average clustering purity than the CEDAS and Cauchy algorithms for the same reasons that improve the average clustering purity in Section 4.1.
Figs. 8(c) and (f) depict the average accuracy results of the CEDGM, CEDAS and Cauchy algorithms on the 'spiral' evolving data stream. The average accuracy, as defined in Eq. (2), is determined by the sample speed and time. When the sample speed varies from 15 to 100 PPS, as demonstrated in Fig. 8(c), our proposed algorithm exceeds the CEDAS and Cauchy algorithms in terms of the average accuracy. At a sample speed of 15 PPS, the average accuracy of the CEDGM algorithm is 77.58%, whereas that of CEDAS is 71.87% and that of Cauchy is 72.73%. At a sample speed of 50 PPS, the clustering accuracy of the CEDGM is 79.26%, whereas that of the CEDAS is 74.99% and that of Cauchy is 68.75%. The results of the 25 s period, as exhibited in Fig. 8(f), are also adopted. At a period ranging from 25 s to 225 s, both algorithms achieve 100% from 25 s to 100 s; meanwhile, for the period of 225 s, the CEDGM reaches 52.48%, whereas the CEDAS reaches 49.99% and Cauchy reaches 50.33%. This result confirms that the CEDGM algorithm has better average clustering accuracy than the CEDAS and Cauchy algorithms for the same reasons that improve the average clustering accuracy in Section 4.1. Table 3 includes the performances of different state-of-the-art methods compared with the proposed method in terms of sample speed using the "Spiral" dataset as validated by the paired t-test. In addition, Table 4 includes the performances of different state-of-the-art methods compared with the proposed method in terms of time using the "Spiral" dataset as validated by the paired t-test.
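The comparisons in the tables are validated by the paired t-test. As a minimal sketch of how the paired t statistic is obtained from two methods measured under the same conditions, the following Python code may help. The function name `paired_t_statistic` is hypothetical, and the sample values are placeholders chosen for illustration; they are not results from this paper, and the code is not the authors' implementation.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]   # per-condition differences
    sd = stdev(diffs)                       # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder scores for two hypothetical methods under three conditions.
method_a = [3.0, 5.0, 7.0]
method_b = [1.0, 2.0, 3.0]
t_value, dof = paired_t_statistic(method_a, method_b)
```

The resulting statistic is then compared against the t distribution with the returned degrees of freedom to judge whether the difference between the two methods is significant.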

C. DENSITY DATASET (DS)
A comparison between the CEDGM, CEDAS and Cauchy algorithms on the DS is displayed in Fig. 9. The outcomes are calculated at different sample speeds and times, where the parameters of the proposed and other algorithms are Decay = 500 samples, Radius = 0.05, Minimum Threshold = 4 and Grid Granularity = 30. The CEDGM has a lower average number of distance function calls on this dataset. For example, when the sample speed varies from 5 to 25 PPS, as presented in Fig. 9(a), the average number of distance function calls of the CEDGM is less than 63, whereas that of the CEDAS exceeds 63. When the time varies from 25 s to 125 s, as illustrated in Fig. 9(d), the average number of distance function calls is constantly lower for the CEDGM algorithm than for the CEDAS algorithm. Therefore, the CEDGM algorithm performs better in terms of the average number of distance function calls, for the same reason given for the density grid-based method in Section 4.1.
Figs. 9(b) and (e) depict the average purity results of the CEDGM, CEDAS and Cauchy algorithms on the same DS. The average purity, as defined in Eq. (1), is determined by the sample speed and time. When the sample speed varies from 5 to 25 PPS, the average purity of the CEDGM is higher than 98%. When the time varies from 25 s to 150 s, the average purity of the CEDGM is higher than 99%. The CEDAS achieves 100% for the time from 25 s to 50 s, but its purity is less than 93% when the time ranges from 75 s to 150 s, and that of Cauchy is below 83%. We conclude that the CEDGM algorithm obtains better average clustering purity than the CEDAS and Cauchy algorithms on this dataset.
The CEDGM algorithm is based on the density grid of the enhanced clustering algorithm. When calculating the density of grid cells, the impact of the boundary data points on the grid is accounted for by calculating the impact coefficient of the added data points on the adjacent grid cells. Thus, these boundary data points are not treated as noise points, and this condition improves the clustering accuracy. Table 5 includes the performances of different state-of-the-art methods compared with the proposed method in terms of the sample speed using the "DS" dataset as validated by the paired t-test. Meanwhile, Table 6 includes the performances of different state-of-the-art methods compared with the proposed method in terms of time using the "DS" dataset as validated by the paired t-test.
For the three datasets Half-Ring, Iris, and Galaxy, the parameters include Radius = 1, Grid Granularity = 30, minPts = 2 data points, eps = 1, and number of intervals = 5. On these datasets, the proposed algorithm has better average clustering accuracy and average NMI than the other existing algorithms, for the same reason that improves the average clustering accuracy in Section 4.1.
Table 7 summarizes the results of the average clustering accuracy, and Table 8 summarizes the results of the average NMI.

V. CONCLUSION
An enhanced method for clustering evolving data streams, called the CEDGM, is introduced in this paper using a density grid-based method. The main idea of the CEDGM algorithm is to use a density grid-based method to improve the clustering quality. This technique is used to reduce the number of distance function calls and to handle outliers. The CEDGM can handle evolving data streams online. The algorithm is compared with well-known techniques in terms of the average number of distance function calls, the average purity and the average accuracy. The proposed algorithm is particularly efficient if data streams are constantly evolving.
The CEDGM is tested using different data streams and is confirmed to be capable of accurately discovering anomalies within specified periods. It further demonstrates its capability to generate high-quality clusters for practical network attacks in the KDDCUP'99 data stream. Extensive evaluations on different synthetic and real datasets using different quality metrics show that the clusters generated by the CEDGM are purer and more accurate than those of similar existing clustering algorithms, owing to summarizing the data points into a grid and generating CMCs. In addition, it has a low average number of distance function calls because the grid-based method does not depend on the number of data objects but on the number of cells in the quantized space in each dimension. In summary, the CEDGM is an accurate technique with a lower average number of distance function calls across different data stream speeds and times. However, one limitation of the CEDGM algorithm is that the grid is not effective in reducing memory consumption for high-dimensional data, especially when the data are sparse. Our future work will focus on improving the memory consumption of the CEDGM algorithm on sparse high-dimensional datasets.