An Automatic Merge Technique to Improve the Clustering Quality Performed by LAMDA

Clustering is a research challenge focused on discovering knowledge from data samples, with the goal of building good-quality partitions. In this paper, we propose an approach based on LAMDA (Learning Algorithm for Multivariable Data Analysis), whose most important features are: a) it is a non-iterative fuzzy algorithm that can work with online data streams; b) it does not require the number of clusters; c) it can generate new partitions from objects that do not have enough similarity with the pre-existing clusters (incremental learning). However, in some applications, the number of created partitions does not correspond to the number of desired clusters, which can be excessive or impractical for the expert. Therefore, our contribution is the formalization of an automatic merge technique that updates the cluster partition performed by LAMDA to improve the quality of the clusters, and a new methodology to compute the Marginal Adequacy Degree that enhances the individual-cluster assignment. The proposal, called LAMDA-RD, is applied to several benchmarks, comparing the results against the original LAMDA and other clustering algorithms to evaluate the performance based on different metrics. Finally, LAMDA-RD is validated in a real case study related to the identification of production states in a gas-lift well with streaming data. The results show that LAMDA-RD achieves competitive performance with respect to the other well-known algorithms, especially in unbalanced benchmarks and benchmarks with an overlap of around 9%. In these cases, our algorithm is the best, reaching a Rand Index (RI) > 98%. Besides, it is consistently among the best for all metrics considered (Silhouette coefficient, a modification of the Silhouette coefficient, WB-index, Performance Coefficient, among others) in all case studies analyzed in this paper. Finally, in the real case study, it is the best in all the metrics.

Clustering, unlike supervised learning, is useful in problems where unlabeled data are available [1]. The aim of clustering is to separate data into partitions whose elements have similar characteristics. Each cluster must be separable and compact with respect to the other clusters [2]. Different clustering approaches are reported in the literature, some of them distance-based [3], [4], partitioning clustering, hierarchical clustering [4], density-based [5], [7], fuzzy logic-based [6]-[9], or Gaussian methods [10], [11], among others. All these techniques depend on a previous stage of descriptor extraction; the descriptors are later used for the individual-cluster assignment performed by the algorithm. Historical data and streaming data [12] are the application scenarios of clustering techniques, but not all methods can work in both contexts, because the data are obtained differently: in the first case, the complete database is available, while in the second case, new data arrive continuously. The importance of working in the context of data streaming is that the evolution patterns provide useful information, which can allow users to make immediate and correct decisions [13]. In classical clustering, data is assigned to exactly one cluster; fuzzy clustering methods, on the other hand, are based on the fuzzy membership degree, so an individual can be a member of several clusters. Fuzzy clustering is widely used in the field of machine learning [2], [14]-[17]. One of these methods is LAMDA (Learning Algorithm for Multivariable Data Analysis) [18], which requires the computation of two parameters to assign an individual to a cluster. The first one is the Marginal Adequacy Degree (MAD), which computes the contribution of each descriptor (attribute) of an individual using fuzzy probability functions [19].
The second parameter, called the Global Adequacy Degree (GAD), is computed by applying fuzzy aggregation operators to the MADs; it corresponds to the membership degree of the individual in each cluster. The GADs define the cluster to which the individual is assigned.
LAMDA can work in supervised (classification) and unsupervised learning (clustering), which is an important advantage over other fuzzy techniques [20], [21], [61]. In unsupervised learning, the algorithm can create new clusters automatically. The creation of new clusters is based on a threshold known as the Non-Informative Class (NIC). For each sample, the GADs are computed for each cluster and for the NIC. The individual is assigned to the cluster with the maximum GAD, but if the maximum GAD is the one corresponding to the NIC, then a new cluster is created.
Conventional iterative methods (based on prototypes), e.g., Fuzzy C-Means (FCM) [22], Gustafson & Kessel means (GK-means) [23], and K-Nearest Neighbors (KNN), require the number of clusters K to be determined. The definition of K is not easy, and it is an open field of research, as is the determination of the cluster centers [24]. In LAMDA, it is not necessary to know the number of partitions K as an input parameter, which is a great advantage over the aforementioned methods. However, in cases in which the data have high levels of intra-cluster uncertainty, a large number of undesired partitions can be created due to the comparison made with the NIC. This limitation of LAMDA was detected by observing the existence of several poor-quality clusters [19], so it is important to propose solutions to this problem, especially in the computation of the MADs, and to incorporate an automatic merge of the clusters when it is required.
The motivation of this work is to propose an automatic merge technique that updates the cluster partition performed by LAMDA to enhance the clustering quality. The proposed extension avoids the excessive and undesired creation of clusters. Also, we propose a new method for the MAD computation based on a penalty factor, in order to improve the individual-cluster assignment. The two improvements described above result in the LAMDA-RD algorithm, which solves the main problems that the original algorithm has when working in unsupervised learning. This paper is organized as follows: Section II briefly reviews the relevant literature related to LAMDA and its research advances in clustering, Section III shows the fundamentals of LAMDA, and Section IV formalizes our approach to improve the calculation of the MADs and the implementation of the automatic merge algorithm. Section V presents the experiments and the statistical analysis, applied to different benchmarks and scenarios, comparing our method with other clustering algorithms. The proposed extensions of LAMDA are also tested in a real case study: a streaming-data scenario corresponding to the identification of the rate of production in gas-lift wells. Section VI presents a general analysis of the results, and finally, conclusions and further works are presented in Section VII.

II. RELATED WORKS
In the context of unsupervised learning, LAMDA has been used to identify the functional states that describe the behavior of systems, for instance, the coagulation process in water plants [25]-[27]. In these applications, the algorithm has identified eight functional states of normal and abnormal functioning of the plant, which allows constant monitoring of the process in order to take corrective actions when abnormal states are detected. In [28], the LAMDA method is applied to the monitoring of complex industrial processes, combined with Markov's theory, which allows identifying the connections between functional states (clusters) through a matrix of transition degrees. Another application field is electrical distribution networks, where fault-detection problems are solved [29] by discovering information from the data. This method has also been used in computer vision applications [25], [30], [31], where its performance has been evaluated with different aggregation operators and fuzzy probability distributions. Finally, one of the most important contributions of LAMDA is its implementation in a software tool for the supervision of complex systems, called SALSA (Situation Assessment using LAMDA classification Algorithm) [33]. SALSA has been used for functional state detection in several applications, such as those presented in [34], [35].
Different studies have proposed modifications to the original LAMDA in the fields of classification and clustering. Specifically, in clustering tasks, the most important recent contributions are:
- ''LAMDA Triple Pi (π) operator (LAMDA-TP)'' [19], [36]. This operator is used in LAMDA as an aggregation function for the computation of the GADs, avoiding the creation of new clusters with few individuals.
- ''LAMDA clustering method based on typicality degree and intuitionistic fuzzy sets'' [2]. The authors propose the calculation of three functions: the Global Typicality Degree (GTD), the Intuitionistic Global Adequacy Degree (IGAD), and the Typicality and Intuitionistic Global Adequacy Degree (TIGAD). This proposal is applied in some case studies, producing well-formed clusters.

The algorithms described above have some drawbacks in the cluster formation. In the first case, LAMDA-TP does not depend on the exigency parameter α (formalized in the original LAMDA), which allows calibrating the permissiveness of the algorithm; in other words, α is a control parameter linked to the quality and number of created clusters. LAMDA-TP performs the clustering process based only on the similarity computed by the triple π operator, and the user cannot calibrate the partitions of the algorithm. LAMDA based on intuitionistic fuzzy sets improves the clustering stage; however, [2] does not present a comparison of the algorithm with respect to other similar methods, and the results suggest that a merge stage is required to group clusters with similar characteristics in order to obtain better models. In addition, the formed partitions are not analyzed in terms of performance metrics that would allow evaluating their intra- and inter-cluster qualities.
Finally, the original LAMDA has the exigency parameter α used for calibration; however, it does not guarantee the creation of good-quality clusters. If this value is close to 1, then an excessive number of poor-quality clusters is formed. Therefore, it is necessary to reinforce the algorithm in order to improve the quality of the resulting partitions.
Merge techniques generate clusters with better intra- and inter-cluster characteristics by merging similar partitions under certain conditions. Recent works [37]-[39] have addressed these algorithms, obtaining significant improvements in cluster construction. Therefore, the main objective of our proposal is to improve the cluster partitions through the implementation of a merge algorithm and a new proposal for the MAD computation, which address what we have identified as the main problems of LAMDA.

III. LEARNING ALGORITHM FOR MULTIVARIABLE DATA ANALYSIS ''LAMDA''
LAMDA is a fuzzy method based on the concept of the adequacy degree. The versatility of this algorithm lies in the fact that it does not require the number of clusters as an input parameter [19], and it is a non-iterative method that can work online, these being its main advantages in the unsupervised learning context. The classic algorithm performs a similarity evaluation between the descriptors of a sample X and the clusters C = {C_1, C_2, ..., C_k, ..., C_m}, where m is the number of pre-existing clusters, not defined by the user [18], to decide where the sample should be assigned.
In order to explain the fundamentals of LAMDA, several definitions and statements are formalized in this paper.
Statement 1: Let X be an unlabeled sample (individual), represented by a vector of n descriptors [12]:

X = [x_1, x_2, ..., x_j, ..., x_n]    (1)

where x_j is the descriptor j of the object X.

Statement 2: Let x̃_j be the normalized descriptor x_j. The normalization is computed with the maximum x_jmax and minimum x_jmin limits of that descriptor as:

x̃_j = (x_j − x_jmin) / (x_jmax − x_jmin)    (2)

The normalized sample X̃ is used to compute the adequacy degrees in each cluster.
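As a quick illustration, the min-max normalization of Statement 2 can be sketched as follows (a minimal example in our own notation; the descriptor limits are assumed known in advance):

```python
def normalize(x_j, x_min, x_max):
    # Min-max normalization (Statement 2): maps the descriptor to [0, 1]
    return (x_j - x_min) / (x_max - x_min)

print(normalize(7.5, 5.0, 10.0))  # 0.5
```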
Definition 1 (Marginal Adequacy Degree (MAD)): This parameter computes the similarity between a descriptor of a sample and the same descriptor in each cluster. For the MAD computation, probability density functions are used, specifically the fuzzy binomial function [40]:

MAD_k,j(x̃_j | ρ_k,j) = ρ_k,j(t)^x̃_j · (1 − ρ_k,j(t))^(1−x̃_j)    (3)

where ρ_k,j(t) is the mean value of the descriptor j in the previously created cluster k, updated progressively each time a new element is added:

ρ_k,j(t) = ρ_k,j(t−1) + (x̃_j − ρ_k,j(t−1)) / (n_k(t−1) + 1)    (4)

where n_k(t−1) is the number of objects previously assigned to the cluster k.
The value of MAD_k,j represents the adequacy degree of x̃_j with respect to the same descriptor of the cluster k. In Eq. (3), if ρ_k,j(t) = 0.5, then MAD_k,j(x̃_j | ρ_k,j = 0.5) = 0.5 for any value of x̃_j. This MAD value is used for the Non-Informative Class (NIC): MAD_NIC,j(x̃_j | ρ_NIC,j = 0.5) = 0.5.
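In code, the fuzzy binomial MAD and the incremental update of ρ_k,j can be sketched as below (a minimal sketch with our own variable names; the descriptor is assumed already normalized to [0, 1]):

```python
def mad_binomial(x_norm, rho):
    # Fuzzy binomial function: MAD = rho^x * (1 - rho)^(1 - x)
    return rho ** x_norm * (1.0 - rho) ** (1.0 - x_norm)

def update_rho(rho_prev, x_norm, n_prev):
    # Running-mean update of the cluster parameter after assigning a sample
    return rho_prev + (x_norm - rho_prev) / (n_prev + 1)

# With rho = 0.5 (the NIC case) the MAD is 0.5 for any normalized descriptor,
# since 0.5^x * 0.5^(1-x) = 0.5
assert abs(mad_binomial(0.3, 0.5) - 0.5) < 1e-12
```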

Definition 2 (Global Adequacy Degree (GAD)):
The adequacy of a sample to each cluster is obtained with the Global Adequacy Degree (GAD), which is computed by combining the MADs with aggregation functions. These are linear interpolations between a t-norm T and a t-conorm S, such as the Dombi operator [41]:

T(a_1, ..., a_n) = 1 / (1 + (Σ_j ((1 − a_j)/a_j)^p)^(1/p))    (5)

S(a_1, ..., a_n) = 1 / (1 + (Σ_j (a_j/(1 − a_j))^p)^(−1/p))    (6)

This operator presents a good performance for class generation, and it has been used in the evaluation of the maximum variation of clustering (see more details in [2]).
Generally, in the literature, p ≥ 1. We choose p = 1 in order to obtain a close approximation to a linear behavior of the t-norm and t-conorm [41].
The exigency parameter 0 ≤ α ≤ 1 is used to calibrate the fuzzy data partition [42]:

GAD_X̃,k = α · T(MAD_k,1, ..., MAD_k,n) + (1 − α) · S(MAD_k,1, ..., MAD_k,n)    (7)

In Eq. (7), if α = 1, then the fuzzy data partition is computed by the t-norm. This means the clustering is stricter; therefore, more objects will be unrecognized (sent to the NIC), creating more new clusters. If α = 0, then the fuzzy data partition is computed by the t-conorm. This means the clustering is more permissive; therefore, samples are assigned to a cluster despite not having enough similarity with the samples belonging to it. α produces a linear interpolation between the t-norm and t-conorm for the GAD [43].
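This interpolation with the Dombi operator at p = 1 can be sketched as below (our own minimal implementation, under the assumption that every MAD lies strictly in (0, 1); the n-ary Dombi forms used here are the standard ones and may differ in detail from Eqs. (5)-(6)):

```python
def dombi_tnorm(mads, p=1.0):
    # Dombi t-norm (strict aggregation); requires every value in (0, 1)
    s = sum(((1.0 - a) / a) ** p for a in mads)
    return 1.0 / (1.0 + s ** (1.0 / p))

def dombi_tconorm(mads, p=1.0):
    # Dombi t-conorm (permissive aggregation)
    s = sum((a / (1.0 - a)) ** p for a in mads)
    return 1.0 / (1.0 + s ** (-1.0 / p))

def gad(mads, alpha, p=1.0):
    # Exigency alpha interpolates linearly between t-norm and t-conorm
    return alpha * dombi_tnorm(mads, p) + (1.0 - alpha) * dombi_tconorm(mads, p)
```

With α = 1 the aggregation reduces to the t-norm (strict clustering); with α = 0, to the t-conorm (permissive clustering).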
The GAD of the NIC is computed considering MAD_NIC,j = 0.5, regardless of the value of x̃_j:

GAD_X̃,NIC = α · T(0.5, ..., 0.5) + (1 − α) · S(0.5, ..., 0.5)    (8)

Statement 3: Let index (in) be the identifier of the cluster where the sample X has the highest membership degree, that is, the highest GAD:

in = arg max {GAD_X̃,1, ..., GAD_X̃,k, ..., GAD_X̃,m, GAD_X̃,NIC}    (9)

If the object is assigned to the NIC, then it becomes the first element of a new cluster C_new:

C_new = {X̃}, with index m + 1    (10)

The LAMDA implementation is described in the pseudo-code presented below. The process starts by normalizing each descriptor of the sample X. Next, the MADs are computed for each descriptor in each cluster using the fuzzy binomial function. With the MADs, the GAD in each cluster is computed with fuzzy connectives, considering the exigency value α. Finally, the sample X̃ is assigned to the cluster with the highest GAD. If the highest GAD is GAD_NIC, then the sample is considered not to belong to any cluster, and it is sent to the NIC to create a new partition.

Algorithm 1 LAMDA for Unsupervised Learning
Input: Sample X
Procedure:
1. Normalize the sample X̃ using Eq. (2).
2. Compute MAD_k,j(x̃_j | ρ_k,j) for each descriptor j in each cluster k, using Eq. (3).
3. Compute GAD_X̃,k from MAD_k,1, ..., MAD_k,n, using Eq. (7).
4. Identify the index in of the cluster with: in = arg max {GAD_X̃,1, ..., GAD_X̃,k, ..., GAD_X̃,m, GAD_X̃,NIC}.
5. if the highest GAD is the one corresponding to the NIC,
6.    then the sample is the first element of the new cluster C_m+1, as shown in Eq. (10).
7.    else the object is assigned to the cluster with the highest GAD, and ρ_k,j(t) is updated using Eq. (4).
8. End.
Output: cluster updated or created by the algorithm
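The complete unsupervised loop can be sketched as follows. This is a simplified illustration of Algorithm 1, not the authors' implementation: for brevity, it aggregates the MADs with an α-blend of min and max (stand-ins for the t-norm/t-conorm) instead of the Dombi operator, and the samples are assumed already normalized.

```python
import numpy as np

def lamda_step(x_norm, clusters, alpha=1.0):
    """One non-iterative LAMDA assignment for a normalized sample.

    clusters is a list of dicts {'rho': np.ndarray, 'n': int}, updated in place.
    """
    def mad(x, rho):
        return rho ** x * (1 - rho) ** (1 - x)          # fuzzy binomial

    def gad(mads):
        return alpha * mads.min() + (1 - alpha) * mads.max()

    gads = [gad(mad(x_norm, c['rho'])) for c in clusters]
    gad_nic = 0.5                                       # every NIC MAD is 0.5
    if not clusters or max(gads) <= gad_nic:
        clusters.append({'rho': x_norm.copy(), 'n': 1})  # new cluster from NIC
    else:
        c = clusters[int(np.argmax(gads))]
        c['rho'] += (x_norm - c['rho']) / (c['n'] + 1)   # Eq. (4) update
        c['n'] += 1
    return clusters
```

Feeding samples one at a time, dissimilar samples spawn new clusters while similar ones update an existing centroid, without ever fixing the number of clusters in advance.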

IV. PROPOSED APPROACH
In [2], [19], it has been shown that LAMDA creates clusters that do not correspond to the number of desired groups. Clusters with a high degree of similarity should be merged into a single cluster, according to a similarity measure. Thus, the algorithm should automatically decide when a merge process between clusters is required. To that end, we propose to hybridize the original algorithm with distance measurements in order to improve the quality of the clusters. The split task is considered an intrinsic LAMDA feature, because the algorithm can create new groups from the global adequacy of an individual to each existing cluster and the NIC. Section IV.A establishes several definitions to improve the computation of the MADs based on a Robust Distance criterion. Section IV.B presents the procedure followed to perform the merge process, while Section IV.C shows the general procedure implemented to enhance the performance of the algorithm in unsupervised learning.

A. MAD COMPUTATION BASED ON A ROBUST DISTANCE CRITERION
Definition 3 (Cauchy Marginal Adequacy Degree (CMAD)):
This parameter corresponds to the MAD computed using the fuzzy Cauchy function [44], which finds a membership function µ_c(x) that models the similarity of an individual to a cluster:

µ_c(x) = 1 / (1 + dist(x, x_0)^2)    (11)

where dist(x, x_0) is the distance of the individual x to a prototype member x_0. This function has been chosen because it allows computing the membership degree according to a normal distribution, with the addition of a penalty factor based on distances (Definition 4), to improve the calculation of the MADs.
The application of Eq. (11) in LAMDA, with x = x̃_j and x_0 = ρ_k,j (the descriptor j of the centroid of the cluster k), redefines the MAD as the CMAD (see [21] for more details):

CMAD_k,j(x̃_j | ρ_k,j) = 1 / (1 + dist(x̃_j, ρ_k,j)^2)    (12)

To keep the MAD criterion of Definition 1, it is set that CMAD_NIC,j(x̃_j | ρ_NIC,j) = 0.5.
Definition 4 (Robust Marginal Adequacy Degree (RMAD)): This parameter corresponds to the product of the CMAD and a penalty factor K_k,X̃ computed for each cluster k. To obtain K_k,X̃, two parameters are required. The first one is the average distance of the individual X̃ to the center of each cluster k (d_k,X̃), which is calculated as [21]:

d_k,X̃ = (1/n) Σ_j dist(x̃_j, ρ_k,j)    (13)

The second parameter is the threshold d_nb ∈ [0, 1], called the ''average distance between neighbors'', which must be set by the user (Section V.C describes a method to calibrate this parameter).
Statement 4: If the average distance d_k,X̃ is greater than d_nb (d_k,X̃ > d_nb), then the penalty factor K_k,X̃ is computed with Eq. (14) [21]. As shown in Eq. (14), if dist(d_k,X̃, d_nb) increases, then K_k,X̃ decreases.
Statement 5: If the average distance d_k,X̃ is less than or equal to d_nb (d_k,X̃ ≤ d_nb), then K_k,X̃ is set to 1, because it is not required to penalize the CMAD of individuals that are within the threshold. Now, RMAD_k,j is computed as [21]:

RMAD_k,j = K_k,X̃ · CMAD_k,j    (15)

As shown in Eq. (15), RMAD_k,j is equal to CMAD_k,j if the condition of Statement 5 is met, that is, if the distance between the individual X̃ and the cluster k is within the threshold d_nb. According to Statement 4, when the distance between the individual X̃ and the cluster k is greater than the threshold, the CMAD is penalized; therefore, a decrease in the adequacy degree is established. The parameter K_k,X̃ reinforces the measure of the degree of similarity based on distances. The two established conditions on d_k,X̃ affect the computation of the RMAD, as properties P1 and P2 (Eqs. (16) and (17)) show. The penalty factor for the NIC is set to K_NIC,X̃ = 1, because it is not required to penalize the Non-Informative Class. As observed in Eqs. (16) and (17), the distance d_k,X̃ allows penalizing the dissimilarity between the samples and the clusters. This parameter is called the Robust Distance; hence, this proposal takes the name LAMDA-RD. Once the RMAD is calculated, the computation of the GAD is performed as in the original LAMDA, using Definition 2 and Statement 3, but with RMAD instead of MAD.
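Since Eq. (14) is not reproduced here, the sketch below only illustrates the qualitative behavior of the penalty: it uses a generic decreasing function of the gap between d_k,X̃ and d_nb (an exponential decay, which is our assumption, not the paper's formula), while keeping the two conditions of Statements 4 and 5.

```python
import math

def penalty_factor(d_kx, d_nb):
    # Statement 5: no penalty inside the threshold
    if d_kx <= d_nb:
        return 1.0
    # Statement 4 (illustrative only): K decreases as the distance to the
    # threshold grows; the exponential form is an assumption, not Eq. (14)
    return math.exp(-(d_kx - d_nb))

def robust_mad(cmad, d_kx, d_nb):
    # Eq. (15): RMAD = K * CMAD
    return penalty_factor(d_kx, d_nb) * cmad
```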

B. AUTOMATIC MERGE ALGORITHM
To describe the automatic merge algorithm for LAMDA, the following definitions are formalized:

Definition 5 (A Cluster C_k): It is described by the tuple:

C_k = (ρ_k,j, X̃_k, index, n_k)    (18)

where ρ_k,j is the centroid of the descriptor j in the cluster k, which must be updated every time a new individual is assigned to C_k (see Eq. (4)), X̃_k is the set of individuals in C_k, index is the identifier of C_k, and n_k is the number of samples in the cluster.

Definition 6 (The Neighbor Cluster C_nb): It is described as:

C_nb = (ρ_nb,j, X̃_nb, index, n_nb)    (19)

where ρ_nb,j is the centroid of the descriptor j in the cluster nb, X̃_nb is the set of individuals in C_nb, index is the identifier of C_nb, and n_nb is the number of individuals in the cluster.

LAMDA is non-iterative; therefore, in the clustering process, one individual is analyzed at a time (a useful feature in streaming data). According to the LAMDA fundamentals, the GADs are the membership degrees that an individual has in each cluster [45]. The individual is assigned to the cluster with the maximum GAD, so we can conclude that the cluster with the second-largest GAD is the nearest neighbor cluster.
The main problem to solve in this paper is the drawback of the original LAMDA: the excessive creation of clusters. Hence, it is essential to perform an automatic merge. Our proposal is characterized by similarity measures based on distances and densities. In the merge stage, we can have two cases: a) the individual was assigned to the NIC, and therefore a new cluster was created; this is the individual-cluster case (see Figure 1-a); b) the individual was assigned to an existing cluster C_k; this is the cluster-cluster case (see Figure 1-b). The procedure of the merge process is described in the pseudo-code presented below.
Definition 7 (Measure of the Compactness of the Neighbor Cluster t_nb,j): It is the mean value of all the distances (in each descriptor) among the individuals belonging to the neighbor cluster C_nb, computed as:

t_nb,j = mean over all pairs (i, l), i ≠ l, of dist(x̃^i_nb,j, x̃^l_nb,j), ∀j = 1, ..., n    (20)

where x̃^i_nb,j is the descriptor j of the individual i in the cluster C_nb.

Algorithm 2 Merging Process
Input: Clusters C_k and C_nb.
Procedure:
1. Calculate the compactness of the neighboring cluster t_nb,j using Definition 7.
2. Calculate the distances between the individuals of the clusters C_k and C_nb. Individuals whose distances are less than t_nb,j in each descriptor j belong to the overlap zone (see Definition 8).
3. Determine the ratio of individuals in the overlap area with respect to the total number of individuals in the two neighboring clusters (D_k−nb), as shown in Definition 9.
4. Set a density threshold for the overlapping area, D_t, and verify whether the condition D_k−nb ≥ D_t is met to proceed with the merge process (Statement 6).
5. End.
Output: cluster updated or created by the algorithm
Definition 8 (Number of Individuals in the Overlapping Area (N_I)): This parameter is computed by counting the individuals in the overlapping area of the clusters C_k and C_nb, that is, the individuals whose distances are less than t_nb,j. For this, we first identify the individuals of each cluster C_k and C_nb that meet that condition, and then the cardinalities of the resulting subsets are calculated:

N_k = |{X̃ ∈ C_k in the overlapping area}|, N_nb = |{X̃ ∈ C_nb in the overlapping area}|    (21)

where N_k and N_nb are the numbers of individuals in the overlapping area for the clusters C_k and C_nb, respectively. The total number of individuals in the overlapping area is:

N_I = N_k + N_nb    (22)

Definition 9 (D_k−nb): It is the density in the overlapping area between two clusters C_k and C_nb, computed as:

D_k−nb = N_I / (n_k + n_nb)    (23)

Statement 6: Two clusters C_k and C_nb are merged if D_k−nb ≥ D_t, where D_t ∈ [0, 1] is a density threshold set by the user. A high D_t value demands a greater density of individuals in the overlapping area. Figure 2 shows the cases in which the condition of Statement 6 is not satisfied. It is observed that the new individual X̃ increases the density of the overlapped area between the clusters C_k and C_nb. However, if D_k−nb < D_t, then the algorithm does not proceed with the merge process, considering that there is not enough similarity between the two analyzed partitions.
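Definitions 7-9 and Statement 6 can be sketched together as below. This is an illustrative reading, not the authors' code: in particular, the membership test for the overlap zone compares each individual with the opposite cluster's centroid, descriptor by descriptor, which is one plausible interpretation of Definition 8.

```python
import numpy as np

def should_merge(Xk, Xnb, d_t):
    """Decide whether clusters C_k and C_nb should merge (Statement 6).

    Xk, Xnb: 2-D arrays, one row per individual, one column per descriptor.
    d_t: density threshold in [0, 1].
    """
    # Definition 7: per-descriptor compactness of the neighbor cluster
    # (mean of the pairwise distances among its individuals)
    t_nb = np.abs(Xnb[:, None, :] - Xnb[None, :, :]).mean(axis=(0, 1))
    centroid_k, centroid_nb = Xk.mean(axis=0), Xnb.mean(axis=0)
    # Definition 8: count individuals falling inside the overlap zone
    n_k = np.all(np.abs(Xk - centroid_nb) < t_nb, axis=1).sum()
    n_nb = np.all(np.abs(Xnb - centroid_k) < t_nb, axis=1).sum()
    n_i = n_k + n_nb
    # Definition 9: density of the overlapping area
    d_k_nb = n_i / (len(Xk) + len(Xnb))
    return d_k_nb >= d_t          # Statement 6
```

Two clusters that interleave near their shared boundary yield a high overlap density and trigger the merge, while well-separated clusters do not.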

FIGURE 2.
Graphical example of assigning a new sample to a cluster when Statement 6 is not met: (a) the sample creates a new cluster, or (b) the sample is assigned to a pre-existing cluster. Figure 3 shows the cases in which Statement 6 is satisfied. It is observed that the new individual X̃ increases the density of the overlapped area between the clusters C_k and C_nb. If D_k−nb ≥ D_t, then the algorithm proceeds with the merge process, considering that there is enough similarity between the two analyzed groups.
Definition 10 (Resulting New Cluster (C_new)): The resulting cluster after the merge process is given by the tuple:

C_new = (ρ_new,j, X̃_new, index, n_new), with X̃_new = X̃_k ∪ X̃_nb and n_new = n_k + n_nb    (24)

where x̃^t_new,j is the descriptor j of the individual t in the clusters C_k and C_nb that form the new cluster C_new.
As shown in Figure 3, each time that an individual is assigned to a cluster, it is evaluated if the density of the overlapping area has increased. The density is considered as a requirement to determine if the merge process should be executed according to the threshold D t .

C. GENERAL PROCEDURE OF LAMDA-RD WITH AUTOMATIC MERGE ALGORITHM
This procedure is repeated for each individual. The scheme in Figure 4 details the fuzzy clustering stage with the extension of an automatic merge stage based on distances and densities. The first step is the normalization of the descriptors of the individual to be assigned to a cluster. Next, the RMADs are calculated for each descriptor in each cluster using the Cauchy function, which considers the parameter K_k,X̃ that penalizes the dissimilarity between the individual and the clusters based on distances, as shown in Eqs. (14) and (15). With the RMADs, the GAD in each cluster is computed, setting a high value for the exigency level (α = 1) with the aim of obtaining a strict behavior (a non-permissive algorithm).
The highest GAD defines the cluster to which the individual must be assigned (and its parameter ρ_k,j is updated). However, if the maximum GAD corresponds to the NIC, then a new cluster is created, with this individual as the first sample of the new group. In the merge stage, it is evaluated whether a merge is required between the cluster to which the individual was assigned and the neighboring cluster (defined by the second-largest GAD), because the individual may be located in the overlapping zone between both clusters, fulfilling the merge requirement of Statement 6.
In general, the algorithm starts with m = 0. When the first sample to be evaluated arrives, then the first cluster is created (m = 1). Next, when the second sample arrives, then it is evaluated, and if the conditions established by the algorithm are met, then this sample is assigned to cluster 1, otherwise, a new cluster is created (m = 2). This process is followed successively for all the samples, until evaluating the last sample N, assigning it to one of the current clusters or a new one. Thus, the algorithm does not require the definition of the number of clusters (m). The number of clusters depends on the data characteristics.

V. EXPERIMENTS AND RESULTS
In this section, the experimental tests in different clustering tasks are presented. The goal of the experiments is to validate the proposed method, analyzing the cluster quality and the performance in three scenarios:
i. A comparison of LAMDA-RD with the original LAMDA and other clustering algorithms.
ii. A comparison of LAMDA-RD with an online clustering algorithm. This algorithm has been selected since it allows online clustering, a characteristic to be considered in order to make a fair comparison with LAMDA. In this test, the individuals are acquired from streaming data, since the algorithms are based on an online operation.
iii. The applicability of LAMDA-RD to a real case study (gas-lift well production), to evaluate the behavior of our proposal in a real scenario.

The tests described in i) and ii) are validated using datasets from [47]-[49]. We selected these datasets for their different characteristics, such as: the number of individuals and features; the level of intra-cluster overlap, to observe how the allocation of individuals is performed in those cases; balanced and unbalanced classes; and the number of clusters (see details in Table 1). The datasets with a large number of samples are Dim 1024, Unbalance, and Postures (high-dimensional); their analysis is required to observe the cluster quality and to measure the machine time needed to build the partitions. The 2-dimensional datasets are used for visualization purposes, to easily observe the behavior of the different algorithms. In all benchmarks, the original dimensionality has been maintained to make a fair comparison between the algorithms (tested under the same characteristics). A correlation analysis to reduce the dimensionality of the datasets (e.g., PCA) is not an objective of this paper during the validation of our proposal, but it is an analysis that should be considered in future works.
The following metrics have been chosen in order to evaluate the intrinsic and extrinsic characteristics of the obtained model, and to analyze the intra- and inter-cluster qualities. A more detailed description of them can be found in [50].
Silhouette coefficient (SC): a metric in [−1, 1], where −1 indicates incorrect clustering and 1 indicates highly dense, well-separated clustering; values around zero indicate overlapping clusters. It is composed of two values: a_SC(x) is the mean distance between an individual and all other individuals in the same cluster, and b_SC(x) is the mean distance between an individual and all the individuals in the nearest cluster. The bigger the value, the better the clustering. Considering N, the number of elements of the dataset, SC is computed as:

SC = (1/N) Σ_x (b_SC(x) − a_SC(x)) / max{a_SC(x), b_SC(x)}    (27)

Modification of the Silhouette coefficient (SILA): It improves the analysis of the Silhouette coefficient, because SC can indicate an incorrect partitioning scheme when there are large differences in the distances between groups. The SILA index contains an additional component to overcome this drawback. SILA measures the cluster compactness, which increases when a cluster size increases considerably, and it reduces the high values of the index caused by large differences between the groups [51]. This index varies in the same range as SC.
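For reference, the mean Silhouette coefficient can be computed as in the sketch below (a straightforward implementation of the standard definition; singleton clusters receive s = 0 by convention, and at least two clusters are assumed):

```python
import numpy as np

def silhouette(X, labels):
    """Mean Silhouette coefficient over all individuals."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    scores = []
    for i, x in enumerate(X):
        same = X[labels == labels[i]]
        if len(same) == 1:
            scores.append(0.0)                      # singleton convention
            continue
        # a: mean distance to the other members of the same cluster
        a = np.linalg.norm(same - x, axis=1).sum() / (len(same) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(np.linalg.norm(X[labels == c] - x, axis=1).mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```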
Sum-of-squares within clusters (SSW): an internal measure used to evaluate the cohesion of the clusters that the algorithm has generated. The smaller the value, the better the clustering. It is defined by Eq. (28):

SSW = Σ_k Σ_{X̃ ∈ C_k} ||X̃ − ρ_k||²    (28)

where X̃ is an individual in the cluster C_k, ρ_k is its centroid, and m is the number of clusters.
Sum-of-squares between clusters (SSB): a prototype-based separation measure used to evaluate the inter-cluster distance. The bigger the value, the better the clustering. It is defined by Eq. (29):

SSB = Σ_k n_k ||ρ_k − ρ_g||²    (29)

where n_k is the number of elements in the cluster k and ρ_g is the mean value of the whole dataset (global center).

WB-index (WB_index) [50]: it is based on SSW and SSB. It emphasizes the effect of SSW by multiplying it by the generated number of clusters m. This metric is an alternative to methods based on knee-point detection, because most indices show monotonicity with an increasing number of clusters; therefore, indices with a clear minimum or maximum value are preferred, WB_index being one of them. Being a ratio between SSW and SSB, the lower its value, the better the quality of the formed clusters. In cases in which it is necessary to know the optimal number of groups, the WB-index is plotted for different numbers of partitions, and the model with the minimum value is chosen as the optimum. This index is defined in Eq. (30):

WB_index = m · SSW / SSB    (30)
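The three quantities above combine directly; a minimal sketch of SSW, SSB, and WB_index = m · SSW / SSB (our own implementation of the standard definitions):

```python
import numpy as np

def wb_index(X, labels):
    """WB-index: m * SSW / SSB (lower is better)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    rho_g = X.mean(axis=0)                     # global center
    ssw = ssb = 0.0
    clusters = set(labels.tolist())
    for c in clusters:
        members = X[labels == c]
        rho_k = members.mean(axis=0)
        ssw += (np.linalg.norm(members - rho_k, axis=1) ** 2).sum()   # cohesion
        ssb += len(members) * np.linalg.norm(rho_k - rho_g) ** 2      # separation
    return len(clusters) * ssw / ssb
```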
Performance Coefficient (P_C): a metric that we propose, which relates SC (or SILA) and the WB-index, in order to establish which of the tested algorithms presents the best performance. The value of P_C must be minimal and greater than zero, because WB_index must be small and SC (or SILA) must be positive and close to 1 to establish well-formed clusters.
Rand Index (RI ): It is one of the most known indices for measuring the similarity between partitions, being necessary the construction of a consensus matrix [52]. Assume that there are two partitions P (r) , r = 1, 2, in a set of N individuals S = {s 1 , . . . , s N }. Partition P (1) has k 1 clusters, and partition P (2) has k 2 clusters. In order to compare the two partitions, the following terms are defined with two pairs of individuals coming from P (1) and P (2) : a: the number of the pairs of s i and s j belonging to the same cluster in P (1) and P (2) .
b: the number of pairs s_i, s_j belonging to the same cluster in P^(1) and to different clusters in P^(2).
c: the number of pairs s_i, s_j belonging to different clusters in P^(1) and to the same cluster in P^(2). d: the number of pairs s_i, s_j belonging to different clusters in both P^(1) and P^(2).
The term a + d is the number of agreements between the partitions P^(1) and P^(2), while b + c is the number of disagreements between them. The Rand Index computes the number of agreements over the total number of pairs.
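The pair counts a-d defined above lead directly to RI = (a + d) / (a + b + c + d); a minimal sketch:

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand Index from pair counts over two partitions (label lists)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p1)), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1          # same cluster in both partitions
        elif same1:
            b += 1          # same in P(1), different in P(2)
        elif same2:
            c += 1          # different in P(1), same in P(2)
        else:
            d += 1          # different clusters in both partitions
    return (a + d) / (a + b + c + d)
```

Identical partitions yield RI = 1, and the index is invariant to relabeling of the clusters.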

A. COMPARISON OF LAMDA-RD WITH OTHER CLUSTERING ALGORITHMS
In the following experiments, the parameters of the compared algorithms are tuned with the same care, and separately for each dataset, to make a fair comparison. The calibration procedure of LAMDA-RD is presented in Section V-C, analyzing the sensitivity of the results.
This test compares the quality of the formed clusters with respect to the results of the original LAMDA and other methods, which are generally iterative, do not work online, and require the number of clusters as an input parameter.
These algorithms serve as a solid baseline for the performance comparison. Some conventional algorithms (K-means (KM), K-medoids (KMD), Fuzzy c-means (FCM), DBSCAN (DBS)), and some newer algorithms, such as Agglomerative hierarchical tree (AHT) [4], Spectral clustering (SPC) [53], Hierarchical density-based clustering (HDBSCAN, ''HDB'') [54] and the Link-based cluster ensemble framework with consensus function (CON) [55], are tested for the comparison. Figure 5 shows the methodology used for this experiment. It should be noted that LAMDA works with streaming data to form clusters, while the other algorithms require the complete dataset for this purpose; however, this test allows evaluating the quality of the created clusters in tasks in which historical data are available. For Eqs. (5) and (6), as mentioned in Section III, we select p = 1 for all the experiments. In order to obtain more reliable results, the experiment is repeated 20 times (#reps = 20); each time, the performance metrics are computed, and from the obtained results the average (''Average'') and standard deviation (''Std'') of the metrics are computed, to observe the repeatability and the confidence interval of the experiment in the creation of clusters. Table 2 presents the SC for each clustering algorithm, where the best average (highest value) in each benchmark has been marked in bold text. The standard deviation shows the variability of the results across the different tests.
The best algorithms have SC values closest to 1, identifying dense and well-separated clusters. In all benchmarks, LAMDA-RD is better than the original LAMDA, in most cases significantly improving the quality of the created partitions; for instance, see the results in Segment, or in the cases of Unbalance, s1, s2, s3 and a1, where SC goes from negative values (bad clustering) to positive values, in some cases better than the conventional algorithms (SC close to 1). We can also observe that LAMDA-RD performs as well as the best clustering algorithms in benchmarks such as Dim1024 and Hepta. It is also the best algorithm for Segment, Unbalance and s1, which are datasets with balanced and unbalanced distributions and a maximum intra-cluster overlap of 9%. In the benchmarks R15, Aggregation and s2, our approach presents results very close to the best value (KMD). In s3 and a1, the algorithm decreases its performance due to the dispersion of the individuals (the overlap increases). Nevertheless, based on SC, LAMDA-RD still presents better results than DBS in the s3 and a1 datasets.
The weakness of LAMDA-RD in datasets with high overlap occurs because the number of clusters to be built is unknown. It has the same problems as density techniques such as DBS and HDB, whose performance decreases since they are not based on distance-optimization criteria, unlike KM, FCM or AHT, an important criterion in this context.
The Std in all cases shows good repeatability in the experiments, making it possible to notice that similar results are obtained in each iteration. The worst case occurs in R15, where the Std (0.128) reaches 14% of the average value, which still indicates a good behavior of the algorithm.
The SILA index results are shown in Table 3, where it can be seen that the index decreases only minimally with respect to SC (Table 2). The only benchmark where a considerable difference is seen is Dim1024 (SC = 1 and SILA = 0.002). This index decreases due to the great separability that exists between the 16 clusters; however, all the algorithms were able to make a good clustering, since each group is well defined in the dataset. The results obtained in each benchmark are consistent between SC and SILA, showing that the best algorithms are the same for these two metrics.
The results of the WB-index for each clustering algorithm are presented in Table 4, where the best average (lowest value) in each benchmark has been marked in bold text.
As with the previous metric, our algorithm is the best for the WB-index in the datasets Dim1024, Hepta and s1, where the individuals have an overlap percentage under 9%, and in the case where the clusters are unbalanced (Unbalance).
In the datasets Segment, R15 and Aggregation, our method is very close to the best values, as explained before, in cases where there is no overlap between groups. For s2, s3 and a1, the performance of our method decreases, due to the presence of individuals in overlapping areas. The other methods can build better models because they know the number of clusters to build; this is evidenced by the results obtained with the methods KM, KMD, FCM, AHT and SPC, whose results are quite similar in the last three benchmarks. Small values of Std again show good repeatability in the experiments performed at each iteration.
Finally, in the previous section we proposed a way to determine the best algorithms with a single metric, called P_C and P_C-SILA. These values are shown in Tables 5 and 6, respectively. The best (lowest value) has been marked in bold text for each benchmark. The results presented in Tables 5 and 6 show that LAMDA-RD is the best algorithm for the following datasets: Dim1024, Hepta, Unbalance and s1, which implies a good quality of the formed clusters, based on P_C and P_C-SILA. In Segment and R15, our approach has values very close to the best algorithm (KMD). The performance for s2, s3 and a1 is reduced in LAMDA-RD, DBSCAN and HDB, which is reasonable because they are density-based: if there are scattered individuals, these algorithms cannot make a good assignment to the clusters. We can also see that our proposal works quite well when the groups have no overlap between them.
In the case of the benchmark s1 (9% of overlap), it is the best algorithm, which leads us to conclude that the performance of the algorithm is not affected by individuals slightly overlapped between clusters. Also, based on the metrics, a very good behavior of the algorithm can be observed in unbalanced datasets (Unbalance). When the overlap percentage increases, e.g. in s2 (20% overlap), our algorithm still makes a good clustering; however, in the case of a1 and s3 (22% and 40% of overlap, respectively), based on our experiments, we can conclude that density-based methods have problems assigning individuals located in the overlap zone. Methods like KMD, FCM, AHT and SPC have the advantage of knowing the number of clusters a priori, which makes it easier to assign those samples to the nearest cluster, e.g. in s3 (P_C ≈ 4.48), while LAMDA-RD decreases its performance (P_C ≈ 12.3), a value showing that, when there is an overlap greater than 20% between clusters, our proposal builds clusters of poor quality, incorrectly assigning individuals to the most similar clusters. In particular, our proposal is better than DBS and HDB, the methods with which the fairest comparison can be made, since they do not require setting the desired number of partitions.
Based on P_C and P_C-SILA, the results are quite consistent with SC, SILA and the WB-index. Our proposal works quite well if the overlap between clusters is less than 20%. If it increases, then the iterative methods are better, which is logical due to their individual-assignment methodology, which minimizes intra-cluster distance functions and maximizes inter-cluster distances until reaching optimal values; however, these iterative methods increase their computation time depending on the dimensions of the dataset used to perform the optimization. P_C-SILA shows values very similar to P_C. Because the SILA values decrease with respect to SC, the P_C-SILA index increases minimally. The only benchmark where a considerable difference is seen is Dim1024 (P_C = 0.142 and P_C-SILA = 61.73). The index increases considerably since SILA is close to zero, and this finally confirms the information we had regarding the separability of that benchmark. Once again, the results of P_C and P_C-SILA are consistent in all the evaluated benchmarks.
Finally, the quality of the clusters with respect to the real classes of each benchmark is computed with RI. The results are computed with the best partitions obtained with each algorithm. These values are shown in Table 7, with the best (highest) values marked in bold text. The results show that the clusters constructed by LAMDA-RD coincide to a high degree with the real classes, taking into consideration that RI is an extrinsic clustering validation measure that compares the output of the clustering method against the real results (classes). LAMDA-RD is better than LAMDA in all benchmarks, and in some datasets, like Dim1024, Hepta, R15, Unbalance and s1, the results are as good as the best algorithms, and in some cases better than them (see R15 and s1). In the rest of the datasets, our method presents a good behavior, except for Segment, in which a high number of descriptors affects the performance of the density-based methods (see the values of LAMDA-RD, DBS and HDB); therefore, an evaluation of the relevant descriptors should be made, discarding those that do not adequately characterize each group. Due to the distribution and different densities of the clusters of Unbalance (see the distribution of data in [47]), the LAMDA-RD, DBS and HDB algorithms are the best, since they can clearly distinguish each group thanks to the separation that exists among them, without the existence of overlap.

B. PERFORMANCE COMPARISON OF LAMDA-RD AND OTHER ONLINE CLUSTERING ALGORITHMS
To analyze and determine how our method improves the behavior of the original algorithm and of other online clustering algorithms that work with data streams, the following tests are performed. At this point, we also report the time consumed (computational cost) by each proposal in a streaming-data scenario. In this context, a successful algorithm must consider the following restrictions [56]: -Individuals continually arrive; -There is no control over the order in which the individuals are generated; -The size of a stream is (potentially) unbounded; -Data objects are discarded after they have been processed.
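The four restrictions above fix the shape of any compliant processing loop: one pass, arbitrary order, and no retained samples. A minimal sketch follows, where `assign_or_create` is a hypothetical stand-in for the algorithm's assignment step (not LAMDA's actual adequacy computation):

```python
def process_stream(stream, assign_or_create):
    """One-pass loop honoring the stream-mining restrictions: samples
    arrive in an arbitrary, unbounded order and are discarded once
    processed; only compact cluster summaries persist."""
    clusters = []                    # summaries, never the raw samples
    for sample in stream:
        assign_or_create(sample, clusters)
        # `sample` goes out of scope here: nothing but `clusters` is kept
    return clusters

# toy stand-in: 1-D threshold assignment, purely illustrative
def toy_assign(sample, clusters):
    for c in clusters:
        if abs(sample - c["center"]) < 1.0:
            c["n"] += 1
            c["center"] += (sample - c["center"]) / c["n"]  # running mean
            return
    clusters.append({"center": float(sample), "n": 1})
```

Any algorithm meeting the restrictions, including the LAMDA family, can be dropped into this loop in place of `toy_assign`.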
All these restrictions are considered in the test for the LAMDA family (LAMDA-RD, LAMDA-TP, original LAMDA), and for another online clustering method called ADDClustering, designed for live data streams [46]. A maximum exigency parameter is set (α = 1), because a strict behavior of the algorithms in the assignment process is desired. The control parameters of LAMDA-RD (d_nb and D_t) have been heuristically set to obtain a number of clusters close to the real number of classes in each dataset. The methodology used for this experiment is presented in Figure 6.
The experiment is repeated 20 times (#reps = 20), and each time the performance metrics are computed. Finally, from the obtained results, the average (''Average'') and standard deviation (''Std'') are computed, in order to observe the repeatability of each online algorithm in the creation of clusters.
The results of these statistical metrics are shown in Table 8, and the algorithm with the best average metric is marked in bold text. According to P_C and P_C-SILA (the two metrics are similar, since SC and SILA are quite similar), our proposal is the best in all the cases, even with high-dimensional datasets (see Postures), which shows a good scalability of our method at the cost of increased computational time, something common in data-stream scenarios. It can be observed that this metric increases proportionally as the percentage of overlap between clusters increases (see the results of s1, s2, s3 and a1), which is expected, because the clustering process becomes more complex when individuals can belong to two or more neighboring clusters.
LAMDA is good in Aggregation, especially in SSB and SSW; however, it can be noticed that it creates a high number of clusters, whose quality is not good when analyzing the results of SC, WB-index or P_C, LAMDA-RD being the best.
Concerning the computational cost, it can be noticed that our approach has the highest value; this is due to the additional operations executed in the merging stage. This time depends especially on the number of individuals and the number of dimensions, e.g. Segment (65.23 s, 2310 individuals and 19 features) and Postures (600.8 s, 74975 individuals and 15 features).
Additionally, it is observed that LAMDA-RD presents better results than ADDClustering in all the benchmarks, which makes it a good alternative in online clustering, because both algorithms work with the density criterion to form the clusters. Evaluating P_C, our method is always the best. In Segment (LAMDA-RD: 65.23 s and ADDClustering: 1.780 s), ADDClustering is faster, which shows that this algorithm works better with data with several descriptors, but its performance decreases when the number of individuals increases, e.g. Postures (LAMDA-RD: 600.8 s and ADDClustering: 766.8 s).
Comparing P_C and P_C-SILA, it can be seen that the latter presents a higher value, because SILA penalizes the large differences in distances between clusters in a dataset, which is clearly observed in the Dim1024 benchmark. In particular, in this benchmark it can be observed how the SILA index, which uses a measure of cluster compactness, has a lower value with respect to SC, making the value of P_C-SILA bigger with respect to P_C due to the large differences between clusters. An illustration of the clusters obtained with the different algorithms in s1 is presented in Figure 7. The parameters of our approach (d_nb and D_t) have been calibrated to obtain the desired number of clusters. On the other hand, LAMDA-TP creates 15 clusters of poor quality, because it incorrectly assigns individuals to different clusters; whereas LAMDA creates a very large number of clusters, on which a manual merge process would have to be applied. Finally, ADDClustering builds 4 clusters, and according to the results of Table 8 (for s1), the quality of its clusters is not as good as that obtained by LAMDA-RD, for which all quality metrics are the best, e.g. P_C = 2.044 and P_C-SILA = 2.153. The method that follows is ADDClustering, with P_C = 4.532 (almost double); that is, the groups formed by LAMDA-RD have better intra-cluster characteristics (the individuals in the same group are very similar to each other) and inter-cluster characteristics (dissimilar individuals are in different groups).

C. LAMDA-RD PARAMETER CALIBRATION
LAMDA-RD has two calibration parameters that affect the quality and the number of the formed clusters: d_nb (Definition 4) and D_t (Statement 6).
A guideline for the calibration is presented below, showing how the variation of the parameters d_nb and D_t (which must be set by the user) affects the quality of the clusters for the case of R15. Figure 8-a shows the variation of P_C depending on the parameters d_nb and D_t, and Figure 8-b shows its top view, in which the different areas are represented in colors. The yellow zone, e.g. (D_t = 0.1, d_nb = 0.27, P_C = 15), presents high P_C values, which, as detailed in the experimental tests, implies poor quality in the created clusters. Based on this, it is necessary to look for the zone with the minimum P_C, in this case the dark blue zones (with the values D_t = 0.3, d_nb = 0.03, P_C = 1.577), that is, good-quality clusters (the lowest P_C). However, the number of created clusters m must also be considered, which is represented in Figure 9-a as a function of the parameters d_nb and D_t, with its top view in Figure 9-b. In this case, the yellow areas represent a high number of created clusters (D_t = 0.9, d_nb = 0.015, m = 56 clusters), while the dark blue areas (D_t = 0.1, d_nb = 0.27, m = 1 cluster) are not useful, because all the data have been grouped into a single cluster. Finally, we observe the green zone (D_t = 0.3, d_nb = 0.03, m = 15 clusters), which coincides with the values of d_nb and D_t with the minimum P_C (see Fig. 8). So, the general idea of the method is to find a balance between P_C and m.
Based on the results of Figures 8 and 9, the following criteria can be established: -Low values of D_t trigger the merging process between neighboring clusters with low or no density in the overlapped area (see Figure 10-a), which is not adequate, since they produce a non-demanding or low-exigency algorithm (as shown in Figure 9 for the dark blue zones), performing the merge process with separate or dissimilar neighboring clusters, which leads to poor-quality clusters, as shown by the high P_C in Figure 8 for the equivalent zone (yellow area). -High values of D_t produce a more demanding algorithm (as clearly shown in Figure 9 for the yellow zones), since they require a higher percentage of individuals in the overlapping area (see Figure 10-b), performing the merge process only when the neighboring clusters are very close, which improves their quality, as shown by the low P_C in Figure 8 for the equivalent zone (dark blue area). -Low values of the distance between neighbors d_nb produce a more demanding algorithm (see Figures 8 and 9: the best P_C and a non-excessive number of clusters m are obtained with a low d_nb), since the calculation of K_{k,j} is stricter (strongly penalizing the dissimilarity between samples). High values of d_nb produce a non-demanding or low-exigency algorithm, by weakening the penalization of the dissimilarity between samples. Figure 11 shows the recommended zone for the initial parameter calibration, looking for a balance zone in Figures 8 and 9 to obtain a good P_C without creating an excessive number of clusters m. Based on this, it is possible to identify the quadrants of maximum and minimum exigency, and the balanced zone, which can be taken as a starting point to perform the search for the most appropriate D_t and d_nb.
The critical cases occur in the zones of Figure 9 with low D_t and high d_nb, which generate the minimum number of clusters. In the case of R15, the best results are obtained by calibrating d_nb to a small value, as shown in Figure 9, where there is a great variation of D_t. Then, contrasting the results with Figure 8, the smallest P_C must be located on the graph.
From the experimentation, a generic behavior could be observed, leading to the conclusion that the parameter calibration can start with a value of D_t ≈ 0.5 and d_nb ≈ 0.1 × D_t; e.g. for R15 the best values are D_t = 0.3 and d_nb = 0.03 (shaded area of Figure 11).
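The calibration guideline can be sketched as a small grid search balancing P_C against the number of created clusters m. Here `run_lamda_rd` is a hypothetical wrapper (not part of the paper) that runs the algorithm with a given (D_t, d_nb) pair and returns (P_C, m):

```python
def calibrate(run_lamda_rd, m_max, dt_grid, ratio_grid):
    """Grid search for (D_t, d_nb) following the guideline above:
    explore D_t values with d_nb set as a fraction of D_t, and keep
    the candidate with the lowest P_C among those that do not create
    an excessive number of clusters (m <= m_max)."""
    best = None                       # (p_c, m, D_t, d_nb)
    for dt in dt_grid:
        for r in ratio_grid:          # guideline suggests r around 0.1
            d_nb = r * dt
            p_c, m = run_lamda_rd(dt, d_nb)   # hypothetical runner
            if m <= m_max and (best is None or p_c < best[0]):
                best = (p_c, m, dt, d_nb)
    return best
```

Starting the grids around D_t ≈ 0.5 and d_nb ≈ 0.1 × D_t, as suggested, keeps the search inside the balanced zone of Figure 11.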
For example, for the s1 dataset, the clusters formed with different parameter values are shown below. The parameter D_t is initially set to D_t = 0.54, and d_nb is changed until finding the minimum P_C (Figure 12 shows all the values of this metric as the parameters are changed), looking for an adequate number and quality of clusters (in this case, 15 clusters). By increasing the value of d_nb, the algorithm creates fewer clusters, which are better constituted by covering the more dispersed individuals. On the other hand, by keeping the value of d_nb fixed and changing the values of D_t, we observe that the algorithm is less strict when D_t is small, which implies a decrease in the number of clusters, as shown in Figure 13.
The behavior of P_C when D_t is fixed and d_nb is changed is shown in Figure 14-a, while its behavior when d_nb is fixed and D_t is changed is shown in Figure 14-b; in both cases, the minimum P_C is a guide to calibrate these parameters.

D. LAMDA-RD AND OUTLIERS
As detailed in Section III, LAMDA-RD initially has a descriptor normalization stage, which is essential to understand how the algorithm handles the samples, allowing us to analyze the behavior, performance and sensitivity of LAMDA-RD in the presence of outliers. Normalization is essential in LAMDA-RD, since the algorithm does not eliminate outliers; eliminating them is not recommended, because valuable information can be obtained from these data, especially in the context of a stream-mining scenario [46]. For the normalization stage, the maximum and minimum values of each descriptor/feature are required, as shown in Eq. (2). The normalization process is detailed in the pseudo-algorithm presented as ''Algorithm 4''.
If there are descriptors with outliers, then the normalization process limits them to values of 0 or 1, in order to calculate the marginal and global adequacies. To observe the behavior of LAMDA-RD with outliers, we use, for simplicity, the R15 benchmark (600 samples and 2 descriptors). The Maximum and Minimum Limits (MML) of the original dataset are X_min = [3.402; 3.178] and X_max = [17.124; 17.012]. Figure 15 shows the different partitions obtained by the algorithm when the MML of the dataset are modified, so that data outside these limits are considered outliers. Table 9 presents the evaluation metrics of the formed clusters when the MML are modified to artificially create outliers from the dataset's own data, in order to analyze the performance and sensitivity. When we consider the original MML of the dataset (0 outliers), it is observed that the algorithm correctly assigns the samples to each group (see Figure 15-a). In case 2, two MML have been reduced, and it is observed that the algorithm groups the outliers on the edge of the normalization values, merging clusters 7 and 13 of Figure 15-a into cluster 13 of Figure 15-b. In this case, the metrics change minimally with respect to case 1, since in this experiment there are 14 well-separated clusters. In case 3, the four MML have been reduced, and the algorithm is able to group the outliers on the edge of the normalization values without affecting the creation of the internal clusters. Finally, in case 4, the MML of descriptor 1 have been reduced considerably, obtaining 343 outliers, which affects the distribution of the samples. The algorithm groups the outliers on the edge of the normalization values, and merges the groups (7, 8, 10, 11, 13, 14, 15) and (4, 6) of Figure 15-a, forming clusters 1 and 4 of Figure 15-d, respectively, so that 8 partitions are obtained. The performance metrics are quite good and differ from those of the original partition (case 1), because the samples have a new distribution due to the high percentage of outliers, which exceed 50% of the entire dataset. Based on the results obtained, we can determine that LAMDA-RD presents a good response when working with outliers, and it is not sensitive to them thanks to the initial normalization. Finally, in real applications, the MML values could be statistically determined by calculating the interquartile range.

Algorithm 4 Normalization Stage of LAMDA-RD
Input: Sample X = [x_1; x_2; ...; x_j; ...; x_n], X_min = [x_1min; x_2min; ...; x_jmin; ...; x_nmin], X_max = [x_1max; x_2max; ...; x_jmax; ...; x_nmax].
Procedure:
1. Apply Eq. (2) to each descriptor x_j to obtain the normalized descriptor x̄_j.
2. If x̄_j > 1, then x̄_j = 1; if x̄_j < 0, then x̄_j = 0.
End
Output: Normalized sample X̄.
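The normalization stage of Algorithm 4 can be sketched as follows, assuming Eq. (2) is the usual min-max scaling with the stored limits; outliers beyond the limits are clamped to 0 or 1, as described in the text, rather than discarded:

```python
def normalize(x, x_min, x_max):
    """Normalization-stage sketch: min-max scaling per descriptor
    (assumed reading of Eq. (2)), clamping outliers to [0, 1]."""
    out = []
    for xj, lo, hi in zip(x, x_min, x_max):
        xbar = (xj - lo) / (hi - lo)          # min-max scaling
        out.append(min(1.0, max(0.0, xbar)))  # clamp outliers to the edge
    return out
```

With the R15 limits X_min = [3.402; 3.178] and X_max = [17.124; 17.012], a sample whose second descriptor exceeds 17.012 is mapped to exactly 1, landing on the edge of the normalized space, which is why outliers end up grouped at the borders in Figure 15.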

E. REAL CASE STUDY: GAS-LIFT METHOD
Gas-lift is a technology to extract oil from wells that have low reservoir pressure, by reducing the hydrostatic pressure in the tubing. Gas is injected into the tubing, as deep as possible, and mixed with the fluid from the reservoir (see Figure 16). The gas reduces the density of the fluid in the tubing, which reduces the bottom pressure (PWF) and thereby increases the production from the reservoir. The dynamics of highly oscillatory flow in a gas-lifted well can be described as follows: i) gas from the casing starts to flow into the tubing; as gas enters the tubing, the pressure in the tubing falls, which accelerates the inflow of gas; ii) the gas pushes the major part of the liquid out of the tubing; iii) the liquid in the tubing blocks the injection orifice; hence, the tubing gets filled with liquid and the annulus with gas; iv) when the pressure on the injection-orifice side overcomes the pressure on the tubing side, a new cycle starts.
The model of the behavior of an Artificial Gas-Lift (AGL) well (see Figure 17) shows that when the gas injection rate increases, the production also increases until reaching its maximum value; but additional increases in the gas injection cause a production diminution [57], [58]. For the implementation of the AGL method in the field, an instrumentation and control arrangement is needed [57], [59]. For such a task, we need the measurement and control of the following variables (see Figure 18): Pressure of the Injected Gas (GLP), Differential Pressure of the Injected Gas (GLDP), Pressure of the Casing (CHP), and Pressure of the Production Tubing (THP).
The measurement of the injected flow is carried out using the GLP and GLDP variables. The measurement of the casing pressure (CHP) allows knowing the pressure that the gas exerts in the casing, while THP gives the pressure exerted by the fluids in the production tubing, complemented by the pressure of the production line (PLP). Other important variables are the lift gas flow (FGL, expressed in ''mpcgd'', thousands of cubic feet of gas per day), and the Rate of Production (Qprod, expressed in ''BNPD'', net barrels of production per day).
In [60], this case study was used to evaluate the performance of LAMDA in the context of a classification problem (supervised learning), to explain the behavior of these oil wells. In this paper, the goal is to identify the most consistent clusters according to the Rate of Production (Qprod) in a gas-lift well, using unsupervised learning, based on the following set of descriptors: Descriptor 1: Casing Pressure. Descriptor 2: Production Tubing Pressure.
Descriptor 3: Gas Lift Flow. Descriptor 4: Bottom Pressure. These variables have been suggested by the experts as the most appropriate for the identification. The historical data contain 1187 individuals, corresponding to 4 classes: very low production (VLP), low production (LP), normal production (NP), and high production (HP). Table 10 reports the results of the metrics after evaluating the performance of each algorithm, where it can be noted that LAMDA-RD again presents the best results. Based on the proper calibration of its parameters, it is possible to identify the 4 clusters corresponding to the number of classes of Qprod. On the other hand, ADDClustering presents good results; however, the standard deviation of the number of classes found in each iteration is high, which shows that, for this case study, changes in the order of arrival of the data considerably affect the performance of this algorithm.
The results show that LAMDA-RD is better in all the metrics, especially P_C = 1.692, and in the external metrics RI = 0.919, Accuracy = 0.947 and F-measure = 0.908. The P_C metric defines the quality of the clusters, and the external validation indices show the level of correspondence between the created clusters and the real classes of the dataset. Based on this, in LAMDA-RD the external metrics exceed 90%, which indicates an excellent performance of the proposed algorithm if we compare it with the other methods, where the external validation indices are close to 70% in the best case (LAMDA-TP), or take values below 40%, as in LAMDA and ADDClustering. Another detail to be observed is the machine time of LAMDA-RD, which is around twice that of the best (LAMDA-TP), with a difference of 2.468 s.
The real distribution of the data (after ordering them by class) is shown in Figure 19-a; the best clusters created by LAMDA-RD (Figure 19-b) are very similar to the real ones, with a small number of misassigned individuals grouped in the clusters. In other words, the algorithm is able to identify the correct number of clusters, performing a good assignment of the individuals. The best results of LAMDA-TP (see Figure 19-c) show that clusters 1 and 2 are correctly constructed in relation to the real labels; however, in cluster 3 there are partition errors and incorrect assignments, with individuals assigned to cluster 4, such that it is incorrectly partitioned into two groups in relation to the labeled data.
For class 1, LAMDA (see Figure 19-d) generates 1 cluster; for class 2 it generates 2 clusters; for class 3 it creates 5 clusters; and 6 clusters are generated for class 4, which is inadequate and impractical, evidencing the need for a merging algorithm. The results detailed above are summarized in Table 11. Figure 20 shows the ROC (Receiver Operating Characteristic) curve for LAMDA-RD (because only our approach builds the same number of clusters as the real classes), to analyze its sensitivity and specificity in the diagnostic processes. Diagnostic methods with high specificity are required, because we are interested in the negative results being identified correctly by the algorithms; high sensitivity is also required, since each state of the system must give a positive result during the diagnostic test, according to the class represented by each functional state. In a ROC curve, the ideal value is close to the point (0, 1), which represents a very good diagnostic method. ROC curves have been drawn for each class in the case study. Table 12 reports the average metrics of sensitivity, specificity and Area Under the Curve (AUC), where it can be observed quantitatively that LAMDA-RD performs a very good clustering. Note that the curves for classes 1 and 4 are overlapped. In general, although LAMDA-TP achieves good clustering results, the algorithm still makes mistakes; it considerably reduces the number of created clusters, but individuals not adequately characterized by the descriptors are misassigned. LAMDA creates an excessive number of clusters: as seen in Figure 19, it is evident that the real number of clusters is much lower than the one identified by the algorithm. Also, the quality of its clusters is not good, which is supported by the metrics of Table 10. It is clear that LAMDA-RD corrects the problems of the other two algorithms, forming good-quality clusters if an appropriate parameter calibration is performed.
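The per-class sensitivity and specificity behind the ROC points can be computed in a one-vs-rest fashion from the assigned and real labels; a minimal sketch (standard definitions, not the paper's evaluation code):

```python
def sensitivity_specificity(true, pred, cls):
    """One-vs-rest sensitivity (true-positive rate) and specificity
    (true-negative rate) for a single class, the quantities plotted
    per class in a ROC analysis."""
    tp = fn = tn = fp = 0
    for t, p in zip(true, pred):
        if t == cls:
            tp += (p == cls)   # state correctly detected
            fn += (p != cls)   # state missed
        else:
            tn += (p != cls)   # other state correctly rejected
            fp += (p == cls)   # other state wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)
```

A point near (1 − specificity, sensitivity) = (0, 1) for every class corresponds to the near-ideal ROC behavior reported for LAMDA-RD in Table 12.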

F. COMPUTATIONAL COMPLEXITY
We proceed to analyze the computational complexity of LAMDA-RD in terms of memory usage, number of operations and computation time in the clustering tasks. Our program is implemented in Matlab R2020a, and it is run on an Intel(R) Core(TM) i7-8750H @ 2.2 GHz microprocessor. The analysis is based on the spatial complexity (memory usage and arithmetic complexity) and the temporal complexity of the proposed algorithm.

1) MEMORY USAGE
In this subsection, the permanent usage of memory is counted. The number of parameters required to perform the clustering tasks depends on the number of descriptors and of formed clusters, n and m, respectively. According to Eqs. (2) to (15), the number of parameters (#parameters) to be stored in memory is: In addition, if there are N samples, each with n descriptors, the total number of stored values is: It is assumed that each value is stored in 2 bytes of memory [62]. It can be concluded that the memory complexity increases linearly.

2) NUMBER OF OPERATIONS
This subsection evaluates the number of arithmetic operations (arithmetic complexity) used to solve the problem.
Addition, subtraction, multiplication, division, power, and root are considered basic operations. Following the pseudo-algorithm of LAMDA-RD, the number of operations in each step required to assign one sample to a cluster is detailed in Table 13. From Table 13, the arithmetic complexity of LAMDA-RD (C_RD), to be compared with that of LAMDA (C_L), is:

C_RD = 25nm − 18m + n(n_nb² − n_nb + 22) + 2(n_nb + n_k) − 7    (37)

If the LAMDA-RD clustering process is considered without the merge stage, it presents an arithmetic complexity similar to that of the original LAMDA. Both algorithms are linear in the terms n and m, depending especially on the number of descriptors and increasing as new clusters are created. On the other hand, when the merge algorithm is added, as in LAMDA-RD, a quadratic term appears in the number of elements n_nb of the neighbor cluster C_nb, multiplied by the number of descriptors n. It can be concluded that the complexity increases quadratically as more samples are added to the clusters.
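The quadratic behavior of Eq. (37) can be checked numerically with a short sketch. The formula below is the equation as reconstructed here from the text, and the interpretation of n_k as the number of elements in the current cluster is an assumption.

```python
def arithmetic_cost_rd(n, m, n_nb, n_k):
    """Operation count C_RD from Eq. (37), as reconstructed here:
    C_RD = 25nm - 18m + n(n_nb^2 - n_nb + 22) + 2(n_nb + n_k) - 7

    n: number of descriptors, m: number of clusters,
    n_nb: elements in the neighbor cluster C_nb,
    n_k: elements in the current cluster (assumed meaning).
    """
    return 25 * n * m - 18 * m + n * (n_nb ** 2 - n_nb + 22) \
        + 2 * (n_nb + n_k) - 7

# Doubling n_nb roughly quadruples the dominant n * n_nb^2 term,
# showing the quadratic growth discussed in the text:
c1 = arithmetic_cost_rd(n=10, m=5, n_nb=100, n_k=50)
c2 = arithmetic_cost_rd(n=10, m=5, n_nb=200, n_k=50)
assert c2 > 3.5 * c1
```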

3) TEMPORAL COMPLEXITY
The temporal complexity is used to verify the growth in the number of operations performed in the evaluation of each benchmark. Based on the number of samples, we compute the average time required to evaluate each sample, as shown in Table 14.
The computation time of LAMDA-RD increases in most cases with respect to LAMDA, due to the merge algorithm. Generally, the computation time grows with the number of descriptors: in Segment, with 19 descriptors, and Postures, with 15 descriptors, the time required by LAMDA-RD is 4 times greater than that required by LAMDA. In general, for LAMDA-RD it is between two and four times larger, but in some cases it is the same (Dim1024, R15), or even smaller (Hepta). However, as shown in the experimental results, the benefits of our proposal are observed in the quality of the formed clusters. Thus, the temporal complexity increases, but at the same time a considerable improvement in the results of LAMDA-RD is evident.

VI. GENERAL ANALYSIS OF THE RESULTS
With the different tests carried out, we summarize the following results:
- The main objective of this paper is to improve LAMDA in clustering tasks, for which we have proposed LAMDA-RD, an algorithm that in all cases considerably improves the performance of the original algorithm (see the results from Table 2 to Table 10).
- Based on the results of P_C, a metric that combines SC and the WB-index, we can notice that in 4 of the 10 datasets tested (Dim1024, Hepta, Unbalance and s1) LAMDA-RD obtains the best results in terms of performance, while in Segment (high dimensionality), R15 and Aggregation it is close to the best algorithms. In Postures (a benchmark with a large number of samples and high dimensionality) it is the best algorithm when compared to other clustering methods focused on data streams, achieving the objective of this paper of obtaining a competitive algorithm that in all cases improves the performance of LAMDA.
- The tests have been performed on balanced and unbalanced datasets with different overlapping. In cases where there is no overlap, the algorithm works as well as KM, KMD and AHT. With an overlap of 9%, as in the case of s1, LAMDA-RD presents the best results, while with an overlap of less than 20% (s2 and a1) the algorithm has an intermediate performance. It is also noted that the performance decays in s3, which has a 40% overlap (strong non-Gaussian distribution of feature values), where it is complicated to make an online assignment of elements in a streaming-data scenario based on distances and densities.
- In the data stream scenario, LAMDA-RD, based on the performance metrics of Table 6, is widely superior to LAMDA, LAMDA-TP and ADDClustering, since our proposal has a merging process that avoids the creation of an excessive number of clusters. In particular, the expert must calibrate the parameters to obtain an adequate model.
- Our proposal is able to address clustering problems in an appropriate manner. It reaches the best results compared with other well-known clustering algorithms on benchmarks with overlap <20% and on unbalanced datasets, such as Dim1024, Segment, Unbalance and s1. As the number of individuals in the overlap area increases (for instance, in s2, s3 and a1), the algorithm's performance decreases, since it is based on density measurements. The advantage of our algorithm is that it can work in online mode with streaming data; however, it is not adequate when the dataset has a very high number of individuals, because the algorithm's execution time increases considerably.
- An advantage of LAMDA-RD is that it can discover new groups at a low computational cost. The addition of a merging algorithm avoids the creation of an excessive number of poor-quality clusters, which is demonstrated by the performance metrics.
- LAMDA intrinsically performs a split process, generating new classes when it does not identify similarity between the individual and the existing clusters.
However, the quality of the clusters generated is not good, as shown in the experiments, especially in cases of high overlapping; LAMDA-RD corrects this, presenting better results in all the benchmarks.
- The parameters d_nb and D_t regulate the requirements for clusters to be merged. Specifically, when D_t is set to a low value, the merging process is applied to nearby but dissimilar clusters, whereas when it is set to a high value, the merge is applied only to nearby and similar clusters.
- In gas-lift wells, LAMDA creates an excessive number of clusters of bad quality (see Table 11), producing a model that does not characterize the system according to the real classes. LAMDA-TP considerably reduces the number of clusters created; however, it has an issue with the low-production cluster, which it divides into two groups, an inadequate result since the characteristics are similar for all the individuals of that state. LAMDA-RD assigns the majority of individuals to their respective clusters with respect to the real classes and clearly improves the assignment made by LAMDA, performing the merging process in order to reduce the number of classes to the correct value.
- The robust distance (RD) related to d_nb improves the quality of the resulting clusters, since this term penalizes the dissimilarity between the individuals and the clusters. Figures 8 and 9 show how this term affects the quality of the clusters in relation to the minimum P_C and the final number of clusters created, m. These figures show that a proper calibration of the d_nb parameter (related to RD) plays an important role in the final result.
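The role of the two calibration parameters described above can be sketched as a simple merge gate. This is an illustration only, not the paper's merge algorithm: the quantities `center_distance` and `overlap_density` are hypothetical stand-ins for the distance- and density-based similarity measures that LAMDA-RD actually computes.

```python
def should_merge(center_distance, overlap_density, d_nb, D_t):
    """Illustrative merge gate for two neighboring clusters.

    Two clusters are merged only when they are near (their distance is
    below d_nb) and their overlap density reaches the threshold D_t.
    Under this sketch, a low D_t merges nearby but dissimilar clusters,
    while a high D_t restricts the merge to nearby, similar clusters,
    matching the behavior described in the text.
    """
    return center_distance < d_nb and overlap_density >= D_t

# A low D_t accepts a merge that a high D_t rejects:
assert should_merge(0.3, 0.4, d_nb=0.5, D_t=0.2)      # merged
assert not should_merge(0.3, 0.4, d_nb=0.5, D_t=0.8)  # kept apart
```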

A. ADVANTAGES AND DRAWBACKS OF THE PROPOSED EXTENSION
The main advantages of the algorithm are:
• Competitive results in the context of a stream-mining scenario.
• Cluster quality improvement over the original LAMDA algorithm, a result supported by several performance metrics.
• It does not require knowing a priori the number of clusters in the clustering process.
• It works correctly with individuals of around 20 descriptors, as seen in the results, e.g., in the Segment benchmark.
• It is a non-iterative method to obtain the model, reducing the number of operations compared to other clustering methods.
• It is a white box with simple operations that can be easily modified to obtain better results.
The main drawbacks of the algorithm are:
• Two calibration parameters that must be fine-tuned for best results.
• Increased computational complexity compared to the original LAMDA algorithm.
• The temporal complexity increases as the number of descriptors and the number of elements in the clusters to be merged increase.

VII. CONCLUSION
In this paper, extensions for the LAMDA algorithm have been proposed in the clustering context. These extensions are based on two strategies: the first computes the MAD with the Cauchy function, adding a factor that penalizes the individual-cluster dissimilarity in order to make a better assignment to a cluster; the second adds an automatic algorithm to LAMDA to perform the merge process, which analyzes the similarity between neighboring clusters to decide whether this process is carried out or not. The merging algorithm requires additional execution time, because it evaluates the overlap based on distance and density measures of cluster similarity before performing the merge; however, the computational cost is compensated by the algorithm's ability to avoid creating an excessive number of clusters. In addition, the parameters can be calibrated to increase or decrease the number of clusters, a feature that is not available in LAMDA-TP or in LAMDA. In general, the comparative study with LAMDA and LAMDA-TP demonstrates that LAMDA-RD significantly improves the performance and the clusters formed, especially in terms of the SC, WB-index and RI metrics.
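The first strategy above, a Cauchy-type MAD, can be sketched minimally. This is not the paper's exact formulation: the scale `gamma` and the dissimilarity weight `penalty` are hypothetical parameters standing in for the penalization factor described in the text.

```python
def mad_cauchy(x, rho, gamma=0.5, penalty=1.0):
    """Illustrative Marginal Adequacy Degree via a Cauchy-type membership.

    x: normalized descriptor value in [0, 1]; rho: the cluster's learned
    parameter for that descriptor. gamma and penalty are assumptions:
    the paper's exact MAD formulation, including its penalization
    factor, is not reproduced here. The degree is maximal (1.0) when
    the individual matches the cluster and decays with dissimilarity.
    """
    return 1.0 / (1.0 + penalty * ((x - rho) / gamma) ** 2)

# The adequacy degree decreases as the individual moves away from rho:
assert mad_cauchy(0.5, 0.5) == 1.0
assert mad_cauchy(0.9, 0.5) < mad_cauchy(0.6, 0.5)
```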
LAMDA-RD has been tested on several benchmarks with different overlapping percentages, and its results have been compared with those of other clustering algorithms. In these comparisons, it was determined that when the overlapping is 0-20%, our method presents results as good as those of iterative methods (KM, KMD and FCM); as the overlapping increases, its performance decreases, because it becomes more complex to assign an element that shares characteristics of several clusters at the same time, which are difficult to differentiate; in those cases, the iterative methods are the best. Specifically, our proposal is the best in the cases of Dim1024, Segment, Unbalance and s1, that is, when the amount of elements in overlapping areas is not excessive.
The results for the gas-lift well are satisfactory, since the algorithm has been able to identify the expected production states, partitioning the individuals (wells) into good-quality clusters with the greatest similarity. The density threshold D_t and the distance between neighbors d_nb allow the expert to calibrate the number of desired clusters.
As future work, we propose to improve the performance of the algorithm on datasets with strongly non-Gaussian distributions of feature values, and to address in detail the curse of dimensionality in datasets with a very high number of features. We also want to combine the clustering algorithm with supervised-learning features to implement a hybrid algorithm based on LAMDA that can be applied to systems with labeled and unlabeled data. Additionally, we want to improve the algorithm's performance by computing the optimum threshold for each cluster in the merge process, and to formalize more precisely the parameter calibration of the algorithm.