Adaptive Resonance Theory-based Topological Clustering with a Divisive Hierarchical Structure Capable of Continual Learning

Adaptive Resonance Theory (ART) is considered as an effective approach for realizing continual learning thanks to its ability to handle the plasticity-stability dilemma. In general, however, the clustering performance of ART-based algorithms strongly depends on the specification of a similarity threshold, i.e., a vigilance parameter, which is data-dependent and specified by hand. This paper proposes an ART-based topological clustering algorithm with a mechanism that automatically estimates a similarity threshold from the distribution of data points. In addition, for improving information extraction performance, a divisive hierarchical clustering algorithm capable of continual learning is proposed by introducing a hierarchical structure to the proposed algorithm. Experimental results demonstrate that the proposed algorithm has high clustering performance comparable with recently-proposed state-of-the-art hierarchical clustering algorithms.


I. INTRODUCTION
C LUSTERING is an essential technique to extract information from data. With the recent development of IoT technology, the importance of clustering increases as the availability of big data increases. Generally speaking, big data includes a wide range of useful information, e.g., hidden characteristics of data and explicit/implicit relationships among data points and/or attributes. To utilize useful information in big data with a pluralistic and appropriate information granularity, information extraction approaches such as hierarchical clustering are widely studied. In particular, since a divisive hierarchical clustering algorithm can adaptively generate partitions in response to the changes in the data distribution, the algorithm has the potential to realize continual learning, which is an essential ability to efficiently utilize big data.
Typical divisive hierarchical clustering algorithms with an adaptive cluster structure are Growing Hierarchical Self-Organizing Map (GHSOM) [1] and Growing Hierarchical Neural Gas (GHNG) [2]. GHSOM uses Self-Organizing Map (SOM) [3] as a base clustering approach while GHNG uses Growing Neural Gas (GNG) [4]. Both GHSOM and GHNG can grow in the vertical and horizontal directions by constructing a tree-like structure based on the distribu- tion of data points. However, in general, SOM-based and GNG-based clustering algorithms cannot avoid the plasticitystability dilemma [5], i.e., the trade-off between catastrophic forgetting and continual learning of new information.
In the research field of clustering, Adaptive Resonance Theory (ART) [6] is a successful approach for handling the plasticity-stability dilemma. Among existing ART-based clustering algorithms [7]- [12], Fast Topological Correntropy-Induced Metric-based ART (FTCA) [12] has superior clustering performance and functionality than the others. FTCA utilizes the Correntropy-Induced Metric (CIM) [13] as a similarity measure for realizing a fast and stable self-organizing ability while maintaining an appropriate number of nodes. During a learning process, FTCA adaptively generates topological structures by using nodes and edges to efficiently extract information from a dataset.
In our previous study [14] inspired by GHNG [2], we have introduced Hierarchical FTCA (HFTCA) to improve the clustering performance of FTCA [12]. Although HFTCA shows superior clustering performance than GHSOM and GHNG, HFTCA has a parameter (i.e., a similarity threshold) that significantly affects its clustering performance. In general, the optimal value of the similarity threshold depends on the dataset, which is a common difficulty in the use of ART-based clustering algorithms. Moreover, if an ARTbased clustering algorithm has a hierarchical structure, it is necessary to specify a similarity threshold at each layer in advance. In this paper, to solve the above difficulty, we propose CIM-based ART with Edge and Age (CAEA), which has a mechanism to automatically estimate the similarity threshold from the distribution of data points. In addition, we also propose Hierarchical CAEA (HCAEA) for improving the clustering performance of CAEA. Thanks to the continual learning ability of CAEA, HCAEA is also capable of continual learning.
The contributions of this paper are summarized as follows: (i) A new ART-based clustering algorithm, called CAEA, is proposed by introducing a mechanism to automatically estimate a similarity threshold from the distribution of data points. (ii) A new divisive hierarchical clustering algorithm, called HCAEA, is proposed by integrating CAEA into a hierarchical structure inspired by HFTCA. (iii) Empirical studies show that HCAEA has comparable clustering performance than recently-proposed state-ofthe-art hierarchical algorithms: GHNG, HFTCA, and GH-EXIN.
The paper is organized as follows. Section II presents the literature review for clustering algorithms and divisive hierarchical clustering algorithms. Section IV describes details of the proposed algorithm and its hierarchical approach, namely CAEA and HCAEA. Section V presents simulation experiments to evaluate the information extraction performance of the proposed algorithms and the above-mentioned state-ofthe-art algorithms by using real-world datasets. Section VI summarizes characteristics of each algorithm based on the experimental results. Section VII concludes this paper.

II. LITERATURE REVIEW A. CLUSTERING ALGORITHMS CAPABLE OF CONTINUAL LEARNING
Clustering is one of the most widely used approaches for extracting information from data. Gaussian Mixture Model (GMM) [15] and k-means [16] are typical examples of clustering algorithms. Although GMM and k-means are quite simple and highly applicable, they have a common limitation that the number of classes needs to be specified in advance.
One effective approach for solving the drawback of GMM and k-means is growing self-organizing clustering such as GNG [4] and Self-Organizing Incremental Neural Network (SOINN) [17]. GNG and SOINN can adaptively generate topological networks based on a distribution of data points. However, since these algorithms permanently insert new nodes into their networks for memorizing new information, they have the potential to forget learned information (i.e., catastrophic forgetting). This trade-off is called the plasticitystability dilemma [5].
Several clustering algorithms have been proposed to cope with the plasticity-stability dilemma. As a GNG-based algorithm, Grow When Required (GWR) [18] and gamma-GWR [19] are successful algorithms which appropriately calculate a similarity threshold to prevent an excessive node creation. As a SOINN-based algorithm, SOINN+ [20] and SOINN+ with ghost nodes (GSOINN+) [21] can detect clusters of arbitrary shapes in noisy data streams while avoiding catastrophic forgetting. Another successful approach is ART-based clustering algorithms such as Fuzzy ART [7], Bayes ART [8], and their variants [9], [10]. In general, ART-based clustering algorithms show superior clustering performance than GNGbased and SOINN-based algorithms [11], [12]. Moreover, because ART-based clustering algorithms can theoretically realize sequential and class-incremental learning without catastrophic forgetting, a number of ART-based clustering algorithms and their improvements have been proposed in both supervised learning [22]- [24] and unsupervised learning [7], [8], [25], [26]. One common drawback of ART-based clustering algorithms that they need to specify a similarity threshold (i.e., a vigilance parameter). In most cases, the similarity threshold has a significant impact on their clustering performance while its optimal value depends on the dataset.

B. DIVISIVE HIERARCHICAL CLUSTERING ALGORITHMS
There are two types of hierarchical approaches, i.e., an agglomerative approach and a divisive approach [27]. An agglomerative hierarchical clustering algorithm takes a bottomup approach where data points are first considered as separate clusters and then merged into larger clusters based on the similarity between clusters. A divisive hierarchical clustering algorithm takes a top-down approach where all data points are considered as a cluster and then divided into smaller clusters based on the dissimilarity between clusters. In general, agglomerative hierarchical clustering algorithms require the entire dataset because of its learning procedure. Some wellknown divisive hierarchical clustering algorithms such as the bisecting k-means algorithm [28] do not have this ability. That is, they need all data points in advance. Thus, they are not suitable for continual learning whereas they show high clustering performance. In contrast, divisive hierarchical clustering algorithms can be used for continual learning by adapting a cluster structure along with the change in the distribution of data points. Thanks to the above characteristics, divisive hierarchical clustering algorithms have the potential to realize continual learning.
In order to perform continual learning in a divisive hierarchical clustering algorithm, it is necessary to have the ability to adaptively extract information and generate a hierarchical structure in response to sequentially given data points. GHSOM [1] utilizes a growing SOM which can grows in the vertical and horizontal directions based on the distribution of data points. One drawback of GHSOM is raised by a SOM architecture, that is, it is difficult for GHSOM to represent multiple data distributions in a single SOM network. GHNG [2] has successfully resolved the drawback of GHSOM by using GNG instead of SOM. One difficulty of GHNG is a large number of parameters. Growing Hierarchical EXcitatory+INhibitory (GH-EXIN) [29] requires only a few parameters while maintaining comparable performance to GHNG. In general, however, GNG-based clustering algorithms suffer from excessive node generation and high sensitivity to the presentation order of input data points. Moreover, the plasticity-stability dilemma [5], i.e., the tradeoff between catastrophic forgetting and continual learning of new information, is another unavoidable problem. Applying an ART-based clustering algorithm as a basis of divisive hierarchical clustering is a promising approach because it can theoretically avoid the plasticity-stability dilemma. HFTCA [14] showed the superior clustering performance compared to GHSOM, GHNG, and GH-EXIN while maintaining a high information compression ratio. Although HFTCA is the state-of-the-art of ART-based divisive hierarchical clustering algorithm, HFTCA requires careful parameterization at each hierarchy in order to maintain good clustering performance.

III. PRELIMINARY KNOWLEDGE
This section presents preliminary knowledge for a similarity measurement and a kernel density estimator that are used in the proposed algorithm.

A. CORRENTROPY AND CORRENTROPY-INDUCED METRIC
Correntropy [13] provides a generalized similarity measure between two arbitrary data points x = (x 1 , x 2 , . . . , x d ) and y = (y 1 , y 2 , . . . , y d ) as follows: where E [·] is the expectation operation, and κ σ (·) denotes a positive definite kernel with a bandwidth σ. The correntropy can be estimated as follows: In this paper, we use the following Gaussian kernel in the correntropy: A nonlinear metric called CIM is derived from the correntropy [13]. The CIM quantifies the similarity between two data points x and y as follows: Here, thanks to the Gaussian kernel without a coefficient 1 √ 2πσ as defined in (3), a range of the CIM is limited to [0, 1]. In general, the Euclidean distance suffers from the curse of dimensionality. However, the CIM reduces this drawback thanks to the correntropy which calculates the similarity between two arbitrary data points by using a kernel function. Moreover, it has also been shown that the CIM with the Gaussian kernel has a high outlier rejection capability [13].

B. KERNEL DENSITY ESTIMATOR
In general, a bandwidth of a kernel function σ can be estimated based on a kernel density estimator [30], which is defined as follows: where Γ denotes a rescale operator (d-dimensional vector) which is defined by a standard deviation of attribute values of each attribute among N training data points. ν is the order of the kernel. R(F ) is a roughness function. κ ν (F ) is the moment of the kernel. The details of the derivation of (5) can be found in [30].

IV. PROPOSED ALGORITHM
In this section, first CAEA is explained in detail. Next, a hierarchical approach of CAEA, namely HCAEA, is introduced. Table 1 summarizes the main notations used in this paper.  A set of prototype nodes y k = (y k1 , y k2 , . . . , y kd ) d-dimensional prototype node (the kth node) S = {σ1, σ2, . . . , σ K } A set of bandwidths for a kernel function κσ Kernel function with a bandwidth σ CIM Correntropy-Induced Metric k1, k2 Indexes of the 1st and 2nd winner nodes y k 1 , y k 2 The 1st and 2nd winner nodes Similarities between a data point xn and winner nodes (y k 1 and y k 2 ) V threshold Similarity threshold (a vigilance parameter) N k 1 A set of neighbor nodes of node y k 1 M k 1 The number of data points that have accumulated by the node y k 1 λ Predefined interval for computing σ and deleting an isolated node a (k 1 ,k 2 ) Age of edge between nodes y k 1 and y k 2 amax Predefined threshold of an age of edge C = (c1, c2, . . . , c K ) Models of CAEA for constructing a hierarchical structure.

A. CIM-BASED ART WITH EDGE AND AGE
We use the following notations: A set of training data points is X = {x 1 , x 2 , . . . , x n , . . .} where x n = (x n1 , x n2 , . . . , x nd ) is a d-dimensional feature vector. A set of prototype nodes in CAEA at the time of the presentation of a data point x n is Y = {y 1 , y 2 , . . . , y K } (K ∈ Z + ) where a node y k = (y k1 , y k2 , . . . , y kd ) has the same dimension as x n . Furthermore, each node y k has an individual bandwidth σ for the CIM, i.e., S = {σ 1 , σ 2 , . . . , σ K }.
The learning procedure of CAEA is divided into four parts: 1) initialization process for nodes and a bandwidth of a kernel function in the CIM, 2) winner node selection, 3) vigilance test, and 4) node learning and edge construction. Each of them is explained in the following subsections.

1) Initialization Process for Nodes and a Bandwidth of a Kernel Function in the CIM
In the case that CAEA does not have any nodes, i.e., a set of prototype node Y = ∅, the 1st to λ/2th training data points X init = {x 1 , x 2 , . . . x λ/2 } directly become prototype nodes, i.e., Y init = {y 1 , y 2 , . . . , y λ/2 }, where λ ∈ Z + is a predefined parameter of CAEA. Note that λ must be greater than or equal to 2, and if λ/2 is not an integer, it is rounded to the nearest integer. This parameter is also used for a node deletion process.
In an ART-based clustering algorithm, a vigilance parameter (i.e., a similarity threshold) plays an important role in a self-organizing process. Typically, the similarity threshold is data-dependent and specified by hand. On the other hand, CAEA uses the minimum pairwise CIM value between each of nodes in Y init = {y 1 , y 2 , . . . , y λ/2 }, and the average of pairwise CIM values is used as the similarity threshold V threshold , i.e., where σ is a bandwidth for a kernel function in the CIM.
In CAEA, the initial prototype nodes Y init = {y 1 , y 2 , . . . , y λ/2 } have a common bandwidth of the Gaussian kernel in the CIM, i.e., S init = {σ 1 , σ 2 , . . . , σ λ/2 }. When a new node y K+1 is generated from x n , a bandwidth σ K+1 is estimated from the past λ/2 data points, i.e., {x n−λ/2 , . . . , x n−2 , x n−1 } by using (5) and (6). As a result, each new node has a different bandwidth σ depending on the distribution of training data points. Although the similarity threshold V threshold depends on the distribution of the initial λ/2 training data points, we regard that an adaptive V threshold estimation is realized by assigning a different bandwidth σ, which affects the CIM value, for each node in response to the changes in the data distribution.

2) Winner Node Selection
Once a data point x n is presented to CAEA with the prototype node set Y = {y 1 , y 2 , . . . , y K }, two nodes which have a similar state to the data point x n are selected, namely, winner nodes y k1 and y k2 . The winner nodes are determined based on the state of the CIM as follows: 4 VOLUME 4, 2016 where k 1 and k 2 denote the indexes of the 1st and 2nd winner nodes, i.e., y k1 and y k2 , respectively. S is a bandwidth of the Gaussian kernel in the CIM for each node.

3) Vigilance Test
Similarities between the data point x n and the 1st and 2nd winner nodes are defined as follows: The vigilance test classifies the relationship between a data point and a node into three cases by using a predefined similarity threshold V threshold , i.e., The similarity between the data point x n and the 1st winner node y k1 is larger (i.e., less similar) than V threshold , namely: • Case II The similarity between the data point x n and the 1st winner node y k1 is smaller (i.e., more similar) than V threshold , and the similarity between the data point x n and the 2nd winner node y k2 is larger (i.e., less similar) than V threshold , namely: • Case III The similarities between the data point x n and the 1st and 2nd winner nodes are both smaller (i.e., more similar) than V threshold , namely:

4) Node Learning and Edge Construction
Depending on the result of the vigilance test, a different operation is performed. If x n is classified as Case I by the vigilance test (i.e., (14) is satisfied), a new node y K+1 = x n is added to the prototype node set Y = {y 1 , y 2 , . . . , y K }. A bandwidth σ K+1 for node y K+1 is calculated by (9). In addition, the number of data points that have been accumulated by the node y K+1 is initialized as M K+1 = 1.
If x n is classified as Case II by the vigilance test (i.e., (15) is satisfied), first, the age of each edge connected to the first winner node y k1 is updated as follows: where N k1 is a set of nodes that are connected to y k1 by an edge. After updating the age of each of those edges, an edge whose age is greater than a predefined threshold a max is deleted. Then, y k1 is updated as follows: When updating the node, the difference between x n and y n is divided by M k1 . Thus, the larger M k1 becomes, the smaller the node position changes. This is based on the idea that the information around a node, where data points are frequently given, is important and should be held by the node.
In Case II, a counter M for the number of data points that have been accumulated by y k1 is also updated as follows: If x n is classified as Case III by the vigilance test (i.e., (16) is satisfied), the same operations as Case II (i.e., (17), (18), and (19)) are performed. In addition, the neighbor nodes of y k1 are updated as follows: Equation (20) has the same concept as (18), but it should be less affected by the data point than y k1 because it is the neighbor node of y k1 . Thus, the value is multiplied by 1/10.
In Case III, moreover, if there is no edge between y k1 and y k2 , a new edge is generated and its age is initialized as follows: In the case that there is an edge between nodes y k1 and y k2 , its age is also reset by (21).
Apart from the above operations in Cases I-III, as a noise reduction purpose, the nodes without edges are deleted every time λ training data points are given.
The learning procedure of CAEA is summarized in Algorithm 1. Note that the prediction procedure of CAEA is that an unknown data point is assigned to the class of its nearest neighbor node.

B. HIERARCHICAL APPROACH FOR CAEA
The procedure for creating the hierarchical structure of HCAEA is as follows. First, CAEA is trained with a set of training data points X = {x 1 , x 2 , . . . , x n } to generate a topological network (nodes and edges) in the first layer. Supposing that Y = {y 1 , y 2 , . . . , y K } is generated. Here, we preserve training data points that affect each node y k during training process. As a result, we can define a new set of K training data point sets In the second layer, CAEA is independently trained by using each subset X k (k = 1, 2, . . . , K) of training data points. This mechanism is used in a hierarchical manner for the training of CAEA. That is, CAEA in the (h+1)th layer is independently trained by using subsets of training data points, each of which is defined by the corresponding node in the hth layer. The above procedure is repeated until an additional layer is no longer created by CAEA. That is, when the number of nodes (i.e., K) becomes two in the hth layer, it is too small to represent the distribution of data points. As a result, the training for creating the (h+1)th layer is not performed. Create the new node as yK+1 = x l . 5 Calculate the bandwidth for a kernel function σK+1 by (8) and (9).
Calculate the vigilance parameter Vthreshold by (7). Search the indexes of winner nodes k1 and k2 by (10) and (11), respectively. 10 Update the edge age a (k 1 ,j) by (17). 11 if a (k 1 ,j) > amax then 12 Delete the edge. 13 if V k 1 > Vthreshold then 14 Create the new node as yK+1 = x l . 15 Calculate the bandwidth for a kernel function σ k+1 by (8) and (9). 16 else 17 Update the state of y k 1 by (18). 18 if V k 2 ≤ Vthreshold then 19 Update the state of neighbor nodes yj by (20). 20 Create a new edge e (k 1 ,k 2 ) between y k 1 and y k 2 .

21
if the number of data point inputs l is multiple of a topology adjustment cycle λ then 22 forall k ∈ 1, 2, . . . , K do 23 if y k does not have any edge then 24 Remove y k from Y. 25 return the CAEA model.
The learning procedure of HCAEA is summarized in Algorithm 2. Note that the prediction procedure of HCAEA is similar to CAEA. One difference from CAEA is that an unknown data point is assigned to the class of its nearest neighbor node which does not have a child node in the generated tree-like structure.
Algorithm 2: Learning Algorithm of HCAEA Input: a set of training data points: the interval for computing σ and deleting an isolated node: λ, and the threshold of an age of edge: amax. Output: the HCAEA model. model contents a set of training data points for the next layer: Extract a set of data points X k that have accumulated by the node y k as the training data points for the next layer. 6 c k = LearningHCAEA(X k , λ, amax). 7 Update the HCAEA model. return. 10 return the HCAEA model.

V. SIMULATION EXPERIMENTS
This section presents quantitative comparisons focusing on the information extraction performance of CAEA, HCAEA, GHNG [2], GH-EXIN [29], and HFTCA [14] based on the classification performance. In general, the evaluation of the clustering performance is subjective if a dataset does not have label information (i.e., each pattern in the dataset has no class label). In this paper, therefore, we use datasets with label information and perform classification tasks by using a clustering result as a classifier. This approach allows us to indirectly evaluate the clustering performance, that is, the performance of approximating the data distribution. The source code of GHNG 1 , GH-EXIN 2 , and HFTCA 3 are provided by the authors of the related papers.

A. DATASETS
We utilize five synthetic datasets and nine real-world datasets selected from the commonly used clustering benchmarks [31] and public repositories [32], [33]. Table 2 summarizes statistics of the datasets.

B. PARAMETER SPECIFICATIONS
CAEA, HCAEA, GHNG [2], GH-EXIN [29], and HFTCA [14] have their own parameters, which influence their clustering performance. This section presents parameter specifications of each algorithm in detail. Table 3 summarizes parameter settings of each algorithm. In each algorithm, one parameter which has a large impact on the clustering performance is specified by grid search while the other parameters are specified as follows: In GHNG, the parameter τ is unique and rest of them are the same as GNG. Therefore, we use the commonly used GNG parameter settings in GHNG. In GH-EXIN, some parameters are the same as GNG while min card and H perc are unique ones. min card and H perc are specified in [29]. In HFTCA, a topology construction cycle λ is the same as in [14].
During grid search in our experiments, the training data points in each dataset are presented to each algorithm only once without pre-processing. In addition, each training data point is input under a stationary environment, i.e., the training data points are randomly selected from the entire data. For each parameter specification, we repeat the evaluation 20 times (i.e., 2×10-fold cross validation) with a different random seed for obtaining consistent averaging results. Table 4 summarizes parameters which are specified by grid search. Using the parameter specifications in Tables 3 and 4, each algorithm shows the highest Normalized Mutual Information (NMI) [34] score for each dataset under a stationary environment. Note that a symbol "-" in HFTCA means that the layer is not generated.

1) Conditions
In order to demonstrate the information extraction performance of CAEA and HCAEA, we conduct classification tasks not only in a stationary environment but also in a nonstationary environment. In the stationary environment, the training data points are randomly selected from the entire dataset. In the non-stationary environment, the training data points are randomly selected from a specific class in the dataset, and the class is shifted sequentially.
Similar to grid search, we repeat the evaluation 20 times with no pre-processed training data points and a different random seed. The same parameter specifications are used in the stationary and non-stationary environments for each algorithm as explained in Tables 3 and 4. The classification performance is evaluated by Accuracy, NMI [34], Adjusted Rand Index (ARI) [35], and macro-F 1 .
As a statistical analysis, the Friedman test and Nemenyi post-hoc analysis [36] are utilized. The Friedman test is used to test the null hypothesis that all algorithms perform equally. If the null hypothesis is rejected, the Nemenyi post-hoc analysis is then conducted. The Nemenyi post-hoc analysis is used for all pairwise comparisons based on the ranks of results over all the evaluation metrics for all datasets. Here, the null hypothesis is rejected at the significance level of 0.05 both in the Friedman test and the Nemenyi post-hoc analysis. All computations are carried out on Matlab 2020a with 2.2GHz Xeon Gold 6238R processor and 768GB RAM. Table 5 shows the results of the classification performance in the stationary environment. The best value in each metric is indicated by bold. The standard deviation is indicated in parentheses. N/A indicates that an algorithm could not build a predictive model. As an overall trend, GH-EXIN and HCAEA show better performance than CAEA, GHNG and HFTCA. Here, the null hypothesis is rejected on the Friedman test over all the evaluation metrics and datasets. Thus, we apply the Nemenyi post-hoc analysis. Fig. 1 shows a critical difference diagram based on the classification performance including all the evaluation metrics and datasets. Better performance is shown by lower average ranks, i.e., on the right side of a critical distance diagram. In theory, different methods within a critical distance (i.e., a red line) do not have a statistically significance difference [36]. In Fig. 1, GH-EXIN shows the best performance but there is no statistically significant difference from HCAEA. Comparing HCAEA and CAEA, although there is no statistically significant difference, HCAEA shows a lower rank. This suggests that a divisive hierarchical structure of HCAEA has a positive impact on the classification/clustering performance. Table 6 shows the number of generated nodes by each algorithm in the stationary environment. As a general tendency, GH-EXIN generates a large number of nodes and GHNG generates a small number of nodes. The number of nodes in HCAEA is larger than that of CAEA because of its hierarchical structure. Note that the number of layers of VOLUME 4, 2016   N/A indicates that an algorithm could not build a predictive model. A symbol "-" in HFTCA means that a layer is not generated.

2) Stationary Environment
HCAEA is automatically specified by the algorithm. In this experiment, HCAEA generates a single layer for COIL20, Isolet, Sonar, TOX171, and Wine datasets. These datasets have a small number of training data points compared to the number of attributes. From the above observations, it can be concluded that HCAEA can adaptively and automatically specify the sufficient number of nodes and layers for extracting information. Table 7 summarizes the results of training time in the stationary environment. In general, GHNG is faster than all the other algorithms because the number of generated nodes is smaller than the others. Regarding GH-EXIN, the computation speed depends on the number of attributes in the dataset. Namely, GH-EXIN is fast when the number of attributes is small while it is slow when the number of attributes is large. The computational efficiency of HCAEA and CAEA depends on the number of generated nodes. In short, the computation speed is slow when the number of generated nodes is large, and vice versa.    The standard deviation is indicated in parentheses. N/A indicates that an algorithm could not build a predictive model.

3) Non-stationary Environment
In the case of the non-stationary environment, a continual learning ability is required to maintain the classification/clustering performance because the training data points are randomly selected from a specific class of the dataset, and the class is shifted sequentially. In general, whether an environment is stationary or nonstationary is unknown and dynamic. In other words, an algorithm should have characteristics that do not deteriorate its classification/clustering performance in any environment. For this reason, the parameter specifications in the nonstationary environment are the same as in the stationary environment, i.e., Tables 3 and 4. Table 8 shows the results of the classification performance in the non-stationary environment. The best value in each metric is indicated by bold for each dataset. The standard deviation is indicated in parentheses. N/A indicates that an algorithm could not build a predictive model. Similar to the results in the stationary environment, the classification performance of GH-EXIN and HCAEA is generally better than CAEA, GHNG and HFTCA. Here, the null hypothesis is rejected on the Friedman test over all the evaluation metrics and datasets. Thus, we apply the Nemenyi post-hoc analysis. Fig. 2 shows a critical difference diagram based on the classification performance including all the evaluation metrics and datasets. Similar to the results in the stationary environment, GH-EXIN shows the best performance but there is no statistically significant difference from HCAEA. Regarding HCAEA and CAEA, there is a statistically significant difference between these two algorithms. This result indicates that HCAEA has the stable and superior information extraction performance than CAEA in the non-stationary environments. Table 9 shows the number of generated nodes by each algorithm in the non-stationary environment. The general trend is the same as in the stationary environment, where GH-EXIN has a large number of generated nodes and GHNG has a small number of them. Regarding HCAEA, this algorithm generates a single layer for COIL20, Isolet, Sonar, TOX171, and Wine datasets, which are the same as in the stationary environment. Table 10 summarizes the results of training time in the nonstationary environment. The general trend of the computation speed is also the same as in the stationary environment. Namely, GHNG is faster, the computation speed of GH-   EXIN depends on the number of attributes, and the computation speed of HCAEA and CAEA depends on the number of generated nodes. Fig. 3 shows box plots with all the results of NMI obtained by grid search in Section V-B for each dataset in the stationary environment. Depending on the values of the parameters in the grid range, each algorithm shows high/low NMI. By observing the variations of NMI, the sensitivity of an algorithm to each parameter specification can be evaluated. Although the median of NMI is different, as a general tendency, the interquartile ranges are similar for each algorithm on each dataset except for small datasets (e.g., Jain and Wine) and high-dimensional datasets (e.g., COIL20 and Isolet). In regard to the parameters specified by grid search (see Table  4), those values for GHNG, GH-EXIN, and HFTCA are selected from the entire grid range. In contrast, the parameters of CAEA and HCAEA are mostly selected from a certain range, i.e., {24, 26, 28, 30}, although the candidate of grid values are {10, 12, . . . , 30}. This observation suggests that CAEA and HCAEA can apply the same parameter setting to a wide variety of datasets.

E. COMPUTATIONAL COMPLEXITY
This section presents the computational complexity of CAEA, HCAEA, and comparison algorithms. Specifically, CAEA and HCAEA are analyzed in detail.
Here, assuming n is the number of data points, λ is an interval for adapting σ, d is a dimension of a data point and a node, and K is the number of generated nodes. In CAEA, the computational complexity of each process is as follows: Computing a bandwidth of a kernel function in the CIM is O( n λ d) (line 5 in Alg. 1). Computing the CIM is O(nd λ 2 ) (line 7 in Alg. 1), and for sorting the result of the CIM is O( λ 2 log λ 2 ) (line 7 in Alg. 1). Similarly, for computing the CIM in line 9 is O(ndK) and for sorting the result of the CIM in line 9 is O(K log K). In general, λ n and K n. Thus, the total computational complexity of CAEA is O(nd(λ + K)).
In HCAEA, assuming that the number of data points in the m l th partition of the lth layer of HCAEA is n lm l , then the computational complexity is O( n lm l λ d + n lm l d λ 2 + λ 2 log λ 2 + n lm l dK lm l + K lm l log K lm l ). Here, K lm l is the number of generated nodes in the m l th partition of the lth layer. Thus, the total computational complexity from the 2nd to the Lth layer is O( L l=2 m l i=1 ( n li λ d + n li d λ 2 + λ 2 log λ 2 + n li dK li + K li log K li )). In general, λ n and K n. As a result, the total computational complexity of HCAEA is O(nd(λ + K) + L l=2 m l i=1 n li d(λ + m l )). A self-organizing process of GHNG uses GNG but its hierarchical approach is similar to HCAEA. The computational complexity of GNG is O(nK). Thus, the total computational complexity from the 2nd to the Lth layer is Regarding GH-EXIN, its computational complexity is analyzed as O(bJn log b n) [29] where b is the average branching factor, J is the average number of epochs, and n is the number of data points. Note that further information can be found in [29].
A learning algorithm and a hierarchical approach of HFTCA is the same with HCAEA except for a calculation of the vigilance parameter. The computational complexity of the calculation of the vigilance parameter can be ignored here. Thus, the computational complexity of HFTCA is ). The computational complexity for each algorithm is summarized in Table 11.

VI. DISCUSSION
This section summarizes the characteristics of each algorithm based on the results in Sections V-C2 and V-C3 for emphasizing properties of HCAEA and CAEA. In regard to GHNG, the computation speed is faster than all the other algorithms. However, the information extraction performance is inferior to the others. One possible reason for its poor performance is that a network generated by GNG does not adequately approximate the distribution of the data points in each dataset. This drawback can be resolved by increasing the number of training epochs, but this is a major disadvantage in terms of the concept of continual learning.
GH-EXIN maintains the superior information extraction performance both in the stationary and non-stationary environments. However, GH-EXIN tends to generate a large number of nodes, and the computation speed of GH-EXIN depends on the number of attributes in the dataset. Although VOLUME 4, 2016 GH-EXIN shows superior information extraction performance, the above characteristics become disadvantages when the algorithm is applied to a real-world dataset with a large number of attributes. The common disadvantage of the GNGbased algorithms (i.e., GHNG and GH-EXIN), is that there are a lot of parameters to be specified in advance (see Table  3). Moreover, the specified parameter values are significantly for different datasets (see Table 4).
Regarding HFTCA, its information extraction performance is similar to GHNG, and inferior to GH-EXIN, CAEA, and HCAEA. In addition, the vigilance parameter V , which is a predefined parameter, has to be specified to each layer in advance. If the value of the vigilance parameter V is not properly defined, the algorithm cannot generate a sufficient number of nodes to approximate the distribution of data points. Specifically, as shown in Section IV-C, since TOX171 is a high-dimensional data with a small number of data points, HFTCA could not build the predictive model because generated nodes are deleted as isolated nodes before the network (i.e., edge connections) is fully constructed. This problem may be avoided by giving training data points multiple times until the network is sufficiently constructed.
CAEA maintains better information extraction performance than GHNG and HFTCA both in the stationary and non-stationary environments. Moreover, CAEA and HCAEA can apply the same parameter setting to a wide variety of datasets. One disadvantage of HCAEA and CAEA is that these algorithms tends to have poor classification performance over comparison algorithms on datasets with a large number of classes (i.e., COIL20, Isolet, and OptDigits). This is because that CAEA (and also HCAEA) estimates the similarity threshold V and a bandwidth σ for a kernel function from a very small number of training data points compared to whole data points, the similarity threshold V and the bandwidth σ are not suitable for the distribution of data points if the number of classes is large. This problem could be solved by adapting a large value of λ. As shown in Figs. 1 and 2, a unique drawback of CAEA is that the information extraction performance is significantly different between the stationary and non-stationary environments. By introducing a hierarchical structure to CAEA, HCAEA successfully solves the disadvantage of CAEA. In general, HCAEA shows better information extraction performance than CAEA and comparable performance to GH-EXIN without the specification of a large number of parameters. One problem of HCAEA is that the algorithm tends to generate excessive nodes in the case of Iris and Pathbased datasets. However, this problem can be solved by implementing a mechanism to delete unnecessary nodes in HCAEA.

VII. CONCLUSION
This paper proposed an ART-based topological clustering algorithm, called CAEA. CAEA automatically estimates a similarity threshold, i.e., a vigilance parameter, from the distributions of data points. In addition, a divisive hierarchical clustering algorithm capable of continual learning, called HCAEA, was also proposed by applying a hierarchial structure to CAEA. The experimental results showed that the hierarchical structure of HCAEA improves the information extraction performance while solving the disadvantage of CAEA, i.e., the information extraction performance is significantly different between the stationary and non-stationary environments. Moreover, it has been shown that HCAEA has various advantages (i.e. a small number of parameters, an automated mechanism for hierarchical structure design) and comparable clustering performance to recently-proposed state-of-the-art hierarchical clustering algorithms.
In this paper, we focused only on the avoidance of catastrophic forgetting in order to achieve stable continual learning. However, dealing with concept drift, i.e., the change of the concepts in the learned information, is also important for maintaining a continual learning ability [37]. Therefore, as a future research topic, we will consider dealing with concept drift in HCAEA in order to extend the functionality of the algorithm.