Class-Incremental Learning With Deep Generative Feature Replay for DNA Methylation-Based Cancer Classification

Developing lifelong learning algorithms is essential for computational systems biology. Recently, many studies have shown how deep learning (DL) can extract biologically relevant information from high-dimensional data to help understand the complexity of cancer. However, the number of known cancer types has grown into the hundreds, which makes it difficult for systems to classify them efficiently. Moreover, the current state-of-the-art continual learning (CL) methods are not designed for the dynamic characteristics of high-dimensional data, and data security and privacy are among the main issues in the biomedical field. This article addresses three practical challenges for class-incremental learning (Class-IL): data privacy, high dimensionality, and incremental learning. To address them, we propose a novel continual learning approach, called Deep Generative Feature Replay (DGFR), for cancer classification tasks. DGFR consists of an incremental feature selection (IFS) and a scholar network (SN). IFS selects the most significant CpG sites from high-dimensional data. We investigate different dimensions to find an optimal number of selected CpG sites. SN employs a deep generative model for generating pseudo data without accessing past samples and a neural network classifier for predicting cancer types. As the generative model, we use a variational autoencoder (VAE), which has been successfully applied to this research field in previous works. All networks are sequentially trained on multiple tasks in the Class-IL setting. We evaluated the proposed method on publicly available DNA methylation data. The experimental results show that the proposed DGFR achieves significantly superior cancer classification accuracy compared with various state-of-the-art methods.

pair of probes to measure the intensities of methylated and unmethylated alleles at the interrogated CpG sites. Then the methylation level is estimated by measuring the intensities of this pair of probes, called beta-value, ranging from 0 to 1 [9]. The analysis of the DNA methylation level is a key ingredient in the development of cancer prognosis and personalized treatment approaches [10]- [12]. Therefore, the development of highly accurate statistical and computational techniques is required for further DNA methylation-based human cancer analysis.
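As a point of reference, the beta-value described above is commonly written as a ratio of the methylated intensity to the total intensity. The block below is a sketch of that common definition, not a formula taken from this article: the symbols M, U and α are assumed names, and α is a small constant offset (a fixed value such as 100 is often recommended for Illumina arrays) that stabilizes the ratio when both intensities are low.

```latex
% M: methylated probe intensity, U: unmethylated probe intensity (assumed symbols)
% alpha: small constant offset, assumed here
\beta = \frac{\max(M, 0)}{\max(M, 0) + \max(U, 0) + \alpha},
\qquad 0 \le \beta \le 1
```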
In the domain of artificial intelligence (AI), machine learning (ML) has gained increasing attention in various research fields, including bioinformatics and computational biology [13]- [15]. More specifically, a critical research field within ML is DL, which develops a biologically inspired programming paradigm for manifold applications, such as computer vision, natural language processing, audio recognition, speech recognition, social network filtering, bioinformatics, medical image analysis, and material inspection [16]- [18]. In the medical diagnosis of cancer and other diseases, DL can deliver findings comparable, and in some cases superior, to those of human experts [19]. Most DL approaches for DNA methylation data have focused on extracting biologically meaningful lower-dimensional representations, estimating methylation status (imputation), embedding CpG methylation states, and performing classification and regression tasks [20]- [23]. Recent advances in DL, particularly unsupervised approaches, have shown promise for extracting biological knowledge through their application to genetic and epigenetic data [24]. An important advancement in DNA methylation-based DL analysis was the application of the VAE [25], a generative method that samples from the learned distribution of the methylation profiles to generate new data that represent the original data without losing accuracy and complexity. Using these pre-trained generative models, researchers have attempted to develop similar frameworks for feature extraction that can be applied to downstream prediction tasks and can identify biologically meaningful relationships revealed by the VAE latent representation [26]- [31]. Although applications of DL networks to DNA methylation data have become ubiquitous, challenging issues remain and practical methods are still lacking.
Globally, there are more than 100 types of cancer, each with several subtypes [32], and an estimated 15 percent of all human cancers worldwide may be attributed to viruses [33]. New types of diseases have been emerging rapidly, and their behavior is unstable over time. For example, the newly identified COVID-19 represents a significant portion of the global disease burden in 2020 [34]. Real-world computational systems for cancer biology are exposed to continuous streams of patient information and are thus required to learn and remember multiple cancers and diseases from dynamic data distributions [35]. Therefore, developing CL, also referred to as lifelong learning [36]- [41], is highly needed in computational systems. This article focuses in particular on Class-IL [42]- [44] techniques, which learn sets of classes incrementally, for cancer classification tasks.
Another practical issue in the medical system is data security and privacy [45]. CL remains a long-term challenge for DL models, since the continuous accumulation of incrementally available information from non-stationary data distributions generally leads to catastrophic forgetting [46], i.e., the loss of previously learned knowledge when a model is trained on new information. Catastrophic forgetting can be a critical issue for organizations that have to delete historic data for privacy reasons. For example, healthcare facilities might not be able to retain patient data permanently. Typical DL models require a large amount of data to learn their parameters, which is a computationally expensive process, and the model must be re-trained repeatedly whenever new data arrive. To train DL models efficiently without accessing past data, researchers have attempted to use generative models, such as VAEs or generative adversarial networks (GANs) [47], trained on past data [48]- [50]. In particular, deep generative replay [48] methods have significantly advanced CL research in the past years by simply replaying all previous data using pre-trained generative models.
With the accumulation of high-dimensional low sample size (HDLSS) data in the computational systems of real-world bioinformatics fields, Class-IL on these data is a critically important task. Traditionally, dimension reduction or feature selection (FS) techniques are applied as preprocessing before the classification tasks [51], [52]. After FS, some of the lower-dimensional features selected in earlier tasks are not highly significant in later tasks, because some pairs of genes are common to, or specific for, certain cancer types [53], [54]. For example, assume that the 1,000 most significant features are selected from breast and lung cancer. When other cancers arrive, it is unclear whether the previously selected features are still significant. Most of the previous studies [48]- [50] handle a fixed number of features, e.g., image data, which are considered equally for all classes during Class-IL. Due to these unstable characteristics of cancer data, IFS techniques need to be developed. To the best of our knowledge, no studies have been conducted on incremental feature selection for high-dimensional data with deep generative models for Class-IL tasks.
To tackle the aforementioned issues, in this article, we propose a novel Class-IL method based on generative models, called DGFR, to incrementally select the most relevant features from high-dimensional DNA methylation data and then classify human cancer types when a new cancer type arrives. The primary contributions of this article are highlighted as follows:
• We propose a novel DGFR method for high-dimensional DNA methylation cancer classification tasks in a Class-IL manner. DGFR is incrementally trained without accessing past data when new cancer types arrive.
• We also propose a novel IFS technique with deep generative replay. IFS is theoretically simple and memory efficient. It stores only the mean and standard deviation (SD) values of all features in memory and ranks all features based on their variability.
• We introduce a soft replay, which is an updated version of the replay. IFS continuously updates the previously generated replays based on the newly selected features when new cancer types arrive. For past data, features that are selected again are kept, and features that were not previously selected are generated from a normal distribution.
• We compare state-of-the-art continual learning methods on publicly available DNA methylation cancer datasets for class-incremental cancer classification tasks. Comprehensive experiments demonstrate the superior quality of the proposed DGFR method.
We also explore the effect of the number of samples by training on the datasets in different orders: random, ascending, and descending. In real-world cases, it is important to consider the number of samples for HDLSS data. Experiments on the cancer datasets demonstrate the effectiveness of the proposed DGFR method, as it significantly outperforms the baselines.
The remainder of this article is organized as follows. We first review the related works in Section II. In Section III, we formally describe the notations and explain the proposed DGFR method. We then describe the experimental settings in Section IV and show the experimental results, including discussions and analysis, in Section V. Finally, we draw conclusions and outline future work in Section VI.

II. RELATED WORKS
In this section, we briefly summarize the recent research studies sequentially on FS from DNA methylation data, DNA methylation-based cancer classification, and continual lifelong learning.

A. FEATURE SELECTION
Discovering a small number of CpG sites in high-dimensional DNA methylation data that are relevant to a specific cancer could lead to more effective treatments. A small subset of CpG sites, selected from a large number of sites, can correlate strongly with the targeted cancer [55]. Further studies suggested that only a small number of CpG sites can be sufficient markers for a specific cancer [56], in which case the biological relationship of the CpG sites to the target cancer can be easily identified. Generally, FS techniques can be very useful for HDLSS data problems [57], and the right FS strategy is crucially important for the classification performance [58]. There are many FS techniques, which can be divided into three categories: filter, wrapper, and embedded. They differ in the way each technique copes with a high dimension to form a subset of features. Most of the DNA methylation-based cancer studies used variance-based filtering FS techniques to select the most variable CpG sites across samples before applying VAE and classification algorithms [26]- [31]. The advantages of filter techniques are their simplicity and lower computational cost compared to the other two categories. The highly variable CpG sites are assumed to be biologically more meaningful than the less variable sites. Filter techniques are performed as a pre-processing step and can be followed by one or more classification algorithms.
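The variance-based filtering described above can be illustrated with a minimal sketch. It assumes the beta values are held as a list of samples (rows) over CpG sites (columns); the function name `select_top_variance` and the toy values are hypothetical and not taken from the cited studies.

```python
from statistics import pvariance


def select_top_variance(samples, n_keep):
    """Return the indices of the n_keep columns with the highest variance."""
    n_features = len(samples[0])
    # Population variance of every column (CpG site) across all samples.
    variances = [
        pvariance([row[j] for row in samples]) for j in range(n_features)
    ]
    # Rank columns from most to least variable and keep the first n_keep.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return ranked[:n_keep]


# Toy beta values: 4 samples x 3 hypothetical CpG sites.
beta = [
    [0.10, 0.50, 0.90],
    [0.90, 0.50, 0.80],
    [0.20, 0.55, 0.85],
    [0.80, 0.45, 0.95],
]
top = select_top_variance(beta, 2)
```

Because the ranking uses no class labels, the same routine can be run before any downstream classifier.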

B. CANCER CLASSIFICATION
Recently, intensive studies of DNA methylation-based cancer analysis have been conducted on effective training strategies for deep architectures, all of which are based on unsupervised pre-training followed by supervised fine-tuning. Ground-truth labels are often lacking in the bioinformatics domain. Therefore, unsupervised DL approaches such as GANs and VAEs harness the modeling power of DL without the need for accurate labels. Tybalt [26] was developed to extract biologically relevant information from cancer gene expression data with VAEs. The learned features were generally non-redundant and could reveal biologically meaningful relationships among subgroups of samples. Similarly, an unsupervised DL framework with VAEs was applied to DNA methylation data from three breast cancer datasets [27] and two lung cancer datasets [28], [29]. Those DNA methylation-based DL approaches were not designed to be user-friendly for execution, training, and model interpretation. MethylNet [30] was developed to pre-train on data, generate new data, make predictions, and discover unknown heterogeneity with minimal user supervision. Although public cancer data are rapidly increasing, there is still a lack of samples for specific cancer types in research. To alleviate this issue, methCancer-gen [31] was presented to generate a user-specified cancer type dataset by employing a conditional VAE and a neural network-based generative model. It estimates the conditional probability distribution with latent variables and data and produces samples for specific cancer types.

C. CONTINUAL LIFELONG LEARNING
One of the main challenges for computational systems, including computational systems biology, in continual lifelong learning is reducing catastrophic forgetting. Numerous continual learning techniques are available to handle this issue, and they can be grouped into four types: regularization [39], [46], [59]- [63], dynamic architecture [60], [64], [65], rehearsal [39], [42], [66]- [74], and generative replay [48], [75]- [79]. Many approaches combine these techniques to achieve better performance at lower computational and memory cost. Regularization defines a loss that constrains weight updates so that past knowledge is remembered when a model is retrained. In the Class-IL setting, regularization-based techniques are unable to learn the discrimination between tasks, and no regularization method alone can learn to discriminate classes from different tasks [80]. Dynamic architectures of neural networks, e.g., progressive networks, create new weights automatically when new classes arrive. The new weights learn the new tasks, and the old weights are frozen (not modified anymore) to keep past information. The rehearsal strategy is another technique to mitigate catastrophic forgetting; it consists of storing past samples and replaying them to the model while it learns new information. Dynamic architecture and rehearsal techniques are effective but require more memory as the number of new tasks and classes increases. When past samples are not accessible, which is common in the bioinformatics field, rehearsal techniques cannot be used. Instead of storing past samples, generative replay techniques learn models that produce artificial samples as a memory of previous knowledge.
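To make the idea of a regularization loss concrete, the quadratic penalty used by elastic weight consolidation [46] is a widely cited instance. The block below is a sketch of that well-known form, given for illustration only; the symbols λ, F_i and θ* are the usual ones from the continual learning literature and are not notation defined in this article.

```latex
% L_new: loss on the current task
% theta^*_i: value of parameter i after training on the previous task
% F_i: diagonal Fisher information estimate for parameter i
% lambda: strength of the penalty
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{new}}(\theta)
  + \frac{\lambda}{2} \sum_{i} F_i \left(\theta_i - \theta^{*}_i\right)^{2}
```

The penalty slows changes to parameters that were important for earlier tasks, which is the sense in which weight updates are constrained.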

III. PROPOSED METHOD
In this article, we propose a novel continual learning approach for high-dimensional DNA methylation data, called DGFR, which consists of a memory-efficient IFS and SN. SN uses a deep generative model as our generator and a neural network (NN) classifier as our predictor for cancer classification tasks in a Class-IL manner. IFS is used to select the most relevant features from high-dimensional features by considering all classes and SN is used to learn the distributions of the selected DNA methylation data and then predict cancer types. The overall structure of the DGFR method is shown in Figure 1.
For each task, high-dimensional DNA methylation samples (''High Feature'') and their corresponding labels (''Targets'') from new cancer types are fed into a task network of DGFR as inputs. Firstly, we perform a simple variance-based filtering FS technique (''Feature Selector'') to select the most variable CpG sites (''Low Feature'') across all samples. Secondly, we pre-train a generator network (''Generator'') to learn the distributions of inputs, and sample from it to produce pseudo-inputs. Thirdly, we train a classifier that fine-tunes the pre-trained generator network parameters, to classify cancer types on the selected features and their corresponding labels and produce pseudo-targets. When the training data for previous tasks are not accessible, pseudo-inputs (''Replay'') and pseudo-targets (''Soft-Target'') produced by a memory network can be replayed as inputs.
In practice, past information is mostly unavailable in bioinformatics because of data security and privacy constraints. For this reason and memory efficiency, we store selected [...]
Table 1 summarizes the symbols and notations used in this article. We are given a set $T = \{T_1, T_2, \ldots, T_K\}$ of $K$ tasks and a set $D = \{D_1, D_2, \ldots, D_C\}$ of $C$ datasets. For example, when $K = 6$ and $C = 12$, two datasets are considered in each task. For the $i$-th dataset, $H_i$ is the number of high-dimensional features. Here, $X_i^j$ consists of a set of instances $x_i^j$ and a set of labels $y_i^j$.
In each task, we incrementally calculate and update the mean (µ) and standard deviation (σ) of each CpG site, which are used for producing a normal distribution (N). IFS then selects the most relevant low-dimensional (L) features $X^L$ from the high-dimensional raw features based on their variability (σ²). Deep generative models learn the distributions of the selected lower-dimensional data, and fine-tuning the pre-trained model then allows us to perform cancer type prediction. To achieve our goal, we run the IFS and SN networks sequentially; they contain the feature selector (Section III.B), generator (Section III.C.1), and predictor (Section III.C.2) functions, respectively. The algorithm of the DGFR method is explained in Table 2.
In the following sections, IFS and SN (generative and predictive models) networks are explained in detail sequentially.

B. INCREMENTAL FEATURE SELECTION
IFS is a simple variance-based filtering technique that is performed incrementally. High-dimensional data may contain a large amount of irrelevant and redundant information, which may consume a lot of memory and greatly degrade the performance of learning algorithms. Therefore, we need a flexible incremental feature selection technique that runs in a small memory space, which is empty at the beginning, and that updates the features when new cancer types arrive. First, we calculate µ and σ incrementally and store only the calculated values instead of the whole feature values. In the first task (k = 0), µ and σ are calculated as in Equations 1 and 2. At the k-th task, µ and σ are calculated incrementally, without accessing past data, as in Equations 3 and 4.
As shown in Figure 2, we rank all features based on their σ² after the calculations. In DNA methylation analysis, the overall variance of methylation across the samples can be an attractive covariate for filtering. Filtering techniques are commonly used for reducing noise in DNA methylation data; they require only linear time and are not computationally intensive, which is especially valuable when building the learning models has a high computational cost. However, filtering techniques rank the features only by single-feature associations with the class labels, and the number of top-ranked features (L) is determined manually.
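One way to update a mean and standard deviation without revisiting past samples is the standard pairwise (pooled) update sketched below. This is an illustration under stated assumptions and not necessarily the form of Equations 1 to 4: it uses the population SD, it assumes the sample count is stored alongside µ and σ, and the names `batch_stats` and `merge_stats` are hypothetical.

```python
import math


def batch_stats(values):
    """Count, mean and population SD of one feature within one batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return n, mean, math.sqrt(var)


def merge_stats(old, new):
    """Combine (n, mean, sd) summaries of two disjoint sample sets."""
    n_a, mean_a, sd_a = old
    n_b, mean_b, sd_b = new
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Sum of squared deviations of the union (pairwise update formula).
    m2 = n_a * sd_a ** 2 + n_b * sd_b ** 2 + delta ** 2 * n_a * n_b / n
    return n, mean, math.sqrt(m2 / n)


# Toy beta values of one CpG site arriving in two tasks.
task1 = [0.10, 0.20, 0.15]
task2 = [0.80, 0.90]
running = merge_stats(batch_stats(task1), batch_stats(task2))
full = batch_stats(task1 + task2)  # reference computed on all data at once
```

Only three numbers per feature are kept between tasks, so the memory cost does not grow with the number of past samples.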
After selecting the top-ranked L features, we mix the new class samples with the old class samples that are replayed from previous tasks, called hard replay. If a newly selected feature has no replayed values, we generate its values from the normal distribution (N) defined by its stored µ and σ, called soft replay. For all tasks, N is calculated as in Equation 5. Then we apply a generator function to learn the data samples of each cancer type and a predictor function to predict cancer types, sequentially. In the next sections, we explain the scholar network, which can learn new cancer types without forgetting its previous knowledge.
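The soft replay step can be sketched as follows, assuming replayed samples are stored as lists aligned with a list of feature identifiers. The function name `soft_replay`, the identifiers such as `cg_a`, and the clipping of sampled values to the beta-value range [0, 1] are assumptions made for this illustration.

```python
import random


def soft_replay(old_replay, old_features, new_features, stats, rng):
    """Rebuild replayed samples on the newly selected feature set.

    Features present in both selections are copied from the old replay.
    The remaining features are drawn from N(mean, sd^2) using the stored
    statistics and clipped to [0, 1] (clipping is an assumption here).
    """
    position = {name: i for i, name in enumerate(old_features)}
    updated = []
    for sample in old_replay:
        row = []
        for name in new_features:
            if name in position:
                row.append(sample[position[name]])
            else:
                mean, sd = stats[name]
                row.append(min(1.0, max(0.0, rng.gauss(mean, sd))))
        updated.append(row)
    return updated


rng = random.Random(0)
old_features = ["cg_a", "cg_b"]      # hypothetical CpG identifiers
new_features = ["cg_b", "cg_c"]      # cg_b is kept, cg_c is newly selected
stats = {"cg_c": (0.7, 0.05)}        # stored (mean, sd) for the new feature
replay = soft_replay([[0.1, 0.2], [0.3, 0.4]], old_features, new_features, stats, rng)
```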

C. SCHOLAR NETWORK
In the scholar network, introduced in [48], the generator-predictor pair learns the selected low-dimensional features and their corresponding target values, and then produces pseudo-input (replay) and pseudo-target (soft target) pairs, as shown in Figure 3. The produced pairs are mixed with new data samples to update the generator and the predictor networks. The scholar network contains a deep generative model (generator) and an NN classifier (predictor).

1) GENERATIVE MODEL
The generative model refers to any model that generates observable samples. In this article, we employ a VAE as the deep generative model, which maximizes the likelihood of the generated samples under the real data distribution. The architecture of the VAE consists of encoder and decoder components.
The encoder component comprises an input layer, fully connected encoding hidden layers, a distribution layer, and a latent space layer. The distribution layer produces the µ and σ vectors. The latent space layer samples d-dimensional latent vectors, which are used as the extracted and learned features, called the latent variable (z). The encoder function ($f_{enc}$) is summarized in Equation 6, where $q_\phi(z|x)$ is the approximate posterior of the latent variable z and φ denotes the variational (encoder) parameters. The decoder component comprises fully connected decoding hidden layers and an output layer. The output layer is used as the reconstructed input ($\hat{x}$). The decoder function ($f_{dec}$) is summarized in Equation 7, where $p_\theta(x|z)$ is the likelihood of the data x given the latent variable z and θ denotes the generative (decoder) parameters. The objective of the VAE is to reconstruct the input data as closely as possible, that is, to maximize the log-likelihood $\log p_\theta(x)$ and to minimize the mean squared error between the original data and the reconstructed data. The objective function of the generator network (reconstruction loss) is summarized in Equation 8, where $D_{KL}$ is the KL divergence between the approximate posterior and the prior distribution, and ELBO is the variational lower bound on the marginal likelihood of each data point. The VAE network learns the knowledge about DNA methylation data from the original low-dimensional inputs and tries to reconstruct them so that they can be replayed in further tasks.
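For reference, the textbook form of the VAE evidence lower bound is given below. It is a sketch of the standard formulation and may differ in detail from the article's Equation 8. Here φ denotes the encoder parameters, θ the decoder parameters, and p(z) a standard normal prior; the closed-form KL term assumes a diagonal Gaussian posterior with mean µ_j and standard deviation σ_j in each of the d latent dimensions.

```latex
% Evidence lower bound on the marginal log-likelihood of one data point x
\log p_\theta(x) \ge \mathrm{ELBO}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)

% Closed form of the KL term for a diagonal Gaussian posterior and N(0, I) prior
D_{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
  = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right)
```

With a Gaussian decoder of fixed variance, maximizing the first term is equivalent, up to constants, to minimizing the mean squared reconstruction error mentioned in the text.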

2) PREDICTIVE MODEL
To establish a predictive model, we employ a simple NN classifier placed downstream of the generator, which fine-tunes the encoder part and feature extraction layers of the generator network in an end-to-end manner for the task of cancer type prediction. The predictor function ($f_{pred}$) is summarized in Equation 9. The objective of the NN classifier is to predict the true class labels, that is, to minimize the cross-entropy loss between the predicted distribution and the ground-truth distribution. The objective function of the predictor network (classification loss) is summarized in Equation 10. The supervised predictor network assigns each sample of the D datasets to one of the C given cancer types across the K tasks. The predicted targets can be used as soft targets for the subsequent prediction of cancer types in each task of Class-IL.
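The cross-entropy loss and the difference between a one-hot label and a replayed soft target can be sketched in a few lines. This is an illustration only and not the article's Equations 9 and 10; the logits and target probabilities are made-up values, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


logits = [2.0, 0.5, -1.0]                             # scores for 3 classes
hard_loss = cross_entropy([1.0, 0.0, 0.0], logits)    # one-hot ground truth
soft_loss = cross_entropy([0.8, 0.15, 0.05], logits)  # replayed soft target
```

With a one-hot target the loss reduces to the negative log-probability of the true class, while a soft target also penalizes mismatch on the other classes.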

IV. EXPERIMENTAL SETTINGS
In this section, firstly, we describe the experimental dataset used in this article. Then we briefly introduce the baseline methods compared with the proposed DGFR method and their hyperparameters. Finally, we show the metric used for evaluating all methods.

A. DATASETS
Our experiments are conducted on twelve (C = 12) publicly available datasets obtained from the Xenabrowser (TCGA) [81] data portal, which have a total of 2,728 samples, as listed in Table 3. To reduce noise, the top L features were chosen repeatedly on each task by the variance-based feature selection algorithm from the DNA methylation beta values (across all cancers). The selected L features are used as input data and sent to the SN. We performed stratified 10-fold cross-validation for model evaluation. The mean and standard deviation values are reported in the experimental results.
We split the experimental datasets into six (K = 6) tasks, where each task adds two cancer types to classify. For example, the first task is the binary classification of the BRCA and COAD cancers, and the other five tasks are then added incrementally.
To explore the effect of the number of samples among tasks, we used three different dataset ordering strategies: random (Rand), ascending (Asc), and descending (Desc) order.
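The three ordering strategies can be sketched as below, assuming the datasets are ordered by their sample counts and then grouped two per task. The dataset names and counts are placeholders rather than the values in Table 3, and `make_tasks` is a hypothetical helper.

```python
import random


def make_tasks(sample_counts, strategy, per_task=2, seed=0):
    """Order datasets by sample count and group them into tasks."""
    names = list(sample_counts)
    if strategy == "asc":
        names.sort(key=lambda name: sample_counts[name])
    elif strategy == "desc":
        names.sort(key=lambda name: sample_counts[name], reverse=True)
    elif strategy == "rand":
        random.Random(seed).shuffle(names)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Consecutive groups of per_task datasets form the tasks.
    return [names[i:i + per_task] for i in range(0, len(names), per_task)]


# Placeholder datasets and sample counts, for illustration only.
counts = {"A": 50, "B": 400, "C": 120, "D": 300, "E": 90, "F": 210}
asc = make_tasks(counts, "asc")
desc = make_tasks(counts, "desc")
rand = make_tasks(counts, "rand")
```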
In this article, we analyzed the effect of the number of selected features, denoted as L, which is set to {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}. For the feature analysis and the performance evaluation, we selected the 1,000 features with the highest variance across the 2,728 experimental data samples. Figure 4 illustrates the global density of DNA methylation among the 1,000 CpG sites for all twelve cancer types, each represented by the 1,000 features of its cancer samples.
As shown in this figure, there are significant differences in the methylation levels, which help the classification reach good accuracy. Theoretically, DNA methylation can be divided into three levels: low (hypomethylation), medium, and high (hypermethylation) [82]. In general, the density graph shows that hypermethylation and hypomethylation are more frequent than medium methylation for all cancers, which means that most of the CpG sites in this region are hypomethylated or hypermethylated. The density of hypomethylation is higher than the density of hypermethylation in the KIRP, LUAD, LUSC, GBM, BRCA, KIRC, and OV cancer types. By contrast, hypermethylation is more frequent than hypomethylation in the other cancers. In some cancers, such as STAD, medium methylation is more frequent than both hypomethylation and hypermethylation. However, the differences between some cancers are not clearly visible, for example between the LUAD and LUSC cancer types, which makes it difficult for many ML and DL techniques to differentiate between them.

B. BASELINE METHODS
We compared the proposed DGFR method with the state-of-the-art continual learning methods on the experimental datasets in terms of classification accuracy. The baseline methods are divided into four categories as follows:

Regularization
• Elastic Weight Consolidation (EWC) [46]: The regularization term consists of a quadratic penalty term for each previously learned task. The number of quadratic terms grows linearly with the number of tasks.

• Online Elastic Weight Consolidation (Online EWC) [62]: This method is a modification of EWC that determines weight importance by calculating the sum of the previous tasks' Fisher information matrices.
• Synaptic Intelligence (SI) [63]: This method is similar to online EWC, but it determines weight importance online during stochastic gradient descent instead of using Fisher information.

Dynamic architecture
• Learning without Forgetting (LwF) [60]: This is another type of regularization-based method, focused on data, that attempts to transfer past learning experiences from the old model to the new one through knowledge distillation [59]. Dynamic architecture methods, in general, create new weights automatically for learning new tasks.

Rehearsal
• Averaged Gradient Episodic Memory (A-GEM) [39]: This method is another type of regularization method that uses an episodic memory. It replays stored data (''exemplars'') from the given tasks while remaining about as computationally and memory efficient as the regularization methods. A-GEM is an improved version of Gradient Episodic Memory (GEM) [83], which uses inequality constraints to avoid increasing the losses on previous tasks.
• Incremental Classifier and Representation Learning (iCaRL) [42]: This method uses an episodic memory to replay the stored data and calculate class means. Neural networks are used for feature extraction and classification is performed based on a nearest-class-mean rule [84] in that feature space.
• Experience Replay (ER) [74]: A basic rehearsal method which uses an episodic memory to replay the stored data and uses them to augment the incoming data.

Generative replay
• Deep Generative Replay (DGR) [48]: A dual-model architecture of a deep generative neural network model, which creates pseudo-samples that are then intermixed with recently observed data instead of using stored data. We also employed the dual-model architecture consisting of a deep generative model (generator) and a task solving model (solver).
• Deep Generative Replay with distillation (DGR+distill) [78]: This method is similar to DGR, but instead of labeling the replayed inputs as the most likely class according to the previous tasks' model (hard targets), it pairs them with the predicted probabilities for all target classes (soft targets). We also employed the dual-model architecture with distillation.
• Replay-Through-Feedback (RtF) [79]: An architecture that integrates the generative model into the main model, with distillation, by equipping the main model with generative feedback connections. It reduces the computational cost of generative replay.
For a fair comparison, we used the same neural network architecture for all the methods: a multi-layer perceptron with three hidden layers of 1,000 nodes each, with ReLU non-linear activation functions. Except for iCaRL, we used a softmax function as the final output layer and the standard multi-class cross-entropy classification loss for the predictions of the model on the current task data. All models are trained for 5,000 iterations (epochs) per task using the Adam optimizer (β1 = 0.9, β2 = 0.999) [85] with a learning rate of 0.0001. For each iteration, the classification loss is calculated as an average over 64 samples (and the same number of replayed samples) from the current task. For the generative models, symmetric VAE networks with 100-dimensional stochastic latent variable layers are pre-trained separately on all tasks. The standard normal distribution is used as the prior. All the hyperparameters used in this article are summarized in Table 4.

C. EVALUATION METRIC
To measure the performance of the proposed DGFR method and compare it with the baseline methods, we used classification accuracy. Accuracy is the ratio of the number of correctly predicted samples to the total number of tested samples and is calculated directly from the confusion matrix, a table that is often used to describe the performance of a classification model. In the confusion matrix, true positives (TP) and true negatives (TN) are the correct positive and negative predictions. False positives (FP) and false negatives (FN) are the incorrect positive and negative predictions. Accuracy is formalized in Equation 11 as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (11)$$
All experiments are executed on an Intel Xeon E3 hardware platform (32 GB memory, GTX 1080 Ti) running Ubuntu 18.04. We thank the authors of [86] for the great PyTorch implementations of all of the baseline methods. We used the default parameters except for those listed in Table 4. We also used the Scikit-Learn and PyTorch libraries with the Python programming language for all of the analyses.
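In the multi-class case, the same ratio is obtained by dividing the sum of the diagonal of the confusion matrix by the total number of tested samples. The sketch below illustrates this with made-up labels and predictions; the cancer type codes are used only as example class names, and the helper functions are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (diagonal) divided by all tested samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels and predictions, for illustration only.
labels = ["BRCA", "COAD", "GBM"]
y_true = ["BRCA", "BRCA", "COAD", "GBM", "GBM", "COAD"]
y_pred = ["BRCA", "COAD", "COAD", "GBM", "BRCA", "COAD"]
cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
```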

V. EXPERIMENTAL RESULTS
In this section, we present the experimental results, including an analysis of the features selected by IFS and an evaluation of the performance achieved by SN. We also investigate the effect of the number of selected features L to find the optimal value. We then discuss a comparative analysis with the baseline methods and the efficiency of the proposed DGFR method.

A. FEATURE ANALYSIS
At each of the six tasks, we incrementally selected the predefined lower number (L) of features based on their variations using IFS. First, we discuss the descriptive statistics of the selected features concerning L = 1, 000 as shown in Figure 5. In these figures, the number of the newly selected features after feature selection and ranking is indicated as the red rectangular bars. And the number of the kept features from the previous tasks is indicated as the blue rectangular bars. Other statistical results are listed in Appendix A.
In the first task, the initial feature sets that have 1,000 topranked features were selected from the first two categories (Asc = {READ, KIRP}, Desc = {OV, KIRC}, and Rand = {BRCA, GBM}). When new data comes, the means and standard deviations were re-calculated, and the variance-based feature ranking was updated. As shown in the left side of Figure 5, the kept features are increasing by small values when the number of samples is increasing in ascending order. In every new task, 250-350 features were newly selected. That shows that a small number of training samples from specific cancer types cannot express the importance of the selected features. As shown in the middle side of Figure 5, the kept features are increasing by larger values when the number of samples is decreasing in descending order. Especially in the last three tasks, only 20-50 features were newly selected. That shows how sample size influences feature selection. As shown on the right side of Figure 5, the kept and newly selected features are changed inconsistently when the number of samples is given randomly. It depends on the order of the cancer types, and it is most common in practice.
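The kept and newly selected counts discussed above amount to a set comparison between consecutive selections. The sketch below shows that bookkeeping on hypothetical CpG identifiers; the helper `compare_selections` is an assumed name, and the extra "dropped" count is included only for completeness.

```python
def compare_selections(previous, current):
    """Count features kept from the previous task and features newly selected."""
    prev, curr = set(previous), set(current)
    return {
        "kept": len(curr & prev),      # selected in both tasks
        "new": len(curr - prev),       # selected only in the current task
        "dropped": len(prev - curr),   # selected only in the previous task
    }


# Hypothetical top-ranked CpG identifiers in two consecutive tasks.
task1 = ["cg_001", "cg_002", "cg_003", "cg_004"]
task2 = ["cg_002", "cg_004", "cg_005", "cg_006"]
summary = compare_selections(task1, task2)
```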
We also reported that the top-5 selected features were calculated on their variances in every task of different ordering strategies, as shown in Table 5. For example, the CpG site ''cg11201229'' is the most significant feature for the cancer classification task. But it is not ranked first only from the small number of samples of ''READ'', ''KIRP'' cancers. In the sixth task, 40%, 60%, and 20% of the features were reselected from the first task, respectively, ascending, descending, and random ordering strategies. In the first task, the set of selected features are much different in the ordering strategies. In contrast, the same set of features are selected in the sixth task. That means that the importance of specific CpG sites is different for all cancers, and it is necessary to select them adaptively in each task.
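The reselection percentages can be read as a simple overlap between two top-ranked lists. The sketch below assumes the percentages are computed over the top-5 lists of the first and sixth tasks; the function name and the feature identifiers are placeholders, not CpG sites from Table 5.

```python
def reselected_fraction(first_task_top, last_task_top):
    """Share of the last task's top-ranked features already top-ranked in the first task."""
    first = set(first_task_top)
    return sum(f in first for f in last_task_top) / len(last_task_top)

# Placeholder identifiers: two of the five features appear in both lists.
first_top5 = ["f1", "f2", "f3", "f4", "f5"]
sixth_top5 = ["f1", "f9", "f3", "f8", "f7"]
fraction = reselected_fraction(first_top5, sixth_top5)
```

With two shared entries out of five, the fraction is 0.4, which corresponds to a value of 40%.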

B. PERFORMANCE EVALUATION
We considered four different types of state-of-the-art algorithms and compared them in terms of classification accuracy. The average accuracy results of the proposed DGFR method and the compared algorithms are shown in Table 6, with L set to 1,000. Detailed and additional performance evaluation results are listed in Appendix B. We evaluated each model by using stratified ...
We used the ''None'' method as the lower bound, which was trained sequentially on all tasks in the standard way, also called fine-tuning. We used the ''Offline'' method as the upper bound, which was trained on the whole data of all tasks, also called joint training. As we can see, the ''None'' and the regularization-based methods cannot learn any task except the first. The knowledge distillation-based LwF and the rehearsal-based A-GEM methods work better than the other regularization-based methods, but their results are still unsatisfactory. The remaining methods achieved good, comparable results. The comparison shows that the IFS technique is efficient and can boost classification accuracy. Firstly, in the ascending order, the DGFR and DGFR+distill methods achieved average accuracies of 92.01% and 91.10%, respectively, improving on the other results by 4.19% and 3.28%, respectively. Secondly, in the descending order, the rehearsal-based iCaRL and ER methods achieved average accuracies of 92.25% and 89.02%, respectively. iCaRL improved on the other results by 2.48%, and ER is also comparable to the generative and the ''Offline'' methods. The proposed DGFR and DGFR+distill methods achieved average accuracies of 88.95% and 89.77%, respectively. We found that iCaRL is very sensitive to different random seeds, whereas the generative models show robust results on the experimental cancer datasets. Finally, in the random order, the DGFR and DGFR+distill methods achieved average accuracies of 91.75% and 93.48%, respectively, improving on the other results by 1.4% and 3.13%, respectively.
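As a quick arithmetic check of the gains quoted above, the sketch below backs out the reference accuracy implied by each reported pair. It assumes the gains are percentage-point differences from a single reference value per ordering, which the text does not state explicitly; the variable and function names are illustrative.

```python
# (method, reported average accuracy in %, reported gain) per ordering strategy
reported = {
    "ascending": [("DGFR", 92.01, 4.19), ("DGFR+distill", 91.10, 3.28)],
    "descending": [("iCaRL", 92.25, 2.48)],
    "random": [("DGFR", 91.75, 1.40), ("DGFR+distill", 93.48, 3.13)],
}

def implied_reference(accuracy, gain):
    """Reference accuracy implied by a reported accuracy and its reported gain."""
    return round(accuracy - gain, 2)

implied = {
    order: [implied_reference(acc, gain) for _, acc, gain in rows]
    for order, rows in reported.items()
}
```

Under this reading, both ascending pairs imply a reference of 87.82%, the descending pair implies 89.77% (the value reported for DGFR+distill in that order), and both random pairs imply 90.35%.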
We also report the classification accuracy of each task. In Figures 6-8, the left, middle, and right panels show the results of the regularization-based methods, the rehearsal and generative methods, and the comparison of the DGR and DGFR methods, respectively. In the left panels, gray indicates the regularization-based methods, which failed on the cancer classification tasks except for the first task. Blue indicates the LwF method, which failed in the beginning tasks and started learning in the later tasks. Orange indicates the A-GEM method, which showed satisfactory results in the beginning and failed in the later tasks. The other methods are comparable, with only small differences visible in the figures.
As shown in Figure 6, the performance of the rehearsal and generative methods decreases in the last tasks when the tasks are in ascending order of sample count. The reason is that the features in the first tasks were selected from a small number of samples, and models built on those features struggled to generalize in the last tasks. All methods except DGFR failed in the sixth task, with accuracies below 60%. The DGFR and DGFR+distill methods achieved an accuracy of 96.43% in the fifth task and 80.00% and 66.67%, respectively, in the sixth task. In comparison, Figure 7 shows that all methods worked well, with accuracies above 60% in all tasks, except for the ''Offline'' method in the first task. The reason is that, in descending order of sample count, the features were selected from a larger number of samples in all tasks. All the generative methods (DGR, DGR+distill, RtF, DGFR, and DGFR+distill) achieved an accuracy of 100% in the fifth task. Figure 8 shows the performance evaluation for the random order, where the results vary with the number of samples in each task. For example, all methods show lower accuracies in the fifth task because the cancers ''OV'' and ''KIRC'' have a large number of training samples. That means that a large set of features was newly selected, and the previously trained models failed on the new task. DGFR+distill achieves an accuracy of 86.67% in the fifth task. As we can see, the DGFR and DGFR+distill methods significantly improved on the accuracies of the DGR and DGR+distill methods, respectively, in all tasks and under all ordering strategies.

C. EFFECT OF NUMBER OF SELECTED FEATURES
For high-dimensional data, finding the optimal number of lower-dimensional features produced by the selection and transformation stages is an important step. We investigated the effect of the number of selected features L, which is set to each value in {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}, and its optimal values in terms of classification accuracy. The detailed results are listed in Appendix B. Figures 9-11 show the results obtained by all baseline and proposed DGFR methods with different numbers of features. As discussed above, we illustrate the results of the regularization-based methods, the rehearsal and generative methods, and the comparison of the DGR and DGFR methods in the left, middle, and right panels, respectively, using the same colors for all methods. As shown in the figures, the regularization-based methods failed completely. In contrast, the rehearsal and generative methods show satisfactory and comparable results, except for A-GEM. However, we found that iCaRL is very sensitive to different random seeds, especially when the tasks are in random order of sample count, as shown in Figure 10. The DGFR and DGFR+distill methods significantly improved on the accuracy of the baseline methods in all experiments. Figure 9 shows that accuracy increases as the number of selected features increases. RtF shows an accuracy of 87.82% with 1,000 features, which is the best result among the baseline rehearsal and generative methods. In comparison, DGFR and DGFR+distill achieve accuracies of 93.00% and 93.25%, respectively, with only 400 features, which is already satisfactory. This also improves on the DGR results by approximately 10.60%. As shown in Figure 10, all the methods achieve satisfactory results with 200 features. For example, DGFR+distill with 200 features achieves an accuracy of 92.84%, which improved the accuracy of the previous task by 8.85%. Compared with the baseline methods, it improved on the accuracy of iCaRL by 2.43%. Similarly, DGFR+distill with 200 features shows a satisfactory accuracy of 94.24%, which improved the accuracy of the previous task by 7.90%. It also improved on the accuracy of DGR+distill by 4.53%, as shown in Figure 11. In conclusion, we found that the optimal number of features L is between 200 and 400. These dimensions are therefore the most suitable for reducing the high-dimensional data to a lower-dimensional space on the experimental datasets.
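One way to turn such a sweep over L into a concrete choice is to take the smallest L whose accuracy is within a tolerance of the best one. The selection rule, the tolerance, and the function name below are assumptions for illustration, and the accuracy curve is synthetic, generated by a formula rather than taken from the reported results.

```python
import math

def smallest_sufficient_l(accuracy_by_l, tolerance=1.0):
    """Smallest L whose accuracy is within `tolerance` points of the best accuracy."""
    best = max(accuracy_by_l.values())
    for l in sorted(accuracy_by_l):
        if best - accuracy_by_l[l] <= tolerance:
            return l

# Synthetic saturating curve over the same grid of L values (not measured data).
grid = range(100, 1001, 100)
synthetic = {l: 95.0 * (1.0 - math.exp(-l / 100.0)) for l in grid}
chosen = smallest_sufficient_l(synthetic, tolerance=1.0)
```

A looser tolerance favours smaller feature sets, trading some accuracy for lower dimensionality.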

D. DISCUSSIONS AND ANALYSIS
Feature selection is one of the most important steps for high-dimensional biomedical data. At the same time, Class-IL is essential for the development of computational systems in bioinformatics. Most state-of-the-art Class-IL algorithms are designed for a fixed set of features, e.g., visual features. For cancer classification tasks, a CpG site can be highly significant for specific cancers and not for others. As the number of cancer types increases, the significance of specific CpG sites can change according to their variability. We found that the ''cg11201229'', ''cg25600606'', and ''cg27592318'' CpG sites are the most variable features, and we reported their changes across the tasks in Table 5. A predefined set of features cannot express the characteristics of all cancer types that will arrive in the future. It is therefore necessary to develop an incremental feature selection algorithm that can handle previously learned features and new features adaptively. In practice, new cancer types with a lower number of samples will be added to a learning system that has already learned from a higher number of samples of old cancer types. We prepared three ordering strategies: ascending, descending, and random. We then compared the baseline and proposed methods for each ordering strategy in terms of accuracy.
In this article, we focused on feature selection from high-dimensional DNA methylation data by taking advantage of the current state-of-the-art algorithms. We chose the DGR method because of its generative capabilities. The other generative method is RtF, which was designed to lower the computational cost of generative replay but shows lower accuracy than DGR on most tasks. We aimed to design the proposed method to achieve high accuracy with acceptable computational time. We also found that iCaRL is very sensitive to different random seeds.
In bioinformatics, data security and privacy are critical challenges. We tested variance-based filtering selection algorithms, which are simple but highly effective for high-dimensional data. We hope that the experimental and analysis results motivate other researchers in the field of computational biology. The experimental results discussed above demonstrate the efficiency of the proposed DGFR method.
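The generative replay idea behind DGR can be sketched as follows: when a new task arrives, the real samples of that task are mixed with pseudo-samples drawn from the previous generator and labelled by the previous classifier, so no past data are stored. This is a framework-agnostic sketch under stated assumptions: `build_replay_set`, `prev_generator`, and `prev_solver` are hypothetical names, the generator and classifier are plain callables standing in for the VAE and the neural network, and all inputs are assumed to share the same selected feature space.

```python
import numpy as np

def build_replay_set(x_new, y_new, prev_generator=None, prev_solver=None, n_replay=0):
    """Mix real samples of the current task with pseudo-samples of earlier tasks."""
    x_new = np.asarray(x_new, dtype=float)
    y_new = np.asarray(y_new)
    # First task (or replay disabled): train on the real data only.
    if prev_generator is None or prev_solver is None or n_replay <= 0:
        return x_new, y_new
    x_old = np.asarray(prev_generator(n_replay), dtype=float)  # pseudo features
    y_old = np.asarray(prev_solver(x_old))                     # pseudo labels
    return np.vstack([x_new, x_old]), np.concatenate([y_new, y_old])

# Toy stand-ins for a trained generator and classifier (assumptions, not a VAE).
rng = np.random.default_rng(0)
toy_generator = lambda n: rng.normal(size=(n, 4))
toy_solver = lambda x: (x[:, 0] > 0).astype(int)

x_mix, y_mix = build_replay_set(
    np.ones((3, 4)), np.array([2, 2, 3]), toy_generator, toy_solver, n_replay=5
)
```

The ratio of real to replayed samples is a design choice that this sketch leaves to the caller through `n_replay`.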

VI. CONCLUSION AND FUTURE WORK
In this article, we proposed a Class-IL method, called DGFR, which consists of an IFS and an SN. The SN contains a deep generative model and a neural network classifier. We used variance-based filtering as the feature selector, a VAE as the generator, and a neural network classifier as the predictor. We ranked the features by their variability, and IFS adaptively selects the top-ranked features in each task. The VAE-based generative models were pre-trained on the selected features for further analysis. Finally, we used a simple neural network to classify cancer samples into cancer categories.
We collected a total of 2,728 samples from 12 cancers from the public data portal. The state-of-the-art Class-IL algorithms were evaluated on the dataset and compared with the proposed DGFR method in terms of accuracy. To find an optimal number of features, we set it to each value in {100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}. We chose 200-400 features as the optimal values because of their satisfactory performance. The proposed DGFR and DGFR+distill methods significantly improved on the accuracies of the DGR, DGR+distill, and other baseline methods. We also tested three different ordering strategies: ascending, descending, and random. With the proposed DGFR and DGFR+distill, respectively, we achieved the highest average accuracies of 93.20% (400 features) and 93.25% (400 features) for the ascending setting, 92.74% (300 features) and 92.84% (200 features) for the descending setting, and 95.08% (300 features) and 95.52% (400 features) for the random setting.
In future work, we will apply the proposed method to other high-dimensional biomedical tasks in a Class-IL way, e.g., gene expression data. The feature selection step is the most important. We will focus on developing improved feature selection algorithms in terms of performance, memory efficiency, and computational time. We will also consider more efficient deep generative models and classification algorithms.

ERDENEBILEG BATBAATAR received the M.S. and Ph.D. degrees in data mining, medical informatics, and computer science from the Database and Bioinformatics Laboratory, Chungbuk National University, South Korea. He is currently a Postdoctoral Researcher of bioinformatics and computer science with Chungbuk National University. His research interests include software engineering, data mining, big data analysis, bioinformatics, machine learning, deep learning, and their applications.

Researcher. He is currently a Professor with the Faculty of Information Technology, Ton Duc Thang University, Vietnam, an Emeritus and the Endowed Chair Researcher with Chungbuk National University, South Korea, and also an Adjunct Professor with Chiang Mai University, Thailand. He is also an Honorary Doctorate of the National University of Mongolia. He has been not only the Leader of the Database and Bioinformatics Laboratory, South Korea, since 1986, but also the Co-Leader of the Research Group, Data Science Laboratory, Vietnam, since March 2019. He is also the former Vice-President of the Personalized Tumor Engineering Research Center. He has published more than 1000 refereed technical papers in various journals and international conferences, in addition to authoring a number of books. His research interests include databases, spatiotemporal databases, big data analysis, data mining, deep learning, biomedical informatics, and bioinformatics. He has been a member of the ACM since 1983. He has served on numerous program committees, including roles as the Demonstration Co-Chair of the VLDB, the Panel and Tutorial Co-Chair of the APWeb, and the FITAT General Co-Chair.