Two-Stage Prediction of Comorbid Cancer Patient Survivability Based on Improved Infinite Feature Selection

The modeling of comorbid cancer patients' survivability has theoretical significance and practical value. Cancer survivability prediction may provide guidance for clinical decision making and personalized medicine. The Surveillance, Epidemiology, and End Results (SEER) program provides large datasets suitable for analysis with machine learning methods. In this study, we treat survival prediction as a two-stage problem. The first stage predicts the five-year survivability of patients. For those whose predicted outcome is 'death', the second stage predicts the remaining survival time. Male and female comorbid cancer cases (male genital and urinary cancers for men and breast and female genital cancers for women) were identified from the SEER database and labeled. In the classification stage, the dataset was processed with improved infinite feature selection (Iinf-FS) and random undersampling-based data balancing. These two methods address the issues of a biased dataset and poor classification accuracy. In the lifespan prediction stage, unsupervised infinite feature selection (UinfFS) was applied. The results indicate that the proposed method is effective.


I. INTRODUCTION
Due to the promotion of cancer screening, developments in medical science, and improvements in supportive care, cancer prognosis has improved significantly. The 5-year cancer survival rate in 2016 was twice as high as that in 1950 [1]. Cancer survivors have a higher risk of developing a second primary cancer; the rate is estimated to be 14% higher than the probability of developing a primary cancer among people who have never had cancer [2]. The increasing number of cancer survivors and the aging population are the main causes of the increasing number of multiple primary cancer (MPC) patients. The concurrence of multiple different cancers is also called cancer comorbidity.
Some researchers defined comorbidity as coexisting noncancer diseases [3], while others used the term for the comorbidity of cancer and noncancer diseases. In [4], comorbidity represented coexisting cancers. An investigation of historical cancer cases reveals that certain types of cancers are more strongly correlated. Urinary and male genital cancers are the most common codeveloping cancers; breast and female genital cancers are the third most common cancer comorbidity [5].
Cancer survival prediction has been a popular research topic. Accurately predicting patients' survivability may provide doctors with medical advice and help prescribe personalized medicines. Survivability is the probability of a patient being alive for more than five years from the time of cancer diagnosis. It is a medical indicator for evaluating treatment effects. Most cancer survivability studies aim to predict patients' five-year survivability, which provides a limited amount of information for medical decisions: if a patient's prediction is 'death', the survival time of the patient remains unknown. Survival time prediction should be studied to provide more precise information for medical decision making [6].
Cancer survival studies are challenging due to the lack of large-scale medical data that are available to the public. The Surveillance, Epidemiology, and End Results (SEER) program is an open-source database providing de-identified, coded, and annotated information on cancer statistics in the United States [7], [8]. The scale of the data is large enough to be analyzed with machine learning techniques [9]. This article aims to predict survival time on a monthly scale. However, survival time prediction has been shown to be challenging, since large generalization errors often occur when one-stage regression models are used [10], [11]. To resolve this issue, a two-stage prediction model is proposed. In the first stage, a classifier predicts whether a patient can survive for more than five years. In the second stage, a regression model predicts the survival time of patients who have been classified as not being able to survive for five years.
In the classification stage, two problems are encountered: a biased dataset and poor classification performance. To demonstrate the bias problem, a survival time histogram is used as an example in the next section, and the classification performance of a support vector machine is calculated. To resolve the bias, random undersampling is adopted to balance the training dataset. To deal with the poor classification performance, improved infinite feature selection (Iinf-FS) is proposed and cascaded with the support vector machine classifier. The feature selection methods compared for the two-stage classifier are CHI2, feature selection via eigenvector centrality (ECFS), and mutual information-based feature selection [12]. These feature selection methods are publicly available [13]. In the regression stage, the above improvements cannot be applied because the predicted outcome is continuous. However, without data preprocessing the error rate is high and the training time is long. Moreover, Iinf-FS cannot be used because it requires class labels. Thus, an unsupervised infinite feature selection (UinfFS) method, which does not require class information, is adopted instead. The unsupervised feature selection methods compared in the regression stage are feature selection via nonnegative spectral analysis (NDFS) [14], L2,1-norm regularized discriminative feature selection (UDFS) [15], [16], and PCA-based contribution matrix feature selection [17]. The main contributions of this article are as follows:
• Consider the survivability prediction problem to have two stages.
• Create two cancer comorbid datasets from the SEER database: breast and female genital cancers for women and urinary and male genital cancers for men.
• Apply Iinf-FS to feature selection in the classification stage.
• Apply random undersampling to balance the training data in the classification stage.
• Apply UinfFS in the regression stage.
• Compare the two-stage classification and regression model with the one-stage regression model. The one-stage approach achieves a classification accuracy of 81.76% and a regression RMSE of 22.566; its performance in both the classification and regression tasks is worse than that of the proposed two-stage framework. In the classification stage, the prediction accuracy of the improved two-stage method is 86.42%, higher than that of the original linear support vector machine (Linear-SVM) (82.52%) and of other state-of-the-art improvement methods. In the second stage, the RMSE of the improved random forest (RF) method is 13.267, better than the original RF method's RMSE of 13.529 and those of the other feature selection methods.
The remainder of this article is organized as follows. Section 2 introduces related work on cancer comorbidity, data balancing, feature selection, data cleaning, and the application of machine learning techniques to cancer survival prediction. Section 3 provides the detailed methods and experimental procedures. Section 4 presents the experimental results, while Section 5 discusses the results. Section 6 concludes the paper and presents possible future research directions.

II. RELATED WORK
Lynch et al. predicted the survival prognosis of lung cancer patients via supervised machine learning techniques [7]. Other researchers used machine learning models to predict breast cancer patients' survivability [18]-[20]. A Gaussian k-base naive Bayes (NB) classifier system was proposed by Kaviarasi et al. to enhance the lung cancer classification accuracy of the NB classifier and linear regression [21]. Few scholars have focused their attention on patients with cancer comorbidity. Urinary and male genital cancers are the most common cancer comorbidities, and breast and female genital cancers are the third most common concurring cancers; predicting these cancer comorbidities is therefore essential in cancer studies. Here, cancer survivability refers to the probability of a patient being alive 60 months after the diagnosis of the second cancer.
Machine learning techniques have been widely applied in medical fields [18]-[20]. Liu et al. applied an improved clustering algorithm for sample cutting to improve the category representation capability of the training samples; the experimental results indicated that the method improves classification efficiency [22]. Ryu et al. compared machine learning techniques with statistical methods to predict survivability in spinal ependymoma patients and discovered that lower-grade histology and a higher extent of surgical resection were the key prognostic factors; they concluded that therapeutic factors are associated with improved overall survival. Machine learning methods generally perform better in prediction tasks, but the datasets are heterogeneous and complex, with numerous missing values [23]. Researchers have also examined several recently published papers on breast cancer survival prediction in which stage-specific prediction tasks were performed using machine learning methods. Stage-specific prediction models and joint models were created and compared, concluding that data-driven knowledge obtained with machine learning methods must be subject to over-time validation before it can be clinically and professionally applied [24].
Zolbanin et al. used cancer data collected from the SEER database to create two comorbid datasets: one for breast and female genital cancers and another for prostate and urinary cancers. Several popular machine learning techniques were then applied to the resultant datasets to build predictive models. The results showed that having more information about the comorbid conditions of patients improved the predictive power of the model [5]. In addition to Zolbanin's study, Naghizadeh et al. also conducted research on comorbid cancer patients focusing on data preprocessing, including feature selection and data cleaning. The feature selection procedure was performed by applying the least angle regression, least absolute shrinkage and selection operator, and stepwise regression algorithms. The accuracy was increased, and the error rate was decreased [25]. Some researchers considered the survival prediction of colorectal cancer patients in the SEER database to be a two-stage process: the first stage was to predict survival, and the second was to predict the remaining life span of the patients whose predicted outcome was 'death'. The first stage adopted a tree ensemble classification method that took into account the imbalanced data. The regression stage used a tree-based selective ensemble regression method called SRRT-SEM [6].

III. MATERIALS AND METHODS
SEER is an open-source database that provides de-identified, coded, and annotated information on cancers in the United States [26]. The database is large enough to provide plenty of cases to be analyzed with machine learning techniques. The cases were selected from the Incidence - SEER 18 Regs Research Data, November 2018 Sub database. The cancer diagnoses in the SEER cancer registries were verified clinically or microscopically by a recognized medical practitioner, as required by law [27].

A. TWO-STAGE PREDICTION
Most cancer prognosis studies are limited to predicting whether a patient can live for a specific amount of time, classifying the patient as 'survived' or 'dead'. Since comorbid cancer has a high mortality rate, most cases would be classified as 'dead', and the survival time for these patients remains unknown. Therefore, we propose a two-stage model consisting of a classification model that predicts the patient's survivability and a regression model that predicts the remaining lifespan of patients whose predicted outcome is 'dead' [6].
Both stages have similar procedures except for the base machine learning types. Linear-SVM classifiers are adopted in the classification stage to predict the survival condition, and RF regressors are used in the regression stage to predict survival months.
In the classification stage, two problems are encountered. The first is that a biased training set would yield a biased classifier: cases from the minority class would be misclassified as the majority class. To solve this problem, data balancing is required, and random undersampling is used to balance the datasets. The second problem is that the feature pool is large and the classification result is poor. Iinf-FS is therefore cascaded with the support vector machine classifier to select a subset of features from the pool; the cascaded system has better classification performance than the original classifier.
The steps of the classification framework are as follows:
1) Collect the data on breast, urinary, male genital, and female genital cancers from the SEER database.
2) Combine the data and rearrange the order of the data.
3) Divide the data into training, validation, and testing sets.
4) Select optimal features for modeling according to Iinf-FS.
5) Apply random undersampling to balance the biased training set.
6) Apply the Linear-SVM classifier to the prediction.
7) Evaluate the predicted outcomes with the error criteria: accuracy, F-score, specificity, and sensitivity.
The procedures of the regression framework are as follows:
1) Remove cases from the classification data whose survival months are greater than 60.
2) Divide the data into training, validation, and testing sets.
3) Select optimal features for modeling according to UinfFS.
4) Apply the RF regressor to the prediction.
5) Evaluate the predicted outcomes with the error criteria: root mean squared error (RMSE), mean absolute error (MAE), and R^2 score.
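As a rough illustration of the two-stage flow above, the following sketch wires a Linear-SVM classifier to an RF regressor on synthetic stand-in data; all names, shapes, and values are illustrative placeholders, not the actual SEER features or study configuration:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the one-hot encoded SEER features.
X = rng.integers(0, 2, size=(1000, 50)).astype(float)
months = rng.integers(1, 120, size=1000).astype(float)
survived = (months > 60).astype(int)           # stage-1 label: 5-year survival

# Stage 1: classify 5-year survivability with a linear SVM.
clf = LinearSVC(dual=False).fit(X, survived)
pred_survived = clf.predict(X)

# Stage 2: for cases predicted 'death', regress the remaining survival months,
# training only on cases that died within 60 months.
mask = (pred_survived == 0) & (months <= 60)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[mask], months[mask])
pred_months = reg.predict(X[pred_survived == 0])
```

In a real pipeline, feature selection and undersampling would be applied between loading the data and fitting each stage, as in steps 4)-5) above.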

B. CREATING THE DATASETS
Breast, female genital, male urinary, and male genital cancer cases were extracted from the SEER data registries. Cases with 'positive histology', 'complete dates', and 'active follow-up' were chosen, while cases with 'autopsy only', 'death certificate only', 'unknown cause of death', and 'unknown stage' were excluded. Benign and in situ cancer patients were excluded since they have a higher likelihood of being cured and should be treated differently. There were 72,121 urinary cancer cases, 47,795 male genital cancer cases, 99,443 breast cancer cases, and 28,539 female genital cancer cases. Searching for the same patient ID in the urinary and male genital cancer datasets, 2,699 cases of male patients with multiple primary cancers were found. In the breast and female genital cancer datasets, 1,275 cases of female patients had the same patient ID. Among all 3,975 patients with cancer comorbidities, 1,062 lived more than 5 years after the diagnosis of the second primary cancer. The training and testing data were then separated at a ratio of 4:1. A total of 2,917 patients who lived for more than 60 months were excluded from the regression stage dataset. Figure 1 shows the distribution of the patients' survival months. Approximately one-fourth of the patients lived for more than 60 months after the diagnosis of the first incidence of cancer.
The SEER registry provides many attributes, but only a few are related to this research. Table 1 lists the attributes selected for this study; 25 features were selected for the base datasets.
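The comorbid cases described above are found by matching patient identifiers across the two single-cancer extracts. A minimal pandas sketch of that join, with a hypothetical `patient_id` column standing in for the SEER identifier and toy rows in place of the real registries:

```python
import pandas as pd

# Toy stand-ins for two single-cancer extracts.
urinary = pd.DataFrame({"patient_id": [1, 2, 3, 4],
                        "stage": ["II", "I", "III", "I"]})
male_genital = pd.DataFrame({"patient_id": [2, 4, 5],
                             "grade": ["2", "1", "3"]})

# Patients appearing in both registries are the comorbid (MPC) cases.
comorbid = urinary.merge(male_genital, on="patient_id", how="inner")
print(len(comorbid))  # 2 shared patients (IDs 2 and 4)
```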

C. IMPROVED INFINITE FEATURE SELECTION 1) THE IMPROVED SUPERVISED FEATURE SELECTION
Feature selection is the process of selecting a subset of the feature group. In machine learning applications, there is often a large number of features in the feature pool. Some features may be irrelevant to the prediction label, and some may be dependent on each other. Irrelevant and redundant features may lead to lengthy training times, overcomplicated models, and poor generalization. In this study, all the data are one-hot encoded: forty-four features become 1297-dimensional features consisting of only zeros and ones. Given the dimensionality of the data, feature selection is necessary.
Given a feature set F = {f(1), ..., f(n)}, an undirected fully connected graph G = (V, E) can be built. The vertices V correspond to the features, while the edges E represent the pairwise relations between features. G can be represented as an adjacency matrix A whose elements a_ij (1 ≤ i, j ≤ n) represent the pairwise relationship between features [28]. The elements are called pairwise probabilistic energy terms, defined as

a_ij = α σ_ij + (1 − α) c_ij, with σ_ij = max(d(i), d(j)),   (1)

where α ∈ [0, 1] and d(i) is a class-aware dispersion term computed from σ+_i, the normalized standard deviation of the elements of f(i) that belong to the positive class, σ−_i, the standard deviation of the elements of the same feature that belong to the negative class, and p(c+) and p(c−), the corresponding probabilities of each class. The first term d(i) measures the dispersion between the positive and negative classes. A large dispersion between classes implies a large distance between samples in different classes; therefore, to classify samples, the term d(i) should be maximized. The second term is c_ij = 1 − |corr|, with corr representing Spearman's correlation coefficient between f(i) and f(j).

Both α σ_ij and (1 − α) c_ij lie in [0, 1], so the two terms are comparable in magnitude, and a_ij jointly measures the dispersion and the correlation of a pair of features. Let γ = {v_0 = i, v_1, ..., v_l = j} denote a path of length l between nodes i and j through other features. Its energy can be estimated as the product of the edge weights along the path:

E(γ) = Π_{k=0..l−1} a_{v_k, v_{k+1}}.   (2)

To account for the energy of all paths of the same length l, let P^l_{i,j} be the set of all paths of length l between i and j; summing their energies is equivalent to taking the l-th power of the adjacency matrix:

Σ_{γ ∈ P^l_{i,j}} E(γ) = (A^l)_{i,j}.   (3)

A single feature's energy score can be estimated by summing over all paths of all possible lengths, letting l go to infinity. However, adding infinitely many powers A^l together may diverge, so a regulatory factor r is needed to ensure convergence:

S = Σ_{l=1..∞} r^l A^l, with 0 < r < 1/ρ(A),   (4)

where ρ(A) is the spectral radius of A. Through matrix algebra, the series can be solved efficiently in closed form:

S = (I − r A)^{−1} − I.   (5)

The final score of each feature is then obtained by summing the corresponding row of S:

s(i) = [S e]_i,   (6)

where e is a vector of ones [12]. This method not only considers the classification information of the features but also takes into account the positive/negative correlation between features and classes.
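The closed-form score computation can be sketched numerically. The snippet below assumes the adjacency matrix A has already been built (the 3-feature values are illustrative only) and applies the geometric-series solution S = (I − rA)^{-1} − I, capping the regulatory factor r below 1/ρ(A) so the series converges:

```python
import numpy as np

def infinite_fs_scores(A, r=0.9):
    """Rank features from adjacency A via the Inf-FS closed form
    S = (I - rA)^-1 - I; r is kept below 1/spectral_radius(A)."""
    n = A.shape[0]
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    r = min(r, 0.9 / rho)                      # regulatory factor ensures convergence
    S = np.linalg.inv(np.eye(n) - r * A) - np.eye(n)
    return S @ np.ones(n)                      # row sums: energy over all path lengths

# Toy 3-feature adjacency (illustrative values only).
A = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
scores = infinite_fs_scores(A)
ranking = np.argsort(scores)[::-1]             # highest-energy features first
```

Feature 2 is only weakly connected to the others, so it accumulates the least path energy and ranks last.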

2) THE UNSUPERVISED INFINITE FEATURE SELECTION
In the regression stage, the prediction target is continuous and does not belong to discrete classes; thus, an unsupervised feature selection method is needed. The Iinf-FS method from the previous subsection is supervised: its adjacency term has the same two components as Equation (1), the dispersion term σ_ij between classes and the correlation term c_ij between features. An unsupervised feature selection process provides no labels in the calculation. The correlation component requires no class labels, but the dispersion component does; thus, the dispersion component is removed to make the method unsupervised:

a_ij = c_ij.

The rest of the ranking calculation remains the same: repeating the path-energy and score computations, the scores of the features are calculated, and sorting the scores yields the final ranking.
The calculation of the ranking does not involve the prediction target, so the relevance of the reduced feature subset to the predicted label is not measured. To resolve this issue, an element-wise feature-dropping method is adopted to measure the importance of each feature. In each round, one feature component is excluded from the subset, and the root mean squared error (RMSE) of that round is compared with the predictive error obtained when all feature components are included. If the RMSE decreases, the feature component can be dropped. Features that rank high in the UinfFS process and whose dropping RMSE is high remain in the final feature subset. In this study, the top ten features are kept regardless of their dropping error; any other feature is dropped if the RMSE decreased when it was removed from the feature pool. Table 2 presents the feature contribution scores, the RMSE obtained when each feature is dropped from the feature group, and the feature selection result.
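A minimal sketch of the element-wise feature-dropping rule, on synthetic data where only two features carry signal; the data, model sizes, and helper names are illustrative, not the study's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 6))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)  # only features 0, 1 matter

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

def rmse_with(keep):
    """RMSE of an RF regressor trained on the kept feature columns."""
    model = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr[:, keep], ytr)
    return mean_squared_error(yte, model.predict(Xte[:, keep])) ** 0.5

feats = list(range(X.shape[1]))
base = rmse_with(feats)
# Keep a feature if dropping it raises the error; drop it otherwise.
selected = [f for f in feats
            if rmse_with([g for g in feats if g != f]) >= base]
```

Dropping either informative feature raises the RMSE, so both survive; uninformative features whose removal lowers the error are discarded.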

D. RANDOM UNDERSAMPLING FOR BIASED CLASS BALANCING
As mentioned in Section III.B, the classification data are biased such that only 1/4 of the cases are positive. When trained with biased data, the classifier would reflect the bias. Different methods can be used to deal with biased data [29]. The two most common are undersampling and oversampling. In oversampling, cases in the minority class are duplicated to balance the dataset. In undersampling, the bias in the dataset is reduced by removing samples from the majority class. Random undersampling is simple and has shown performance similar to that of more complex methods [29]. In this study, we use random undersampling to alleviate the bias in the dataset.
A random value between 0 and 1 is assigned to each training case. If the case belongs to the majority class and the randomly assigned value is greater than 0.3, this case is deleted from the training dataset.
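The rule above can be sketched as follows; the array names and label encoding are assumptions for illustration:

```python
import numpy as np

def random_undersample(X, y, majority_label=0, seed=0):
    """Draw a uniform value in [0, 1] per case and delete majority-class
    cases whose value exceeds 0.3; minority cases are always kept."""
    rng = np.random.default_rng(seed)
    u = rng.random(len(y))
    keep = (y != majority_label) | (u <= 0.3)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # biased 7:3 toward class 0
Xb, yb = random_undersample(X, y)
```

On average about 30% of the majority class survives, pulling the class ratio toward balance while leaving the minority class untouched.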

E. EXPERIMENTAL PROCEDURES
A total of 3,180 and 795 cases were included in the training and testing datasets, respectively, a dividing ratio of 4:1. When the features of the two cancers were combined, some were identical. Excluding identical features from the combined feature pool, 40 features were selected and transformed into one-hot encoded 1297-dimensional data consisting of zeros and ones only. In the classification stage, Iinf-FS reduced the data dimensionality, while the random undersampling process reduced the number of training cases. Linear-SVM was adopted as the classifier. The feature selection methods compared in the classification stage are CHI2 score-based feature selection, ECFS, and mutual information-based feature selection [13]. In the regression stage, patients who survived longer than 60 months were removed from the whole dataset. UinfFS was applied because its unsupervised nature is compatible with the regression process. The unsupervised feature selection methods compared include UDFS, NDFS [16], and PCA-based feature selection [17]. These comparison methods are also combined with the element-wise feature-dropping RMSE score: the top ten features are kept, and other features are dropped if the RMSE scores decrease when they are removed from the pool.
In each iteration, the training set is used to train the classifier, and the accuracy score on the testing set is stored for comparison.

F. ALGORITHMS APPLIED
The classification stage utilized Linear-SVM as the classifier. It constructs a separating hyperplane between two classes and then classifies samples based on their distance from it. Since the data are one-hot encoded, consisting of zeros and ones only, Linear-SVM satisfies the requirements for separating such data. It was cascaded with Iinf-FS and random undersampling. The regression stage utilized RF, a commonly used bagging regressor, cascaded with contribution score-based feature selection.

G. PERFORMANCE METRICS
The classification performance is quantified as recognition accuracy, specificity, sensitivity, and F-score, computed from the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts of the confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
F-score = 2TP / (2TP + FP + FN)

In the regression stage, the RMSE, MAE, and R^2 score measure the deviation between the predicted and actual survival months.
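For reference, the standard confusion-matrix metrics can be computed as below; the counts passed in are hypothetical, not values from Table 3:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used in the classification stage."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)               # recall on the positive class
    specificity = tn / (tn + fp)
    f_score = 2 * tp / (2 * tp + fp + fn)
    return accuracy, sensitivity, specificity, f_score

# Hypothetical counts for illustration only.
acc, sens, spec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```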

IV. RESULTS
The performance of both stages is presented in this section. As mentioned in the previous sections, the classification improvements consist of improved infinite feature selection (Iinf-FS) and random undersampling-based data balancing, and the one-stage regression-based classification performance is included for comparison. Table 3 is the confusion matrix for the one-stage regressor-based classifier, the two-stage original SVM classifier, the Iinf-FS-selected SVM classifier, and the SVM with both improving procedures. The one-stage regressor-based classifier predicts patient survival time; its outputs are then labeled 0 for patients whose predicted survival is less than 60 months and 1 for patients whose predicted survival time is greater than 5 years. Table 4 includes the classification performance metrics for our proposed two-stage Linear-SVM improved with Iinf-FS and data balancing, together with comparison methods: a one-stage RF regression-based classifier, a two-stage Linear-SVM improved with CHI2-based feature selection [30], a two-stage Linear-SVM with ECFS [31], and a two-stage Linear-SVM with mutual information-based feature selection [32]. Random undersampling-based data balancing is applied to all the two-stage classifiers. In addition to the general performance metrics (accuracy, specificity, and sensitivity), the number of selected features for each feature selection method, the number of training samples after undersampling, and the data preprocessing time are recorded and compared. Figure 2 contains the scatter plots for the original two-stage RF and for the UinfFS-, UDFS-, NDFS-, and PCA-based feature-selected RF models; the one-stage regression scatterplot is also included. The corresponding RMSE, MAE, and R^2 score values are in Table 5.
From the confusion matrix, it can be seen that the one-stage regressor-based classifier classifies the majority class well. However, for the minority of patients whose label is 1, it misclassified 80 of 212 patients as 0. Slightly better than the one-stage model, the unimproved two-stage Linear-SVM classifier identified 518 of 583 'dead' cases correctly. However, it did not handle the 'survived' cases well: 61 of 212 positive patients were mistakenly classified as 'dead'. This biased result was caused by the bias in the training set. Applying Iinf-FS improves the classification performance for both 'survived' and 'dead' patients: the number of misclassified cases decreases in both classes. The feature selection process improves the classification accuracy, while data balancing deals with the biased dataset. Data balancing has a negative effect on the majority (negative) class, as more negative patients are classified as positive, but the classification performance on the minority class improves significantly when random undersampling is applied. These two improving procedures solve the two problems in the classification stage; the final model is unbiased and has the best accuracy score. Table 4 confirms these observations. The one-stage model's performance is much weaker than that of the unimproved two-stage method: both its accuracy and sensitivity scores are lower. Iinf-FS was also compared with state-of-the-art feature selection methods. All of these methods can improve the classification performance; among them, ECFS and the proposed Iinf-FS had the best performance. After applying random undersampling-based data balancing to the training set, Iinf-FS obtained the best accuracy and sensitivity scores. ECFS was the second best, but its execution time was longer than that of Iinf-FS.

V. DISCUSSION
A. CLASSIFICATION STAGE
Comparing the proposed method's results with those of [5], the classification accuracy (86.42% for all patients) is much better than the results from [5] (77.8% for breast-female genital and 73.48% for male genital-urinary).

B. REGRESSION STAGE
In the regression stage, the RMSE, MAE, and R^2 score are calculated to compare models. To match the results of the one-stage regressor, patients whose actual survival times were greater than 60 months were excluded. Table 5 shows that all performance metrics of the one-stage regressor are worse than those of the two-stage method. In the previous subsection, the classification performance of the one-stage model was also weaker than that of the two-stage model. These results confirm the necessity of the proposed two-stage method. As shown in Table 5, the proposed UinfFS method is compared with state-of-the-art unsupervised feature selection methods. The UinfFS two-stage regressor has the lowest RMSE and MAE and the highest R^2. The method is also easy to interpret: it reduces the dimensionality of the data before the one-hot encoding procedure, so the deleted features and the remaining features can be named and visualized.
In Figure 2, the UinfFS method has the fewest outliers and shows linearity. The UDFS and NDFS methods have similar scatterplots but more outliers. The PCA-based feature selection method has the weakest performance among the comparison methods. The one-stage method's scatterplot is noisier than that of the two-stage method. The RMSE values in Table 5 confirm these observations.

VI. CONCLUSION
Male genital-urinary and breast-female genital cancers are common MPCs. Predicting the survivability of MPC patients can help doctors, patients, and families. However, researchers have paid very little attention to MPC patient survival prediction. Thus, the survival prediction of MPC patients has become essential in cancer studies.
The proposed two-stage model provides more information than survivability alone. The first stage predicts patients' five-year survivability. If the prediction is 'death', which is common in comorbid cancer cases, the second stage predicts the remaining lifespan of the patient.
In this research, male genital-urinary and breast-female genital cases were identified and labeled to study survival prediction. In the classification stage, Iinf-FS was carried out, and random undersampling-based data balancing was applied. In the regression stage, UinfFS was applied in the feature selection part. The results indicate that each proposed step improves the predictive power.
This study focused on genital MPCs and proposed a novel feature selection method. The feature selection can be further improved by studying inter-class and intra-class dispersions. In the future, we will continue studying feature selection methods that may improve the current prediction performance. We may also study other MPCs, such as second primary lung cancers.
PENG LIU (Member, IEEE) was born in 1990. She received the bachelor's degree from the Swanson School of Engineering, University of Pittsburgh, in 2012, and the master's degree from the School of Electrical Engineering, University of Washington, in 2014. She is currently pursuing the Ph.D. degree with Southeast University. Her research interests include data mining, medical data analysis, medical image analysis, machine learning algorithms, and deep learning.

SHUMIN FEI (Member, IEEE) was born in 1961.
He received the Ph.D. degree from Beihang University, Beijing, China, in 1995. From 1995 to 1997, he conducted postdoctoral research at the Research Institute of Automation, Southeast University, Nanjing, China, where he is currently a Professor. His research interests include the analysis and synthesis of nonlinear systems, robust control, adaptive control, and the analysis and synthesis of time delay systems.