Prediction of Second Primary Lung Cancer Patient’s Survivability Based on Improved Eigenvector Centrality-Based Feature Selection

Modeling the survival of second primary lung cancer (SPLC) patients has both theoretical significance and practical value. Cancer survivability prediction may provide advice for better clinical decisions and personalized medicine. The Surveillance, Epidemiology, and End Results (SEER) program provides large data sets for analysis with machine learning methods. SPLC cases are identified and labeled from the SEER database; the data set is then preprocessed with improved eigenvector centrality-based feature selection (IECFS). The IECFS method utilizes interclass and intraclass dispersions in its ranking criteria. The value of the $\alpha$ parameter and the number of selected features are adjusted to achieve the best performance. The experiment uses five-fold cross-validation. The method yields a prediction accuracy of 90.998% for five-year survivability, which is higher than the original classification accuracy (89.16%) and the accuracies of the other state-of-the-art feature selection methods. For three-year survivability, the proposed method yields a prediction accuracy of 83.16%, slightly outperforming all of the compared methods. The method is effective and generalizable.


I. INTRODUCTION
In the past 70 years, cancer prognosis has improved markedly due to the promotion of cancer screening, the development of medical technologies, and advances in supportive care. In the United States, the 5-year relative survival in 2016 was estimated to be 70%, twice as high as that in the 1950s [1]. The number of cancer survivors in the United States will grow from 16.9 million in 2019 to a projected 22.1 million in 2030, accounting for 5% of the total population [2]. Due to the improved prognosis as well as an aging population, multiple primary cancer (MPC) diagnoses for the same person are increasingly common. The risk of cancer survivors developing a second primary cancer was estimated to be 14% higher than that of the general population [3]. In the United States, one in five cancers diagnosed today occurs in an individual with a previous history of cancer [4]. On the other hand, MPC is much harder to treat due to a narrower range of options. For example, MPC patients may have previously received a maximum lifetime dose of certain chemotherapy, or the same part of the body may have undergone radiotherapy because of previous cancers [4], [5]. The prevalence and limited treatment options have made MPC an important issue for research, clinical treatment, and public health.
The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Hao Chen.
Second primary lung cancer (SPLC) has been the most common MPC, representing 25% of second primary malignancies [6]. In the Surveillance, Epidemiology, and End Results (SEER) program between 1992 and 2008, 1,450,837 non-pulmonary cancer survivors were identified, among whom 25,472 developed SPLC at a mean (standard deviation) follow-up of 5.7 (3.6) years. More than half (57%) of patients with SPLC died of the disease [6]. SPLC ranked second only to the same-site MPC in cases of prostate cancer and female breast cancer, the most common cancers among men and women, respectively [4]. The relatively high prevalence of SPLC is ascribed to the risk factors associated with MPC.
Most SPLC studies have focused on predicting the risk that initial primary lung cancer patients will develop SPLC. Other MPC survival prediction studies have been limited to genital MPCs and applied statistical methods to improve the prediction accuracy. Research on survival prediction for diagnosed SPLC patients has been lacking. Cancer survival prediction may provide guidance for better clinical decisions and personalized medicine. Although SPLC is the most common multiple primary cancer, few researchers have focused on the survival prediction of multiple primary cancer patients. Thus, prediction of SPLC survival has become essential in cancer studies. Survivability often refers to the likelihood of a patient being alive five years after the time of cancer diagnosis. It is an indicator used in medical science to evaluate treatment effectiveness. The method proposed in this paper predicts the five-year and three-year survivability of SPLC patients. The novel IECFS method is applied to select features and to improve prediction accuracy.
The main contributions of this article are as follows:
• Identify and label the SPLC cases from the SEER database and study their survival prediction;
• Apply improved eigenvector centrality feature selection;
• Compare prediction results with different numbers of features and values of α;
• Compare prediction results with different feature selection methods; and
• Compare prediction results in different folds.
The average 5-year prediction accuracy of the proposed method is 90.998%, higher than the original method's accuracy (89.16%) and the results obtained by the compared FS methods. The 3-year prediction accuracy is 83.16%, higher than the original prediction accuracy (81.07%).
The remainder of this article is organized as follows. Section II introduces the related work on SPLC, feature selection, and the application of machine learning techniques to cancer survival prediction. Section III provides detailed methods and experimental procedures. Section IV presents experimental results, while Section V presents the discussion of the results. Section VI concludes the paper and presents possible future research directions.

II. RELATED WORK
MPC has been studied in epidemiology. The most important category of risk factors for MPC is lifestyle factors, such as tobacco smoking and alcohol consumption. According to nine SEER registries, an estimated one-third of MPCs occurred in tobacco- and alcohol-related sites between 1975 and 2000 [3], [4]. Tobacco smoking is classified as a Group 1 human carcinogen by the International Agency for Research on Cancer (IARC) [7]. It is the leading risk factor for lung cancer at large [8]. Tobacco smoking is linked to approximately 80-90% of lung cancer deaths [9]. The other most important type of risk factor for MPC is prior cancer treatment, such as chemotherapy and radiotherapy. Nested case-control studies among European and North American populations reported that larger numbers of chemotherapy cycles with alkylating agents elevated the risk of lung cancer among Hodgkin lymphoma survivors [10]. It was suggested that 8% of MPC is due to radiotherapy [11]. Prior chemotherapy was also observed to add to the increased risk of second primary lung cancer caused by previous radiotherapy among survivors of multiple types of cancer [10], [12], [13].
The analysis of big data in health care and medical fields has immense potential for improving the quality of care, reducing medical waste, and reducing the burden of care [14]. Machine learning techniques have been widely applied to medical big data to predict outcomes [15]-[18]. Liu et al. applied an improved clustering algorithm for sample cutting to improve the category representation capability of the training samples. The experimental results indicated that the improved method increases classification efficiency [19]. Ensemble learning methods that train a number of weak base learners and then combine their outputs are popular in medical prediction research [20]. Many researchers conducted their studies on cases collected from the SEER database. A Gaussian k-based naive Bayes (NB) classifier system was proposed by Kaviarasi et al. [21] to enhance the classification accuracy of the NB classifier and linear regression algorithm. They proposed an online gradient boosting learning with adaptive linear regressor and compared its performance with state-of-the-art machine learning algorithms.
Some researchers compared machine learning techniques with statistical methods to predict survivability for spinal ependymoma patients. They discovered that lower grade histology and greater extent of surgical resection were the key prognostic factors and concluded that therapeutic factors are associated with improved overall survival. Machine learning methods are generally better for prediction, but the data set was heterogeneous and complex with numerous missing values [22]. Several recently published papers on breast cancer survival prediction were analyzed together for application to stage-specific prediction tasks. Stage-specific prediction models and joint models were created and compared. It was concluded that data-driven knowledge obtained with machine learning methods must be validated over time prior to its clinical and professional application [23]. Roffo et al. proposed several feature selection methods. Reference [24] introduced an infinite feature selection method exploiting the convergence properties of power series of matrices. Spearman's rank correlation coefficient and the standard deviation were combined and utilized. Replacing the standard deviation measure with dispersion criteria and applying the improved method to multiple primary cancers, we proposed a two-stage prediction method [25]. In 2017, Roffo et al. proposed a feature selection method via eigenvector centrality [26]. They built a graph to measure class separability based on mutual information and standard deviation. Later, they proposed a new infinite latent feature selection method [27].
Zolbanin et al. used cancer data collected from the SEER database to create two comorbid data sets: one for breast and female genital cancers and another for prostate and urinary cancers. Several popular machine learning techniques were then applied to the resultant data sets to build predictive models. The results showed that the availability of more information about the comorbid conditions of patients improved the predictive power of the model [28]. In addition to Zolbanin's study, Naghizadeh et al. also investigated comorbid cancer patients, focusing on data preprocessing, including feature selection and data cleaning. The feature selection procedure was performed by applying the least angle regression, least absolute shrinkage and selection operator, and stepwise regression algorithms. They compared the performance of four machine learning techniques for survivability prediction of prostate cancer. It was found that the neural network outperformed the decision tree, naive Bayes, and support vector machine. The accuracy was increased, and the error rate was decreased [29]. In this study, they investigated the survivability for female and male MPCs. However, they did not discuss the number of features selected in their article. In our previous research on MPC patients' survivability, 150 features were selected. Currently there is no survival prediction model for patients with SPLC. Reference [30] estimated the trends in 5-year incidence of metachronous SPLC and established a risk prediction model to identify candidates with high SPLC risks. Reference [31] estimated the 10-year risk of developing SPLC among the survivors of initial primary lung cancer (IPLC). This paper will study feature selection through grid searching in the Results and Discussion Section.

III. MATERIALS AND METHODS
Cancer survivability prediction has been challenging due to the lack of publicly available large-scale medical data.
SEER is a publicly accessible database that provides de-identified, coded, and annotated information on cancers in the United States [6], [7]. The scale of the data is large enough for analysis. To predict the 5-year survival rate among SPLC patients, non-pulmonary cancer survivors with lung cancer as the second primary cancer were selected from the SEER (Incidence - SEER 18 Regs Research Data, Nov 2018 Sub) database. Lung cancer survivors were excluded from the study because of their relatively poor prognosis. (The lung cancer 5-year survival rate is 21.2%, lower than that of most other cancer types [32].) The cancer diagnoses in SEER cancer registries were all, by law, verified clinically or microscopically by a recognized medical practitioner [33]. The diagnosis of non-pulmonary cancers preceded the lung cancer diagnosis for each individual subject in this study. Figure 1 is the flowchart of the proposed framework. The steps are as follows:
1) Collect the data on non-pulmonary cancers and pulmonary cancer from the SEER database;
2) Combine and label the data to create the SPLC data set, and determine the survival rate;
3) Divide the data into 5 folds of the same size, and repeat (4)-(6) five times so that each portion has been used as the testing set;
4) Select optimal features for modeling according to IECFS;
5) Apply linear SVM as the classifier;
6) Record the predicted outcomes with the following performance metrics: accuracy, sensitivity, specificity, NPV, and AUC, and go to the next fold;
7) Calculate the average of the performance metrics; and
8) Compare the averaged metrics for different α values and different numbers of features.
The case numbers are shown on the left side of Figure 1. After the SPLC data set is created, the 6422 patients are divided into five subgroups of similar sizes. In the first fold, the first four subgroups are used as the training set and the fifth is used as the testing set. After the data pass through IECFS and the SVM classifier, the test results are recorded. This process is repeated five times so that all of the subgroups are tested and recorded. The final results are obtained by averaging over the five folds. The optimal results can be found by adjusting α and the number of features for IECFS.
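The fold rotation in steps 3) to 7) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' MATLAB implementation; the function names and the random seed are assumptions made for the example, and the feature selection and classifier steps are left out.

```python
import random


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and split them into n_folds near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def train_test_splits(folds):
    """Yield (train, test) index lists so that each fold is the test set once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


folds = five_fold_indices(6422)          # 6422 SPLC cases, as in the text
splits = list(train_test_splits(folds))
fold_sizes = [len(fold) for fold in folds]
```

With 6422 cases the folds cannot be exactly equal, so the sketch produces folds of 1285 or 1284 cases; the per-fold metrics would then be averaged as in step 7).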

A. DATA ACQUISITION
The clinical data are collected from the SEER database. The SEER program collects cancer data throughout the United States with the ultimate goal of reducing the cancer burden. SEER*Stat is software developed to provide easy access to the data for analysis [6].
Non-pulmonary cancer cases are extracted from the SEER database first. Cases with 'positive histology', 'complete dates', and 'active follow-up' are chosen, while cases with 'autopsy only', 'death certificate only', 'unknown cause of death', and 'unknown stage' are excluded. Benign and in situ cancer patients are excluded since they have a much greater chance of being cured and should be treated differently. Non-epithelial skin cancer patients are excluded due to their low mortality rate. Cases diagnosed after 2014 are excluded in order to predict the study participants' five-year survivability. Pulmonary cases are extracted from the SEER database to form a second data set. Most of the selection criteria are the same except that only pulmonary cancer patients are chosen.
The SEER database provides many attributes. Some of the attributes are similar to each other, while some are unrelated to this research. Table 1 lists the key attributes selected. In this research, 26 features are selected in both the non-pulmonary data set and the pulmonary data set. Researchers have previously studied survival time prediction for lung cancer patients [34]. They selected 19 features from the SEER database. This study investigates survivability prediction for SPLC patients. Of the features in [34], 18 were selected, and 8 more were added. The only feature not selected is Radiation, which is no longer available in the database. The added 8 features include: patient ID, marital status, state-county, behavior, race, year of diagnosis, and sex. The patient ID feature is added to select the SPLC patients from the SEER database. Marital status, state-county, behavior, race, and sex have been shown to be related to the patients' prognosis [6]. The year of diagnosis is very important since cancer prognosis has improved markedly and rapidly [1]. Most features are discrete and the others are continuous. Table 1 includes the chosen features and a brief description of each feature. Discrete features are one-hot encoded, and the continuous features remain unchanged prior to feature selection.
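The encoding step described above can be illustrated with a small sketch. It assumes records stored as Python dictionaries; the field names (`sex`, `marital_status`, `age`) and values are hypothetical stand-ins and do not reproduce the SEER coding scheme.

```python
def one_hot_encode(records, discrete_keys, continuous_keys):
    """One-hot encode discrete features and pass continuous ones through."""
    # Collect the sorted set of observed categories for each discrete feature.
    categories = {
        key: sorted({rec[key] for rec in records}) for key in discrete_keys
    }
    column_names = [
        f"{key}={value}" for key in discrete_keys for value in categories[key]
    ]
    column_names += list(continuous_keys)

    rows = []
    for rec in records:
        row = [
            1.0 if rec[key] == value else 0.0
            for key in discrete_keys
            for value in categories[key]
        ]
        row += [float(rec[key]) for key in continuous_keys]
        rows.append(row)
    return column_names, rows


# Hypothetical toy records, not SEER data.
records = [
    {"sex": "F", "marital_status": "married", "age": 67},
    {"sex": "M", "marital_status": "single", "age": 72},
    {"sex": "F", "marital_status": "single", "age": 58},
]
columns, encoded = one_hot_encode(records, ["sex", "marital_status"], ["age"])
```

Each discrete feature expands into one binary column per observed category, which is why a few dozen features can become a much higher-dimensional input.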

B. COMBINING DATA FOR SPLC CASES
A total of 413,138 patients are identified with first incident cancers diagnosed before 2014 that meet the inclusion criteria. Then, 88,569 patients with lung cancer are included in the second data set. To find patients with SPLC, the records with the same IDs are extracted from the two sets of data. If lung cancer is diagnosed after the first incident cancer, the record is considered an SPLC case and is added to the final data set. After patient IDs and survival months are dropped, the remaining chosen features, including marital status, sex, age, race, and state-county, are kept for the feature selection described in the next subsection. Survival in months since the diagnosis of second primary lung cancer is considered to be the patient's survival time. The survival time field is converted into the five-year and three-year survivability labels to be predicted. Patients who lived over 60 months are labeled as 1, while the others are labeled as 0 in five-year survivability prediction. In three-year survivability prediction, the 1 label is given to patients who lived over 36 months.
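The matching and labeling rules above can be sketched as follows. The record layout is an assumption made for illustration: each data set is a dictionary keyed by patient ID, `diagnosed` is a hypothetical (year, month) tuple, and `survival_months` is counted from the lung cancer diagnosis.

```python
def label_splc_cases(first_cancers, lung_cancers):
    """Match patients across the two data sets and derive survivability labels."""
    cases = []
    for patient_id, lung in lung_cancers.items():
        first = first_cancers.get(patient_id)
        # Keep only patients present in both sets whose lung cancer came later.
        if first is None or lung["diagnosed"] <= first["diagnosed"]:
            continue
        months = lung["survival_months"]
        cases.append({
            "patient_id": patient_id,
            "five_year": 1 if months > 60 else 0,   # lived over 60 months
            "three_year": 1 if months > 36 else 0,  # lived over 36 months
        })
    return cases


# Hypothetical toy records, not SEER data.
first_cancers = {
    "A": {"diagnosed": (2001, 3)},
    "B": {"diagnosed": (2005, 7)},
    "C": {"diagnosed": (2010, 1)},
}
lung_cancers = {
    "A": {"diagnosed": (2006, 5), "survival_months": 72},
    "B": {"diagnosed": (2004, 2), "survival_months": 40},  # lung cancer came first
    "C": {"diagnosed": (2011, 9), "survival_months": 40},
    "D": {"diagnosed": (2012, 4), "survival_months": 10},  # no prior cancer record
}
cases = label_splc_cases(first_cancers, lung_cancers)
```

In this toy example only patients A and C qualify as SPLC cases; A is positive for both horizons, while C is positive only for the three-year label.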
A total of 6,422 cases remain after the selection process described above. Figures 2 and 3 and Table 2 display the distribution of the patients' survival time. About one-tenth of the patients lived more than 60 months since the diagnosis of the first incident cancer. About one quarter of the patients lived more than 36 months.

C. FEATURE SELECTION ACCORDING TO THE IMPROVED EIGENVECTOR CENTRALITY FEATURE SELECTION
Feature selection, also called feature subset selection or attribute selection, is the process of selecting a subset from the feature group to improve model performance. In machine learning applications, a large number of features is often present. Some features may be irrelevant to the label, and some may be dependent on each other. The irrelevant and redundant features may lead to lengthy training time, an over-complicated model, and low generalization. The original ECFS model jointly considers the variation of the features (maximum standard deviation over two features) and the relation of the two features to the class (mutual information). This method ranks each feature $f_j$ according to its score $s_j$, which serves as the priority of the feature to be selected. In the actual construction of the model, n features can be selected from the top to the bottom by priority.
Specifically, given a training set $F$ represented as $F = \{f^{(1)}, \ldots, f^{(n)}\}$, an undirected fully connected graph $G = (V, E)$ can be built. The vertices $V$ correspond to all features, while the edges $E$ represent the pairwise relations between features. $G$ can be represented as an adjacency matrix $A$ whose elements $a_{ij}$ ($1 \le i, j \le n$) represent the pairwise relationship between features [24]. The elements are called pairwise probabilistic energy terms. In ECFS, each $a_{ij}$ combines $k_{ij}$, the mutual information part, with $\sigma_{ij}$, the maximum of the standard deviations of the two features. Next, ECFS configures the priority for each feature by quantifying the path probability passing through a feature node. To measure the discriminative power of a single node, all possible paths that go through the node must be considered. Let $\gamma_{ij} = \{v_0 = i, v_1, \ldots, v_l = j\}$ denote a path of length $l$ between nodes $i$ and $j$ through other features. The probability can then be estimated using Eqs. (3)-(5):

$$P(\gamma_{ij}) = \prod_{k=0}^{l-1} a_{v_k, v_{k+1}} \tag{3}$$

To account for the energy of all possible paths of length $l$, $P^{l}_{i,j}$ is defined as the set of all paths of length $l$ between $i$ and $j$, and the energy is accumulated as:

$$C_l(i, j) = \sum_{\gamma \in P^{l}_{i,j}} P(\gamma) \tag{4}$$

which is equivalent to:

$$C_l(i, j) = A^{l}(i, j) \tag{5}$$

It was proven that as $l$ approaches a large number $L$, $A^{l} e$ converges to $v_0$ [26]:

$$\lim_{l \to L} A^{l} e = v_0 \tag{6}$$

The eigenvalues $\lambda$ of $A$ satisfy the characteristic equation:

$$|\lambda I - A| = \begin{vmatrix} \lambda - a_{11} & -a_{12} & \cdots & -a_{1n} \\ -a_{21} & \lambda - a_{22} & \cdots & -a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -a_{n1} & -a_{n2} & \cdots & \lambda - a_{nn} \end{vmatrix} = 0 \tag{7}$$

The eigenvectors can then be calculated by solving Eq. (8):

$$(\lambda_i I - A)\, v_i = 0 \tag{8}$$

The absolute value of $v_{ij}$ represents the score of the $j$-th node to the $i$-th node, where $v_{ij}$ denotes the $j$-th element of the eigenvector $v_i$, $i = 1, \ldots, n$. The score of the $j$-th node is computed from these values.

The original ECFS method uses standard deviation and mutual information to measure the complexity of two features and their relationship. However, these two parts are not comprehensive. We introduce our IECFS method, in which the matrix $A$ is different. The adjacency matrix $A$ of the graph $G$, with elements $a_{ij}$, is defined in terms of $D_b$ and $D_w$, which represent the interclass dispersion and the intraclass dispersion, respectively.
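The convergence of $A^{l} e$ toward the leading eigenvector can be illustrated with a short power-iteration sketch. It is a generic illustration under stated assumptions, not the authors' implementation: the 3-by-3 matrix holds arbitrary toy weights rather than ECFS or IECFS energy terms, $e$ is taken to be the all-ones vector, and the vector is renormalized at each step so that only its direction is tracked.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def leading_eigenvector(matrix, n_iter=1000, tol=1e-12):
    """Approximate the leading eigenvector by repeatedly applying the matrix."""
    v = [1.0] * len(matrix)  # start from the all-ones vector e
    for _ in range(n_iter):
        w = mat_vec(matrix, v)
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]  # keep the direction, drop the scale
        converged = max(abs(a - b) for a, b in zip(w, v)) < tol
        v = w
        if converged:
            break
    return v


def rank_features(matrix):
    """Rank feature indices by the absolute value of their eigenvector entry."""
    scores = [abs(x) for x in leading_eigenvector(matrix)]
    ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranking, scores


# Toy symmetric adjacency matrix with hypothetical pairwise weights.
A = [
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.5],
    [0.3, 0.5, 0.0],
]
ranking, scores = rank_features(A)
```

In this toy graph, feature 1 has the strongest connections and therefore the largest centrality, followed by feature 0 and then feature 2.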
$D_b$ represents the interclass dispersion, which is the difference between the samples in two classes. $D_w$ is the intraclass dispersion, representing the variance of the samples in the same class. $D_t$ is the total dispersion, equivalent to the sum of $D_b$ and $D_w$. The dispersion between cases ($D_t$) reflects the average difference between the samples. A large interclass dispersion ($D_b$) reflects a large difference between different classes, while a large intraclass dispersion ($D_w$) reflects a large variance between the samples in the same class. The intraclass and interclass dispersions represent the samples' separability in two aspects.
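The relation $D_t = D_b + D_w$ can be checked numerically. The sketch below assumes the usual sum-of-squares definitions for a single feature (squared deviations from the class means for $D_w$, size-weighted squared deviations of the class means from the overall mean for $D_b$); the text does not spell out the exact formulas used in IECFS, so these definitions are an assumption, and the values are toy data.

```python
def dispersions(values, labels):
    """Return (D_b, D_w, D_t) for one feature under sum-of-squares definitions."""
    overall_mean = sum(values) / len(values)

    # Group the feature values by class label.
    groups = {}
    for x, y in zip(values, labels):
        groups.setdefault(y, []).append(x)

    d_b = 0.0  # interclass: spread of class means around the overall mean
    d_w = 0.0  # intraclass: spread of samples around their own class mean
    for xs in groups.values():
        class_mean = sum(xs) / len(xs)
        d_b += len(xs) * (class_mean - overall_mean) ** 2
        d_w += sum((x - class_mean) ** 2 for x in xs)

    d_t = sum((x - overall_mean) ** 2 for x in values)  # total dispersion
    return d_b, d_w, d_t


# Toy feature that separates the two classes well.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
d_b, d_w, d_t = dispersions(values, labels)
```

For this well-separated toy feature the interclass term dominates (54 versus 4, total 58), which is the situation in which a feature is useful for classification.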
α ∈ [0, 1] is a hyperparameter that balances the importance of D w and D b . It can be adjusted to improve classification performance.
To optimize the results of IECFS, we need to find the best-performing combination of α and number of features. The optimal combination was found through grid searching. The accuracies and AUCs for all α ∈ [0.1, 0.9] in steps of 0.05 and numbers of features n ∈ {100, 150, 200} were calculated and are listed in Tables 3 and 4. The elements in matrix A represent the discriminative power when the i-th and j-th features are jointly considered.
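The grid described above contains 17 values of α and 3 feature counts, or 51 combinations. A minimal sketch of the search loop follows; `toy_evaluate` is a hypothetical stand-in for the cross-validated accuracy of IECFS followed by the SVM, and its peak location is arbitrary and does not reflect the reported results.

```python
def alpha_grid(start=0.1, stop=0.9, step=0.05):
    """Return alpha values from start to stop (inclusive) at the given step."""
    count = int(round((stop - start) / step)) + 1
    # Rounding avoids floating-point drift such as 0.15000000000000002.
    return [round(start + k * step, 2) for k in range(count)]


def grid_search(evaluate, alphas, feature_counts):
    """Return (best_accuracy, alpha, n_features) over the full grid."""
    best = None
    for alpha in alphas:
        for n_features in feature_counts:
            accuracy = evaluate(alpha, n_features)
            if best is None or accuracy > best[0]:
                best = (accuracy, alpha, n_features)
    return best


def toy_evaluate(alpha, n_features):
    """Hypothetical score surface peaking at alpha = 0.5 and 150 features."""
    return 1.0 - abs(alpha - 0.5) - abs(n_features - 150) / 1000.0


alphas = alpha_grid()
best = grid_search(toy_evaluate, alphas, [100, 150, 200])
```

In practice `evaluate` would run the full five-fold procedure for each combination, so the cost of the search grows with the product of the two grid sizes.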

D. EXPERIMENTAL PROCEDURES
The data are then randomly divided into five subgroups. Four of the five subgroups are used as the training set, and the remaining subgroup is used as the testing set. The experiment is repeated five times so that each subgroup has been used as the test data. The training and testing data sets included 5138 and 1284 observations, respectively. When the features of the two cancers were combined, some features were the same. After these features were excluded from the feature pool, 40 features were kept and transformed into one-hot encoded 1687-dimensional data of zeros and ones. IECFS then reduced the data dimensionality. Different values of α were adopted for comparison. Linear SVM was adopted as the classifier. The compared feature selection methods in the classification stage are as follows: mutual information-based feature selection, pairwise correlation-based feature selection, and the original ECFS [35].

E. PERFORMANCE METRICS
The classification accuracy is quantified as recognition accuracy, precision, and recall. The formulas for the metrics reported in this study are as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

$$\text{NPV} = \frac{TN}{TN + FN}$$

True positive (TP) is the number of patients who lived longer than 60 months and were predicted to do so. True negative (TN) is the number of patients who did not survive up to 60 months and whose prediction was also negative. False positive (FP) is the number of patients who did not live up to 60 months but were predicted to be positive, and false negative (FN) is the number of patients whose labels are 1 but who are predicted to be 0. Accuracy is the ratio of all correctly predicted cases to the total sample. Specificity measures the ability of the classifier to predict negative cases and is the fraction of correct negative predictions over all negative samples. Sensitivity measures the classifier's performance for positive cases and is the fraction of correct positive predictions over all positive samples. These two metrics are commonly applied to medical classifiers. NPV is the abbreviation for negative predictive value. It reflects the probability that a predicted negative is a true negative [36].
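The metrics defined above can be computed from the confusion counts as in the following sketch. The labels and predictions are toy values chosen for illustration; AUC is omitted because it requires the classifier's continuous scores rather than hard predictions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, and NPV from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # correct positives / all positives
        "specificity": tn / (tn + fp),   # correct negatives / all negatives
        "npv": tn / (tn + fn),           # true negatives / predicted negatives
    }


# Toy labels: 1 = survived past the horizon, 0 = did not.
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

The sketch assumes every denominator is nonzero; with a heavily imbalanced test fold, a guard for empty classes would be needed.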

F. SIMULATION SETUP
The proposed method is implemented in MATLAB 2015b. The operating system is 64-bit Windows 10. The machine has 16 GB of RAM, and the processor is an Intel(R) Core(TM) i7-6700HQ CPU @2.60GHz 2.59 GHz. The compared feature selection methods can be found in [27].
The classifier adopted is a support vector machine. The MATLAB fitcsvm function with the default sequential minimal optimization (SMO) algorithm is adopted.

IV. RESULTS
In the feature selection step, each feature is assigned a score. The features are then ranked according to the score. The feature with the highest score ranks first and the feature with the lowest score ranks last. The features are then selected according to the ranking: if N features are to be selected, the top N features are chosen. Tables 3 and 4 include the grid-searching results for the best combination of α and number of selected features. The α values ranging from 0.1 to 0.9 are tested. The numbers of selected features are 100, 150, and 200. The best combinations are marked in bold in the tables. Tables 5 and 6 contain the confusion matrix entries (TP, TN, FP, and FN) and the performance metrics, including accuracy, specificity, sensitivity, the negative predictive value (NPV), and the area under the curve (AUC). The best metrics for the same number of selected features are also marked in bold. In addition to the proposed improved eigenvector centrality-based feature selection method (IECFS), the compared feature selection methods are mutual information-based feature selection (MuteInf) [37], correlation-based feature selection (CFS), and the original eigenvector centrality-based feature selection (ECFS) [26]. The AUC values do not always track the accuracy scores. In some cases, the combination with high accuracy has a low AUC.
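The rank-then-truncate step can be sketched in a few lines. The scores and the data matrix below are toy values, and the function names are chosen for this example only.

```python
def select_top_features(scores, n_selected):
    """Return the indices of the n_selected highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:n_selected]


def keep_columns(rows, selected):
    """Reduce each sample to the selected feature columns, in ranked order."""
    return [[row[j] for j in selected] for row in rows]


# Toy scores for four features and two one-hot encoded samples.
scores = [0.12, 0.47, 0.05, 0.31]
rows = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
selected = select_top_features(scores, 2)
reduced = keep_columns(rows, selected)
```

Here features 1 and 3 have the two highest scores, so each sample is reduced to those two columns before it is passed to the classifier.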
After the best α and N are selected, the performance of the different feature selection methods is compared in Tables 5 and 6. The proposed IECFS method has the best accuracy scores among all of the feature selection methods. The improvement is not significant. This is caused by the five-fold averaging process.
The results for all five folds are recorded and averaged for the comparison. IECFS has the best metrics except for specificity. CFS has the poorest prediction but the best specificity. The three-year survival prediction also shows that IECFS is the most suitable feature selection method. Interestingly, when 100 features are selected, the ECFS method achieves a very good prediction result that is only slightly worse than that of IECFS. The improvement for three-year survivability is not significant, but it is consistent in every fold. The proposed IECFS method achieves the best accuracy score in all folds. This proves the robustness of our model.
The comparison of the feature selection methods shows that the mutual information-based and correlation-based feature selection methods have poorer prediction outcomes. However, when the number of selected features increases, their performance improves. When only 100 features are selected, the MuteInf method performs worse than the original prediction. As the number of features increases, it performs better than the original method. The performance of CFS also improves with an increasing number of features. However, when 250 features are selected, it is still worse than the original method. The ECFS and IECFS methods behave differently. In 3-year prediction, these methods reach their highest accuracy, across the numbers of features tested, when 100 features are selected. In 5-year prediction, they have their poorest accuracy when 250 features are selected. This is caused by their ranking criteria. Figures 4 and 5 show the ROC curves for one of the five folds for three-year and five-year survival prediction. Three of the curves are close to each other. An examination of the data presented in Tables 5 and 6 shows that the AUC values of mutual information-based feature selection, ECFS, and IECFS are close to each other.
As mentioned in the previous sections, the test uses five-fold cross-validation. The average performance metrics are calculated and compared. The survivability prediction results for the five folds are also plotted in Figures 6 and 7. Comparing Figure 6 with Figure 7, we observe that IECFS displays a larger improvement in prediction for five-year survivability in all folds. The three-year survivability prediction has a smaller improvement, but IECFS still outperforms the other methods in all folds.

VI. CONCLUSION
SPLC is the most common MPC. Predicting the survivability of SPLC patients can help the doctors, patients, and families. However, few researchers have studied the survival prediction of MPC patients. Thus, the prediction of SPLC survival rate has become essential in cancer studies. In this research, SPLC cases were identified and labeled to study survival prediction. The proposed IECFS method outperforms the state-of-the-art feature selection methods.
The IECFS method outperforms the compared methods in all five folds, proving that the IECFS method is robust and generalizable. The improvement of the IECFS method over the original ECFS method is moderate, but consistent. The IECFS method outperforms the ECFS method for a wide range of numbers of features.
This study focused on SPLC and proposed a novel IECFS method. We did not apply any data balancing or data cleaning methods because these methods introduce randomness into the prediction. In future work, we will consider utilizing these methods. Feature selection can be further improved by jointly considering multiple feature selection methods through statistical voting. In the future, we may also study the survivability of other MPCs. We may also study the risk of developing MPCs after the initial primary cancers.
PENG LIU (Member, IEEE) was born in 1990. She received the bachelor's degree from the Swanson School of Engineering, University of Pittsburgh, in 2012, and the master's degree from the School of Electrical Engineering, University of Washington, in 2014. She is currently pursuing the Ph.D. degree with Southeast University. Her research interests include data mining, medical data analysis, medical image analysis, machine learning algorithms, and deep learning.
KEXIN JIN is currently pursuing the Ph.D. degree with the Department of Epidemiology, University of California, Los Angeles (UCLA), CA, USA. She has extensive experience in the design and analysis of causal inference studies, such as clinical trials and real-world evidence, late phase clinical trials, relational medical claim databases, international surveys, case-control studies, and cohort studies. She has published two first-author research articles in the International Journal of Cancer and Translational Oncology. Her research interests include causal inference in lung and colorectal cancer etiology, national policy, and health outcomes. She is currently an Active Member of the American Society of Preventive Oncology and the International Lung Cancer Consortium. She was awarded the Conrad N. Hilton Scholar at UCLA.
YIPING JIAO was born in 1990. He received the bachelor's degree in automation from Jiangnan University and the master's degree in automation from Southeast University, where he is currently pursuing the Ph.D. degree in medical image analysis and digital pathology.