Adaptive Unsupervised Feature Learning for Gene Signature Identification in Non-Small-Cell Lung Cancer

Non-small-cell lung cancer (NSCLC) is the most common type of lung cancer, which accounts for a proportion of nearly 85%. The increasing availability of genome-wide gene expression data has facilitated the identification of gene signatures that are significant to the precise classification of NSCLC subtypes and personalized treatment decisions. Unsupervised feature selection is an effective computational technique for searching the most discriminative feature subset to distinguish different classes and find the potential information embedded in biological data. In this study, we proposed a novel unsupervised feature selection method to identify the gene signatures for NSCLC subtype classification based on gene expression data. The proposed method incorporated linear discriminant analysis, adaptive structure preservation, and $l_{2,1}$ -norm sparse regression into a joint learning framework for unsupervised feature selection to select the informative genes. An effective algorithm was developed to solve the optimization problem in the proposed method. Furthermore, we performed module-based gene filtering before feature selection to reduce the computational cost. We evaluated the proposed method on a gene expression dataset of NSCLC from The Cancer Genome Atlas (TCGA). The experimental results show that the proposed method identified a small number of gene signatures for accurate NSCLC subtype classification. Enrichment analysis of the identified gene signatures was also performed by summarizing the key biological processes.


I. INTRODUCTION
Lung cancer which is a highly lethal malignant disease has become the leading cause of cancer-related death worldwide [1]. Small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC) are the two main types of lung cancer, where NSCLC accounts for a proportion of nearly 85% and has a better prognosis than SCLC [2]. Despite recent therapeutic advances, due to the lack of predictive biomarkers, patients with NSCLC still suffer from bleak outcomes [3]. Many studies have indicated that precise treatment can improve the overall survival of individuals with NSCLC if the subtypes can be identified correctly. It is important to develop novel prognostic biomarkers for subtype classification and treatment optimization in NSCLC.
Recent advances in genome-wide sequencing techniques have enabled the generation of a large amount of gene The associate editor coordinating the review of this manuscript and approving it for publication was Leyi Wei. expression profiles, which facilitate the identification of gene signatures that are significant to the precise classification of NSCLC subtypes and personalized treatment decisions [4]. Many existing studies have addressed to extract gene expression signatures for NSCLC subtype classification including mRNA, miRNA, and lncRNA signatures [5]- [7]. However, directly using the gene profiles as the signatures for NSCLC lung cancer subtype classification usually plagued with redundant information [8].
To reduce the redundant information and remove the useless genes, many feature ranking methods have been applied to select the effective gene signatures [9]- [12]. T-score [13] and Relief-F [14] are two typical feature ranking methods, which consider the genes as individual features and select a set of signatures from the top-ranking for cancer subtype classification. Recursive Feature Elimination (RFE) is a popular feature selection algorithm, which is commonly used with Support Vector Machines (SVM), i.e., SVM-RFE method, to repeatedly construct a model and remove features with low weights [15]. Some variants of SVM-RFE, e.g., by integrating gene expression level and correlation [16] or pathway knowledge [17], have been proposed to alter the ranking criterion [18]. Other method using both expression and network information also has been developed to identify genes that are better indicators for survival [19]. All the above feature ranking methods are supervised, which need class labels or related gene information to select the effective gene signatures for cancer subtype classification.
Unsupervised learning has attracted much attention in recent years, which looks for previously undetected patterns in data without pre-existing labels. Clustering is one of the main methods of unsupervised learning to discover useful information hidden in data [20]. Similar to the clustering methods, unsupervised feature selection has also been utilized to select the informative features/genes which better capture the interesting natural classes of samples [21], [22].
Two of the widely used unsupervised feature selection methods are the filter and embedded methods. The filter methods are simple and fast but ignore the possible feature correlations. Typical filter methods include the max variance (MaxVar) method and the Laplacian score (LapScore) method [23]. In contrast to the filter methods, the embedded methods consider the correlation of features with a learning model simultaneously. A family of methods has been developed to maintain the underlying data structure in the embedded learning processes [24]. These important structures include the global structure [25], [26], the local structure [27], [28], and the discriminative information [29], [30]. However, in most existing unsupervised feature selection methods, the calculation of preserved structures in the embedded space involves all the irrelevant and relevant features, thus the irrelevant features will have adverse effects on the structure characterization for selecting the precise features.
In this study, to select the effective and precise gene signatures for the NSCLC subtype classification, we proposed a novel unsupervised feature selection method which maintains the important data structure by using only the selected features. The proposed method incorporated linear discriminant analysis, adaptive structure preservation, and l 2,1 -norm sparse regression into a joint learning framework for unsupervised feature learning to select the informative genes. In the proposed method, the global structure was captured by the discriminant analysis, and the local structure was revealed by a probabilistic neighborhood graph using only the relevant features. By utilizing l 2,1 -norm regularization to impose row sparsity on the weight matrix, the proposed method optimized for selecting the discriminative genes that were informative for the NSCLC subtype classification. We developed an effective algorithm to solve the optimization problem in the proposed method. Furthermore, module-based gene filtering was performed before unsupervised feature selection to reduce the computational cost. We evaluated the proposed method on a gene expression dataset of NSCLC from The Cancer Genome Atlas (TCGA). The experimental results demonstrate that the proposed method identified a small number of gene signatures for accurate NSCLC subtype classification. We also performed an enrichment analysis of the selected gene signatures by summarizing the key biological processes.

II. ADAPTIVE UNSUPERVISED FEATURE SELECTION A. NOTATIONS
The gene expression dataset is recorded as a data matrix X = [x 1 , x 2 , . . . , x n ] T ∈ R n×m , where x i ∈ R m denotes the ith sample and n is the number of samples. By using g 1 , g 2 ,. . . , g m to denote the m genes, the data matrix can also be denoted as X = [g 1 , g 2 , . . . , g m ]. Unsupervised feature selection is performed with the objective to select the d (d < m) most informative genes that can distinguish the samples originating from different classes.
Assume that the n samples are from c classes. Denote L = [l 1 , l 2 , . . . , l c ] ∈ {0, 1} n×c as the label matrix, where l i = [l 1i , l 2i , . . . , l ni ] T ∈ {0, 1} n×1 is a label vector related to class i, i.e., l ji = 1 if x j is in class i and l ji = 0 otherwise. The scaled class indicator matrix is defined and calculated as

B. LINEAR DISCRIMINANT ANALYSIS
Linear discriminant analysis is to project the data matrix X to a low-dimensional space by a linear transformation matrix W . Thus, the data matrix is transformed to W T X in the low-dimensional space. Let W = [w 1 , . . . , w m ] T ∈ R m×q , where w i is the i th row of W . The total scatter matrix S t and the between-cluster scatter matrix S b are defined as [31] where µ is the mean of the n samples, µ i is the mean of the samples in class i, n i is the number of samples in class i, X = XH n is the centered data matrix of X by H n = I − 1 n 1 n 1 T n . By minimizing the within-cluster distance and maximizing the between-cluster distance in the lower dimensional space, the objective function of linear discriminant is calculated as (4)

C. ADAPTIVE STRUCTURE PRESERVATION
Most existing methods preserve the local structure by constructing a k-nearest neighbor graph. However, the k-nearest neighbor graph is constructed based on all the irrelevant and relevant features, which make the captured local structure to be inevitably affected by the irrelevant features. We considered to preserve the local structure by constructing a probabilistic neighborhood graph based on only the relevant features. VOLUME 8, 2020 Similar to the methods in [32], [33], we used the Euclidean distance to calculate the probabilistic neighborhood. Assume that x i is connected to x j with probability p ij , 0 ≤ p ij ≤ 1. The probabilities of all samples being connected to x i satisfy n j=1 p ij = 1. Let P = (p ij ) n×n . p ij can be calculated by solving the following optimization problem.
where λ is the regularization parameter. The regularization term is used to add a prior uniform distribution and avoid the trivial solution. Note that a small distance x i − x j 2 will lead to a high probability. Thus, With such a nice property, the probability neighborhood can be used for local structure preservation. That is, in the low-dimensional space W T X , the probabilistic neighborhood can be preserved. Thus, we have We proposed a novel unsupervised feature selection method to select the informative genes for accurate NSCLC subtype classification. The proposed method utilized adaptive structure preservation for unsupervised feature selection. Thus, we referred to it as the AUFS method. By incorporating linear discriminant analysis, adaptive structure preservation, and l 2,1 -norm sparse regression into a learning framework, the objective function of the proposed AUFS method can be formulated as where α and β are two balanced parameters. F is constrained to be nonnegative [25], and the condition of F = L(L T L) −1/2 is relaxed to F T F = I c . W T S t W = I is set to avoid the trivial solution by constraining W to be uncorrelated with respect to S t [34]. The term W 2,1 in (7) is set to ensure that W is sparse in rows. The i th row w i corresponds to the weight of gene g i . Thus, the sparsity constraint on rows makes W suitable for gene selection [31]. Each gene is ranked according to w i 2 in descending order and the top genes will be selected.

E. OPTIMIZATION ALGORITHM
It is difficult to derive the closed solution of the optimization problem in (7) directly, since it contains three different variables with different regularizations and constraints. We developed an iterative algorithm to convert the problem into three sub-problems by updating one variable while fixing the other two variables.
We replaced S t withXX T and replaced S b withX FF TX T in (7) according to (2) and (3), respectively. We also replaced F . Therefore, equation (7) can be rewritten as min where γ > 0 is a parameter and it should be set large enough to ensure the orthogonality.
When F and P are fixed, the optimization problem in (8) for updating W is equivalent to the following problem. min Denote U ∈ R m×m as a diagonal matrix with the i th diagonal element being U ii = 1 2 w i 2 . Since ∂ W 2,1 ∂W = 2UW , we constructed an auxiliary function and replace W 2,1 with W T UW in (9), the problem is equivalent to The optimization problem of (10) can be solved by the following generalized eigenproblem.
The solution of (10) is the matrix W ∈ R m×q in which the column vectors are the q smallest eigenvectors corresponding to the q smallest eigenvalues in (11). We further normalized W as (W TXX T W ) ii = 1, i = 1, . . . , q.
2) UPDATE F BY FIXING W AND P When W and P are fixed, updating F is equivalent to the following problem.
Since Tr(W TX FF TX T W ) = Tr(F TX T WW TX F), (12) can be rewritten as Following [24], by denoting M = −X T WW TX , F can be updated by multiplicative rules, as Then, we normalized F to satisfy (F T F) ii = 1, i = 1, . . . , n.
3) UPDATE P BY FIXING W AND F When W and F are fixed, updating P in (8) is equivalent to the optimization problem in (6). Let g ij = W T x i − W T x j 2 . Let g i ∈ R n×1 denote a vector with the j th element being g ij . Let p i ∈ R n×1 denote a vector with the j th element being p ij . Let 1 n ∈ R n×1 denote a vector with all of its elements being 1. The vector form of (6) can be written as The Lagrangian function of (15) is where µ and σ i are the Lagrangian multipliers. Based on the KKT condition [35], we can obtain the optimal solution p ij as Following [32], [33] and assume that g i1 , g i2 , . . . , g in are ordered from large to small, to satisfy that each sample has only k nearest neighbors, the regularization parameter λ can be set based on the number of nearest neighbors k, as Compared to the parameter λ, k is much easier and more intuitive to tune. Thus, λ can be better handled by searching k. Then, based on (18), p ij can obtained as

4) ALGORITHM
The procedure of the proposed AUFS method was summarized in Algorithm 1. It will stop when the objective function of equation (8) tends to a constant or the change is very close to zero. The most time consuming operation of Algorithm 1 is to solve the generalized eigenproblem in (11). The time complexity of this operation is O(m 3 ) approximately, where m is the number of features. The optimization procedure in Algorithm 1 will monotonically decrease the objective function in (7) in each iteration. Since the objective function has lower bounds (e.g., 0), the above iteration will converge. Besides, empirical results showed that the convergence of Algorithm 1 was fast.
In the experiments, only several iterations (no more than 15 iterations) were needed to reach convergence.
The top d ranked genes are selected.

III. MATERIALS A. DATASET
We used a published gene expression dataset that has been studied in [16], which was primarily obtained from TCGA and generated by high-throughput techniques. This dataset contains two major subtypes of NSCLC: lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD). The objective is to distinguish these two subtypes of NSCLC. The samples without sufficient expression data have been excluded and finally 1013 samples were enrolled in the study, where 501 samples belonged to the LUSC subtype and 512 samples belonged to the LUAD subtype. Then, the differentially expressed (DE) genes in each subtype were identified with Edger [36]. All the DE genes were annotated based on the Ensembl database at http://asia.ensembl.org/. By selecting the genes (i.e., mRNAs and lncRNAs) that were differentially co-expressed in both LUAD and LUSC subtypes, 5469 genes (i.e., 3924 mRNAs and 1545 lncRNAs) were obtained for gene selection and classification.

B. INITIAL FILTERING OF GENES
To reduce the computational cost of the proposed feature selection method, we filtered out the genes that were less relevant to the two subtypes of NSCLC. Similar to the previous VOLUME 8, 2020 study in [16], we constructed a dissimilarity matrix based on the topological overlap matrix (TOM), which reflected the gene-wise similarity. A hierarchical clustering tree with all the 5469 genes was then constructed based on the dissimilarity matrix, by which highly connected and co-expressed genes were clustered into some co-expression modules. The first principal component of each module, i.e., the eigengenes, was used to analyze the correlation between the modules and the subtype information. Only the genes in the most relevant modules were kept for further selection. The proposed AUFS method was then performed to select the candidate genes from each module and finally identified a small number of genes from the candidate genes as the informative genes for the NSCLC subtype classification.

A. INITIAL FILTERING OF GENES
We performed module-based gene filtering before applying the proposed feature selection method, with the objective to reduce the computational cost and filter out the genes that were less relevant to the two subtypes of NSCLC. We used the methods in [16] to calculate the dissimilarity matrix, and performed hierarchical clustering to divide the 5469 genes into multiple modules. The relationship between the eigengenes of each module and the labels of two subtypes of NSCLC were analyzed by WGCNA [37]. The WGCNA function was implemented by 'WGCNA' (v1.63) [38]. Figure 1 shows the correlation between the modules and the two subtypes of NSCLC (i.e., LUAD and LUSC). 17 modules denoted by different colors were identified by hierarchical clustering. The modules that were most relevant to LUAD and LUSC are turquoise and grey. We used the genes in turquoise and grey modules for further selection using the proposed AUFS method. Finally, 960 genes in turquoise and 1882 genes in grey were reserved and would be used as input to AUFS.

B. EXPERIMENTAL SETTINGS 1) EVALUATION METRICS
The proposed method was evaluated based on accuracy, sensitivity, and specificity, which are defined as follows. and where TP is True Positives, i.e., the number of positive samples correctly classified as positive, TN is True Negatives, i.e., the number of negative samples correctly classified as negative, FP is False Positives, i.e., the number of negative samples incorrectly classified as positive, and FN is False Negatives, i.e., the number of positive samples incorrectly classified as negative [39].

2) PARAMETER SETTINGS
There are five parameters in the proposed method, i.e., the number of nearest neighbors k, the projected low-dimensions q, and three parameters in the objective function (8), i.e., α, β, and γ . In the experiments, k was set as 5, q was set as 2, and γ was set as 2 6 . α and β were tuned over {1, 10 1 , 10 2 , 10 3 , 10 4 , 10 5 , 10 6 }. We utilized four classifiers, i.e., k-nearest neighbor (kNN) [40], Decision Tree (DT) [41], Support Vector Machine (SVM) [42], and Linear Discriminant Analysis (LDA) [43], for NSCLC subtype classification with the selected genes. We used the toolbox of Matlab to run the four classifiers with default settings. The training data and test data were set as 70% and 30% of the total samples, respectively. We created cross-validation partition for the samples using Matlab function ''cvpartition''. Unsupervised feature selection was performed to rank the genes.
The classifiers were trained and tested using the top-ranked genes. All experiments were repeated 100 times and the mean results of test data were reported.

C. SELECTING CANDIDATE GENES FROM TWO MODULES
The proposed method was performed to select the candidate genes from the turquoise and grey modules, respectively.

1) COMPARISON WITH UNSUPERVISED FEATURE SELECTION METHODS
We compared the proposed AUFS method with several state-of-the-art unsupervised feature selection methods to select the candidate genes from the two modules. The compared methods includes LapScore [23], MCFS [26], NDFS [27], UDFS [30], and GLFS [24]. After initially selecting 960 genes in the turquoise module and 1882 genes in the  grey module, different unsupervised feature selection methods were performed to rank the genes in each module, respectively. The top-ranked genes were picked for training and testing. The number of picked genes varied from 1 to 20. The comparison results by the kNN and SVM classifiers in the turquoise and grey module are shown in Figures 2 and 3, respectively. The proposed AUFS method outperforms other compared methods. The curve of AUFS stays above the five other curves in each case. Note that in the turquoise module, the proposed method can maintain high accuracies when the number of selected genes is no more than three, which is a significant improvement compared to other methods. That is because the proposed method preserves the local structure based on only the relevant features, avoiding the adverse effects by the irrelevant features on the structure characterization for selecting the precise features. Table 1 lists the top 10 genes ranked by the proposed method in the turquoise and grey modules, respectively. These genes are obtained by setting α = 10 6 , β = 10 2 in the turquoise module and α = 10 5 , β = 10 2 in the grey module. The 20 genes were selected as the candidate genes for further selection. The proposed method was performed on the 20 candidate genes again to select the most informative genes as the signatures. We ranked the 20 candidate genes by the proposed method, and the top-ranked candidate genes were picked for training and testing. The number of picked candidate genes varied from 1 to 20. The classification results by the four classifiers (i.e., kNN, DT, SVM, and LDA) on the top-ranked candidate genes are shown in Figure 4. We can see from Figure 4 that all the four classifiers obtain the best results (i.e., highest accuracies) when the number of picked candidate genes equals to 17. These 17 candidate genes are considered to be the most relevant to the two subtypes of NSCLC. Therefore, we identified these 17 candidate genes  as the signatures and listed them in Table 2. Among them, 9 genes were from the turquoise module and 8 genes were from the grey module.

D. ANALYSIS OF SELECTED SIGNATURES 1) COMPARISON WITH OTHER NSCLC SUBTYPE CLASSIFICATION METHOD
Su et al. [16] proposed a gene selection method (called WGRFE) to select the gene signatures to distinguish LUSC from LUAD. They embedded the WGRFE and the standard RFE with SVM and RF to rank the genes, respectively. Their results demonstrated that RF-WGRFE achieved better performance than SVM-WGRFE. RF-WGRFE identified 13 gene signatures. We compared the proposed AUFS method using the 17 identified signatures with RF-WGRFE using the 13 identified signatures. Table 3 presents the comparison results by using four classifiers, i.e., kNN, DT, SVM, and LDA. Both AUFS and RF-WGRFE identified a small number of gene signatures. The genes KRT6A, KRT16, SPRR1B, KRT6B, PERP identified by AUFS were also identified by RF-WGRFE. We can see from Table 3, the proposed AUFS method performs better than RF-WGRFE in most cases.

2) FUNCTIONAL ENRICHMENT ANALYSIS
We used Gene Ontology (GO) [44] to analyze the biological meaning of the gene signatures identified by the proposed AUFS method. The GO enrichment analysis was implemented by using the packages in [45]. Specific GO including biological process (BP), cellular component (CC) and molecular function (MF) were investigated. We analyzed the enrichment of the 17 identified gene signatures in Table 2. The threshold for significant enrichment was set as p-value=0.05. Figure 5 shows the significantly enriched GO terms. Each bar represents a GO term and a higher bar means a higher degree of enrichment. We can see from figure 5 that 24 terms were obtained, which includes 10 BP terms, 11 CC terms, and 3 MF terms. 13 out of the 17 signatures were verified to present biological meaning and the top significantly enriched GO terms are shown in figure 6. Compared to the 13 gene signatures identified in [16], the 17 identified gene signatures obtained lower p-values (as shown in figure 5) and more signatures were verified to present biological meaning (as shown in figure 6). We also note that cornification and keratinocyte differentiation are the most significant terms. The term keratinization has been reported to be associated with LUSC in some previous studies [46].

V. CONCLUSION
The increasing availability of genome-wide gene expression data has facilitated the identification of gene signatures for precise NSCLC subtypes classification. Most existing feature selection methods applied to signature identification are supervised, which need class labels or related gene information. To utilize the useful information hidden in data, we proposed a novel unsupervised feature selection method to identify the most discriminative gene signatures for the NSCLC subtype classification. The proposed method incorporated discriminant analysis, adaptive structure preservation, and l 2,1 -norm sparse regression into a joint learning framework to select the informative genes. We developed an effective algorithm to solve the optimization problem in the proposed method. Furthermore, we perform module-based gene filtering before feature selection to reduce the computational cost and filter out the genes that were less relevant to the subtypes of NSCLC. We evaluate the proposed method on a published gene expression dataset of NSCLC from TCGA, which contained two major subtypes LUAD and LUSC. The experimental results demonstrate that the proposed method identified a small number of gene signatures for accurate subtype classification: i.e., distinguishing LUSC from LUAD. Enrichment analysis of the selected gene signatures was also performed by summarizing the key biological processes. The results show that cornification and keratinocyte differentiation are the most significant GO terms.
Although the proposed method is applied to identify the gene signatures for the NSCLC subtype classification based on gene expression data, the proposed method is generally applicable to other types of biological data and other types of tumors. In this study, we mainly focus on distinguishing the two subtypes LUSC and LUAD of NSCLC. In future work, we will apply the proposed method to classify more other subtypes of NSCLC and also other types of biological data.