Identification of lncRNA Signature Associated With Pan-Cancer Prognosis

Long noncoding RNAs (lncRNAs) have emerged as potential prognostic markers in various human cancers as they participate in many malignant behaviors. However, the value of lncRNAs as prognostic markers among diverse human cancers is still under investigation, and a systematic signature based on these transcripts that is related to pan-cancer prognosis has yet to be reported. In this study, we proposed a framework to incorporate statistical power, biological rationale, and machine learning models for pan-cancer prognosis analysis. The framework identified a 5-lncRNA signature (<italic>ENSG00000206567</italic>, <italic>PCAT29</italic>, <italic>ENSG00000257989</italic>, <italic>LOC388282</italic>, and <italic>LINC00339</italic>) from TCGA training studies (<italic>n</italic> = 1,878). The identified lncRNAs are significantly associated (all <italic>P</italic> <inline-formula><tex-math notation="LaTeX">$\leq$</tex-math></inline-formula> 1.48E-11) with overall survival (OS) of the TCGA cohort (<italic>n</italic> = 4,231). The signature stratified the cohort into low- and high-risk groups with significantly distinct survival outcomes (median OS of 9.84 years versus 4.37 years, log-rank <italic>P</italic> = 1.48E-38) and achieved a time-dependent ROC/AUC of 0.66 at 5 years. After routine clinical factors were included, the signature demonstrated better performance for long-term prognostic estimation (AUC of 0.72). Moreover, the signature was further evaluated on two independent external cohorts (TARGET, <italic>n</italic> = 1,122; CPTAC, <italic>n</italic> = 391; National Cancer Institute), which yielded similar prognostic values (AUC of 0.60 and 0.75; log-rank <italic>P</italic> = 8.6E-09 and <italic>P</italic> = 2.7E-06). An indexing system was developed to map the 5-lncRNA signature to prognoses of pan-cancer patients. <italic>In silico</italic> functional analysis indicated that the lncRNAs are associated with common biological processes driving human cancers.
The five lncRNAs, especially <italic>ENSG00000206567</italic>, <italic>ENSG00000257989</italic>, and <italic>LOC388282</italic>, which have not been reported before, may serve as viable molecular targets common among diverse cancers.


I. INTRODUCTION
IN THE precision medicine era, targeted molecular therapy is the main strategy for the management of cancer patients. Recent basket and umbrella trials demonstrate that actionable mutations are an important predictor of tumor response, and that the same molecular alteration can be effectively controlled across different cancers [1]. Hence, a deep understanding of the molecular events that underlie various biological behaviors is increasingly needed.
LncRNAs are transcripts longer than 200 nucleotides encoded by the genome without protein translation potential [2]. More than 75% of transcripts in the genome are noncoding RNAs [2], [3]. Increasing evidence indicates that lncRNAs function as critical mediators for many aggressive biological behaviors in human cancers [4]-[8]. For example, HOX antisense intergenic RNA (HOTAIR) has been found to be critical in driving multiple malignant behaviors, such as proliferation, migration and invasion, suppression of drug response, and genomic instability [9]. Several studies have also identified the prognostic value of lncRNAs in human cancers [10], [11]. In particular, the lncRNA PCA3/DD3 is uniquely expressed in prostate cancer tissues compared to normal tissues and has already been tested as a biomarker in clinical settings [12]. Such studies provide convincing evidence for the identification of a lncRNA-based signature that is associated with the prognosis of multiple cancers. However, a single lncRNA is not sufficient to reflect the complexity of cancer biological behaviors. Therefore, a study focused on the expression patterns of several lncRNAs could be more accurate and informative.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The Cancer Genome Atlas (TCGA), which consists of original sequencing data from various sources, provides the opportunity to perform integrated studies on the commonalities and differences between diverse cancers. This multi-cancer (pan-cancer) approach to the analysis of such datasets has accelerated the study of the disease and improved treatment efficacy for different types of human cancers [13]-[15]. Through analysis of large-scale TCGA datasets, multiple lncRNAs have been identified as critical factors in gene regulatory network perturbations [16]. In addition, hundreds of lncRNAs have been proposed to be involved in oncogenic genes and pathways in each tumor context [17], and mounting evidence from the study of individual cancers has established lncRNAs as cancer-specific prognosis predictors [18]-[20]. Although lncRNAs are commonly seen as dysregulated in a tumor-specific manner, a recent study demonstrated that some lncRNAs are involved in multiple tumor contexts [21]. Studies of the lncRNA-disease association also suggested that similar diseases tend to be associated with functionally similar lncRNAs [22], [23]. Recent studies provide further justification for the importance of cancer prognosis in the field of pan-cancer research [24]-[26]. Meanwhile, another research group established a seven-gene signature for the prediction of patient prognosis in 13 cancer types [27]. However, the potential of lncRNAs as pan-cancer prognostic markers has not yet been rigorously tested. Such results might provide insight into the biological processes involved in cancer initiation and development that are shared by many different human cancers. Therefore, the main objective of this study is to identify pan-cancer prognostic lncRNAs that are related to common and critical biological processes driving diverse human cancers.
To do so, we propose a pan-cancer prognosis analysis framework that incorporates statistical power, biological rationale, and a machine learning model as a single biomarker identifier. The framework identified five prognostic lncRNAs associated with overall survival (OS) of pan-cancer studies. An indexing system was developed to map the lncRNA signature to the OS of different types of cancer studies. A 5-lncRNA risk score model was established, which stratified patient studies into high- and low-risk groups with significantly distinct survival outcomes. The prognostic power of the five lncRNAs and the risk score model was further validated in the testing dataset and two additional independent cohorts. Functional analysis revealed potential underlying functions of these lncRNAs associated with common and critical cellular processes that drive human cancers. Gene enrichment and qPCR analyses identified ENSG00000206567 as associated with gynecologic cancers.

II. MATERIALS AND METHODS
An overview of our pan-cancer prognosis analysis framework is illustrated in Fig. 1.

A. Data Acquisition and Study Design
A total of 731 lncRNAs are reported by TCGA for pan-cancer analysis (4,266 cancer studies covered). The corresponding RNA-seq data (normalized by the FPKM method [28]) and clinical information were downloaded from the TCGA data portal (https://portal.gdc.cancer.gov/). After the exclusion of 35 cases without complete survival information, a total of 4,231 patients with a broad range of cancer types (n = 33) were enrolled in our study (TCGA cohort; Supplementary Table S1). To identify the prognostic lncRNAs related to multiple cancers, a study cohort (n = 2,210) was identified from the TCGA cohort based on prognosis distribution (a cutoff of 3.5 y was used to separate better- and poor-prognosis patients; Supplementary Table S2) and was stratified into 85% of cases for training/cross-validation (n = 1,878) and 15% for testing (n = 332). The training data were used for the identification of prognosis-related lncRNAs, the test data were used to estimate the prognostic effectiveness, and the overall TCGA cohort served as internal validation. In addition, two independent cohorts provided by the National Cancer Institute (NCI), i.e. Therapeutically Applicable Research to Generate Effective Treatments (TARGET) and Clinical Proteomic Tumor Analysis Consortium (CPTAC), which contain 23 different types of cancer studies, were collected. Similarly, patient data without survival information were excluded and the remaining patient studies, i.e. the TARGET (n = 1,122) and CPTAC (n = 391) cohorts, were used for external validation. Three routine clinical parameters, namely age at diagnosis, gender, and tumor stage, were included as covariables.
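The 85%/15% stratified split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the survival times and expression matrix are randomly generated, the helper `label_prognosis` is hypothetical, and the labeling rule (OS above 3.5 years counts as better prognosis, with censoring ignored) is an assumption, since the exact cohort-selection rule is given only in Supplementary Table S2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def label_prognosis(os_years, cutoff=3.5):
    """Hypothetical helper: 1 = better prognosis (OS > cutoff), 0 = poor.
    Assumption: censoring is ignored in this simplified labeling."""
    return (np.asarray(os_years) > cutoff).astype(int)

rng = np.random.default_rng(0)
n_patients = 2210                                   # study cohort size from the text
os_years = rng.exponential(scale=4.0, size=n_patients)   # synthetic survival times
expression = rng.normal(size=(n_patients, 731))          # synthetic 731-lncRNA matrix
labels = label_prognosis(os_years)

# Stratify on the prognosis label so both parts keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    expression, labels, test_size=0.15, stratify=labels, random_state=0)

print(X_train.shape[0], X_test.shape[0])
```

With 2,210 samples and `test_size=0.15`, scikit-learn rounds the test part up, which reproduces the 1,878/332 partition reported in the text.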

B. Screening of Prognostic LncRNAs Associated With Pan-Cancer Prognosis
As shown in Fig. 1(a), statistical power is combined with machine learning models for the identification and evaluation of independent prognostic lncRNAs in a procedure that contains three major steps:

1) Machine Learning-Driven Stepwise Feature Selection (MLSFS): A stepwise feature selection method, namely recursive feature elimination with cross-validation (RFECV) [29], [30], was used to identify the candidate lncRNAs for prognosis prediction. RFECV is a self-contained algorithm equipped with cross-validation and can work independently without human intervention once configured properly, thus reducing interobserver variability and improving reproducibility and stability. The method is composed of two layers, with Recursive Feature Elimination (RFE) as an inner layer embedded in Cross-Validation (CV) as an outer layer. The outer CV layer consists of n folds of cross-validation. In each fold, the input data are first stratified into train/validation pairs, and then, for each iteration in the inner RFE layer, Random Forest (RF) [31] assigns importance to lncRNAs (features) according to its internal decision trees. The least important feature is eliminated based on the feature importance. The trained RF is evaluated on the validation data to calculate the performance score of the remaining features. Based on the performance scores of the CV folds, the feature combination that achieves the maximum average accuracy score is chosen as the optimal feature set. Finally, the output of one round served as the data source for the next round of RFECV until the number of features no longer changed, which yielded the candidate lncRNAs for further analysis.
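The repeated RFECV procedure can be sketched with scikit-learn as below. This is an illustrative sketch under assumptions rather than the released implementation: the function name `iterated_rfecv`, the number of trees, the number of folds, the round limit, and the synthetic data are all hypothetical choices made to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

def iterated_rfecv(X, y, n_trees=50, cv=5, max_rounds=10, seed=0):
    """Repeat RFECV until the number of kept features stops changing.
    Returns column indices (into the original X) of surviving features."""
    kept = np.arange(X.shape[1])
    for _ in range(max_rounds):
        if len(kept) <= 1:                     # nothing left to eliminate
            break
        selector = RFECV(
            estimator=RandomForestClassifier(n_estimators=n_trees, random_state=seed),
            step=1,                            # drop one least-important feature per iteration
            cv=cv,
            scoring="accuracy")                # feature set with best mean CV accuracy wins
        selector.fit(X[:, kept], y)
        new_kept = kept[selector.support_]
        if len(new_kept) == len(kept):         # feature count invariant: stop
            break
        kept = new_kept
    return kept

# Synthetic stand-in for the expression matrix and prognosis labels.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
kept = iterated_rfecv(X, y, n_trees=20, cv=3)
print("surviving feature indices:", kept)
```

The stopping rule mirrors the description in the text: each round feeds its selected features into the next round, and the loop ends once a round removes nothing.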

2) Selection of Independent Prognostic LncRNAs With Statistical Analyses: We used coxph, a function in the R survival package, to fit Cox proportional hazards regression models, which measure the association between the survival time of cancer studies and one (univariate) or more (multivariate) predictor lncRNAs. Univariate Cox regression analyses were first performed for each of the candidate lncRNAs. LncRNAs that were statistically significant (P < 0.05) in univariate analyses were further evaluated with multivariate Cox regression analyses in which gender, age at diagnosis, and tumor stage were included as covariables. Prognostic lncRNAs with a substantial Hazard Ratio (HR) (|HR − 1| ≥ 0.1) were finally subjected to Kaplan-Meier survival analyses to compute the survival probability over time, for which patient studies were separated into high- and low-expression groups based on the median expression value. The lncRNAs that exhibited a significant survival difference (log-rank test P < 0.001) between the two expression groups were selected. In this work, HR measures the relative risk associated with a candidate lncRNA: HR > 1 indicates a risk factor, HR < 1 a protective factor, and HR = 1 a factor of no prognostic value.
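The final filter in this step, a median split followed by a log-rank test, can be illustrated in plain Python. The sketch below is a textbook two-group log-rank test written for clarity, not the R code used in the study; the function names and toy data are hypothetical, and assigning values equal to the median to the low group is an assumption.

```python
import math
from statistics import median

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-sided log-rank p-value for two groups (chi-square, 1 degree of freedom).
    events: 1 = death observed, 0 = censored."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)] + \
           [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for ti, _, _ in data if ti >= t)                # at risk, both groups
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)   # at risk, group A
        d = sum(1 for ti, e, _ in data if ti == t and e)          # deaths at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))     # chi-square survival function, 1 df

def median_split_logrank(expr, times, events):
    """Split patients at the median expression (assumption: ties go to 'low')."""
    cut = median(expr)
    hi = [i for i, x in enumerate(expr) if x > cut]
    lo = [i for i, x in enumerate(expr) if x <= cut]
    return logrank_p([times[i] for i in hi], [events[i] for i in hi],
                     [times[i] for i in lo], [events[i] for i in lo])

# Toy data: higher expression goes with longer survival, all deaths observed.
expr = list(range(20))
times = [i + 1 for i in range(20)]
events = [1] * 20
p_value = median_split_logrank(expr, times, events)
print(p_value < 0.001)
```

A lncRNA would pass this filter when the returned p-value is below the 0.001 threshold stated above.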
3) Evaluation With the Machine Learning Model: RF was introduced as the prognostic performance estimator. This supervised machine learning model leverages multiple decision trees for ensemble voting and reduces the classification variance through training on different parts of the training data [32]. A feature bagging technique is integrated to improve model performance by selecting a random subset of the features in each iteration [33]. The classifier was implemented using the scikit-learn library (https://scikit-learn.org/) in the present study. To tune the hyperparameters, 5-fold nested cross-validation was adopted, which partitioned the training/cross-validation data (n = 1,878) into 4:1 held-out dataset pairs. In the inner loop, each iteration (fold) used additional cross-validation to search hyperparameters and fit the model. The following hyperparameter values were selected for their superior performance under the nested cross-validation: a total of 1000 trees in the forest, a maximum tree depth of 8, at least 25 samples to split an internal node, and at least 4 samples at a leaf node.
The predictive performances of identified lncRNAs, three clinical factors, or their combination were further evaluated and compared on the test set (n = 332) after the model was trained on full training/cross-validation data using the identified hyperparameters.
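The evaluation step can be sketched with scikit-learn as follows. The four hyperparameter values in `reported_forest` are the ones stated in the text; everything else (the function names, the search grid, the use of ROC AUC as the tuning score, the fold seeds, and the synthetic data) is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def reported_forest(n_estimators=1000, seed=0):
    """Random forest configured with the hyperparameter values stated in the text."""
    return RandomForestClassifier(
        n_estimators=n_estimators, max_depth=8,
        min_samples_split=25, min_samples_leaf=4, random_state=seed)

def nested_cv_auc(X, y, param_grid, n_estimators=100, seed=0):
    """5-fold nested CV: the inner loop searches hyperparameters,
    the outer loop scores the tuned model on held-out folds."""
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed + 1)
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=n_estimators, random_state=seed),
        param_grid, scoring="roc_auc", cv=inner)
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)

# Synthetic stand-in for five lncRNA features and a binary prognosis label.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
grid = {"max_depth": [4, 8], "min_samples_leaf": [4, 8]}   # hypothetical grid
scores = nested_cv_auc(X, y, grid, n_estimators=20)
print("outer-fold AUCs:", scores.round(3))

final_model = reported_forest(n_estimators=50).fit(X, y)   # fewer trees to keep the demo fast
```

Each outer fold holds out one fifth of the data, which corresponds to the 4:1 partition mentioned above; the final model is then refit on all training data with the chosen values.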

C. Mapping of LncRNA Signature To Patient Prognosis
The study cohort was stratified into high- (H) and low-expression (L) groups based on the median expression values of the corresponding lncRNAs (Supplementary Table S3). There are 2^n possible expression permutations for a signature composed of n lncRNAs. The relationship among the pan-cancer studies, permutations of the lncRNA signature, and corresponding patient survival outcomes was visualized in a Sankey diagram. The diagram was constructed using the Python plotly library.
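A small sketch of the H/L encoding may help make the indexing concrete. It assumes that a value strictly above the cohort median is labeled H and anything else L (the tie rule is not stated in the text); the function name and the toy expression values are hypothetical.

```python
from itertools import product
from statistics import median

def phenotype_strings(expression_rows):
    """Map each patient (a list of n lncRNA values) to an H/L string.
    Assumption: 'H' means strictly above the cohort median of that lncRNA."""
    n = len(expression_rows[0])
    cutoffs = [median(row[j] for row in expression_rows) for j in range(n)]
    return ["".join("H" if row[j] > cutoffs[j] else "L" for j in range(n))
            for row in expression_rows]

# All 2^n possible patterns for a 5-lncRNA signature.
all_phenotypes = ["".join(p) for p in product("HL", repeat=5)]
print(len(all_phenotypes))

# Toy cohort: four patients, two lncRNAs.
toy = [[1, 10], [2, 20], [3, 30], [4, 40]]
print(phenotype_strings(toy))
```

Each resulting string can serve as the middle node of a Sankey diagram that links cancer type on one side to survival outcome on the other.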

D. Construction of Prognostic Signatures and Performance Validation
The overall TCGA cohort was stratified into low- and high-expression subgroups based on the median expression value of each identified lncRNA. The prognostic value of each lncRNA was first evaluated by conducting Kaplan-Meier analysis and log-rank tests on the subgroups.
Two risk score systems were then derived to estimate patient mortality based on the identified lncRNA signature and a hybrid signature (composed of lncRNAs and clinical factors), respectively, as follows:
$$\text{Risk score} = \sum_{k=1}^{n} \text{Coef}_k \times \text{Val}_k$$
where Coef_k is the coefficient of risk factor k (lncRNA or clinical factor) measured by multivariate Cox regression analyses, Val_k is the value of risk factor k, and n is the number of factors.
To evaluate the prognostic significance of the signature, the overall TCGA cohort was stratified into low- and high-risk subgroups based on the median risk score. The prognostic value of the signature was measured by Kaplan-Meier analysis and the log-rank test. Time-dependent ROC/AUCs at three, five, and ten years were then calculated. Similar measurements were also conducted on two external datasets to further validate the effectiveness of the identified signature on different platforms.
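The scoring and stratification can be sketched in a few lines, reading the risk score as the coefficient-weighted sum of factor values, which is what the definitions of Coef, Val, and n imply. The coefficients and patient values below are invented for illustration and are not the fitted Cox coefficients from the study; assigning scores equal to the median to the low-risk group is also an assumption.

```python
from statistics import median

def risk_score(values, coefs):
    """Weighted sum of factor values: sum over k of Coef_k * Val_k."""
    return sum(c * v for c, v in zip(coefs, values))

def stratify_by_median(scores):
    """Label each patient 'high' or 'low' risk relative to the median score.
    Assumption: scores equal to the median are labeled 'low'."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]

# Hypothetical coefficients for five lncRNAs (negative = protective).
coefs = [-0.20, -0.15, -0.10, 0.25, -0.05]
patients = [
    [5.0, 3.0, 2.0, 0.5, 1.0],
    [1.0, 0.5, 0.2, 4.0, 0.3],
    [3.0, 2.0, 1.0, 2.0, 0.8],
    [0.5, 0.2, 0.1, 5.0, 0.1],
]
scores = [risk_score(p, coefs) for p in patients]
groups = stratify_by_median(scores)
print(list(zip([round(s, 3) for s in scores], groups)))
```

The hybrid model follows the same pattern, with the clinical factors appended to each patient's value list and their coefficients appended to `coefs`.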

E. Analysis of LncRNA Functions and Their Correlation With Clinical Factors
Spearman correlation coefficients were computed to measure the correlation between each identified lncRNA and genome-wide RNA-Seq profiles (the other 60,482 genes). Genes that correlated with at least one of the lncRNAs in the TCGA cohort (Spearman correlation coefficients > 0.40 or < -0.30) were identified as co-expressed genes and included in the functional enrichment analyses. The coefficient thresholds were determined in consideration of values reported in the literature as well as the characteristics of the datasets [34], [35]. The analyses were performed using Metascape, an online gene annotation and analysis platform. The correlation between each lncRNA and clinical factors, including gender, cancer tissue of origin, age at diagnosis, and tumor stage, was evaluated using annotated heatmaps based on RNA-Seq data of the TCGA cohort, Fig. 1(d). The co-expressed genes included in this analysis were selected using the same criteria as in the functional analysis described above. Patient molecular data and their corresponding clinical parameters were sorted accordingly, so that the correlation between the target lncRNA and the clinical factors could be observed.
Differential lncRNA expression was evaluated by quantitative real-time polymerase chain reaction (qRT-PCR) using additional tissue sections (provided and approved by the Medical Ethics Committee of Qilu Hospital, Shandong University; consent was received from all involved subjects).
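The asymmetric co-expression filter can be sketched in plain Python. The thresholds (above 0.40 or below -0.30) come from the text; the function names, the gene labels, and the toy profiles are hypothetical, and the Spearman coefficient is computed here as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:                 # constant profile: treat as uncorrelated
        return 0.0
    return cov / (sx * sy)

def coexpressed(lncrna, genes, pos=0.40, neg=-0.30):
    """Keep genes whose rho with the lncRNA is > pos or < neg."""
    kept = {}
    for name, profile in genes.items():
        rho = spearman(lncrna, profile)
        if rho > pos or rho < neg:
            kept[name] = rho
    return kept

lncrna = [1, 2, 3, 4, 5]
genes = {                                  # hypothetical gene labels and values
    "GENE_UP": [10, 20, 30, 40, 50],
    "GENE_DOWN": [50, 40, 30, 20, 10],
    "GENE_WEAK": [3, 1, 5, 2, 4],
}
print(sorted(coexpressed(lncrna, genes)))
```

Genes returned by `coexpressed` for any of the five lncRNAs would form the input list for enrichment analysis.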

F. Statistical Analysis
Univariate and multivariate Cox proportional hazard regression analyses were performed to measure the association between lncRNAs (and/or clinical parameters) and patient overall survival. HR was calculated to measure the prognostic impact of various factors and a 95% confidence interval (CI) was used to indicate the precision of the estimated HR. Kaplan-Meier survival analyses were conducted to measure the overall survival difference between risk groups. The log-rank test was performed to evaluate the statistical significance. Time-dependent ROC analysis [36], which is used to assess the predictive power of diagnostic markers for time-dependent disease outcomes, was used to measure the predictive performance of the identified lncRNAs.
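For readers less familiar with the Kaplan-Meier estimator behind the reported median OS values, a minimal plain-Python sketch is given below. It is a generic product-limit estimator written for illustration, not the study's R code; the function names and toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate. Returns [(t, S(t))] at each distinct death time.
    events: 1 = death observed, 0 = censored."""
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv, curve = 1.0, []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:   # group subjects sharing time t
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed                            # deaths and censored leave the risk set
    return curve

def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None                                       # median not reached

# Toy data in years; the second and fourth patients are censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0])
print(curve, median_survival(curve))
```

Comparing the curves of two risk groups with a log-rank test then gives the group-level p-values reported in the results.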

III. RESULTS

A. Screening of Independent Prognostic LncRNAs
The MLSFS method yielded 26 candidate lncRNAs that were associated with OS in TCGA training studies without considering clinical variables. After being subjected to statistical analyses, ZNF883, ENSG00000277476, and PVT1 were first filtered out as they were not significantly associated with OS in univariate Cox regression analyses (light gray, Table I). Next, clinical covariables were included and 11 lncRNAs were further removed based on the HR (|HR − 1| < 0.1) calculated in multivariate Cox regression analysis (dark gray, Table II). Kaplan-Meier survival analysis identified ENSG00000226380, FOXM1, and LINC02637 as insignificant factors for stratifying different risk groups (light gray, Table II). In the second-round MLSFS, MEG8, MIR3142HG, and MIR4435-2HG were excluded because they did not emerge as significant factors in ML models once clinical features were included. Although MEG3 exhibited prognostic significance, this lncRNA was not considered since it is a known prognostic biomarker in multiple cancers reported in the public literature (public domain knowledge). Thus, the signature associated with pan-cancer prognosis identified in this study is composed of five lncRNAs, i.e. ENSG00000206567, PCAT29, ENSG00000257989, LOC388282, and LINC00339.

B. Mapping of 5-lncRNA Signature to Patient Prognosis
The signature is composed of five distinct lncRNAs and thus has 2^5 = 32 possible permutations if each lncRNA is stratified into high (H) and low (L) expression levels using the median value as cutoff (Supplementary Table S3). A Sankey diagram was developed to visualize the relationship between the lncRNA signature (represented as 32 expression permutations or 5-lncRNA phenotypes) and patient prognosis in the study cohort, as shown in Fig. 2. The right part of the diagram can be consulted independently; for example, the majority of patients with lncRNA phenotypes HHHLH, HHHLL, and HHLLH (favorable phenotypes) are related to better prognosis, whereas patients with LLHHL and LLLHL  [39], i.e. BRCA, UCEC, and SKCM patients have a much higher five-year survival rate (>90%, 90%, and 92% respectively) compared to HNSC (50%). In addition, we found that patients with LUSC and LUAD, unlike those with BRCA, UCEC, SKCM, and HNSC, are not densely mapped to any specific lncRNA phenotypes, which indicates that the prognoses of LUSC and LUAD patients are more heterogeneous. This finding is supported by recent research studies [40], [41]. The prognostic value gradually declined as the phenotype arrangements approached the center of the diagram.

C. Prognostic Performance of The 5 LncRNAs on Test Data
The RF model with the 5-lncRNA signature demonstrated a prognostic AUC performance of 0.714 on the test data, which is significantly superior to the predictive performance of routine clinical factors (AUC of 0.656), Fig. 3(a). After the combination of the 5-lncRNA signature with clinical parameters, the model yielded a classification AUC of 0.758 for differentiation of better vs. poor prognosis in test studies, Fig. 3(a). These results indicated that the signature exhibited good discrimination and calibration.

D. Construction of 5-lncRNA Risk Score Model and Validation on Overall TCGA Cohort
The five lncRNAs were all found to be significantly associated with OS (log-rank test P ≤ 1.48E-11) in the TCGA cohort (n = 4,231), Fig. 3(b)-(f). ENSG00000206567, PCAT29, ENSG00000257989, and LINC00339 tended to be protective factors since their high expression was related to longer survival (Fig. 3(b)-(d) and (f)), whereas LOC388282 might be a risk factor because its high expression was associated with poor prognosis (Fig. 3(e)).
The relative contributions (coefficients) of the five prognostic lncRNAs to OS were derived from multivariate Cox regression measured in the study cohort.  63E-35). The overall TCGA studies were stratified into two risk groups using the median risk score as cutoff, and Kaplan-Meier analysis demonstrated that patients' OS in the low-risk group is significantly better than in the high-risk group (median OS of 9.84 versus 4.37 years, log-rank test P = 1.48E-38), Fig. 4(a). Time-dependent ROC analysis showed that the score model achieved an AUC of 0.66 at five years (95% CI: 0.632-0.684), which is slightly higher than at three (AUC of 0.64) and ten years (AUC of 0.63). P values were calculated by two-sided log-rank tests.
Combined with gender, age at diagnosis, and optional tumor stage, the hybrid risk score model achieved a higher time-dependent ROC/AUC of 0.718 (95% CI: 0.677-0.760) for long-term (ten-year) risk estimation but no significant boost at three (AUC of 0.68) and five years (AUC of 0.67), Fig. 4(b). Similarly, the hybrid model stratified patient studies into two risk groups with distinct survival outcomes (median OS of 11.86 years in the low-risk group compared with 4.22 years in the high-risk group, log-rank test P = 4.86E-51).

E. Validation of 5-lncRNA Risk Score Model on Another Two Independent External Cohorts
The risk score model was further applied to the TARGET (n = 1,122) and CPTAC (n = 391) cohorts using the coefficients derived from the TCGA study cohort. The lncRNA risk score achieved an HR of 2.393 (95% CI: 1.611-3.553, P = 1.53E-05) on TARGET and 1.922 (95% CI: 1.436-2.572, P = 1.121E-05) on CPTAC. Each of the two external cohorts was stratified into high- and low-risk groups. Time-dependent ROC analysis showed that the lncRNA model achieved an AUC of 0.60 at three years and 0.59 at five years on TARGET, and an AUC of 0.75 at three years on the CPTAC cohort (the 5-year AUC could not be calculated due to the uneven survival distribution), Fig. 5(a). Kaplan-Meier analyses showed that the two risk groups in TARGET and CPTAC had significantly different survival outcomes (log-rank test P = 8.61E-09 and P = 2.68E-06, respectively), Fig. 5(b)-(c). The performance of the hybrid model (lncRNA with clinical factors) is not presented here because sufficient clinical parameters, especially age at diagnosis and tumor stage, were not provided in the two datasets.
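As an arithmetic cross-check, the reported hazard ratios, confidence intervals, and P values can be related to one another under the assumption that they are Wald-type quantities on the log-hazard scale, which is the default in most Cox regression software but is not stated explicitly in the text. The function name below is hypothetical.

```python
import math

def wald_from_ci(hr, lower, upper, z_crit=1.959964):
    """Recover log-HR, its standard error, the Wald z, and the two-sided P
    from an HR and its 95% CI. Assumes a symmetric CI on the log scale."""
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail probability
    return beta, se, z, p

for cohort, hr, lo, hi in [("TARGET", 2.393, 1.611, 3.553),
                           ("CPTAC", 1.922, 1.436, 2.572)]:
    beta, se, z, p = wald_from_ci(hr, lo, hi)
    # The geometric mean of the CI bounds should land close to the reported HR.
    print(cohort, round(math.sqrt(lo * hi), 3), round(z, 2), f"{p:.2e}")
```

Under this assumption the recovered P values fall on the order of 1E-05 for both cohorts, in line with the values reported above.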

F. Identification of Biological Processes Associated With the LncRNA Signature
By computing Spearman correlation coefficients between the lncRNA signature and genome-wide RNA-Seq data in the TCGA cohort, a total of 1,299 genes were found to be positively or negatively correlated with at least one of the five lncRNAs in our signature. Functional and pathway analyses showed that the genes co-expressed with the prognostic lncRNAs were significantly enriched in 215 GO terms and 18 KEGG pathways (Log10(P) < -2, i.e. P < 0.01), which are mainly involved in cilium movement and organization, cell projection, lymphocyte activation, T cell activation and differentiation, cellular response, primary immunodeficiency, and metabolic processes of xenobiotics, drugs, flavonoids, hormones, retinoic acid, and so on (Table III).
Top enriched pathways are also related to diverse human cancers and are critical in cancer progression and treatment. For example, drug metabolism affects multidrug resistance and chemotherapy in cancer [88]; chemical carcinogenesis is a major factor in the etiology of cancer [89], [90]; ECM-receptor interaction is involved in six critical cancer hallmarks [91], [92]; patients with primary immunodeficiency are at increased risk of developing certain cancers [93], [94]; ascorbate and aldarate metabolism and cytokine-cytokine receptor interaction are related to clinical outcomes of colorectal cancer and recurrence of childhood acute lymphoblastic leukemia [95]-[97]; steroid hormone biosynthesis was identified as a critical target for breast and prostate cancer therapy [98], [99]; and a hematopoietic stem cell approach was recently used in cancer immunotherapy [100], [101]. More interestingly, the co-expressed genes are also enriched in Malaria and Chagas disease pathways (Table III).
TABLE III: TOP ENRICHED GO TERMS AND KEGG PATHWAYS RELATED TO LNCRNA SIGNATURE

G. Association Between ENSG00000206567 and Gynecologic Cancers
Expression patterns of the five lncRNAs were investigated by identifying the co-expressed genes of each lncRNA in the entire TCGA cohort and ranking the studies according to the corresponding lncRNA expression value. This gene enrichment analysis was used to determine whether gene expression patterns plotted according to a specific lncRNA were associated with any clinical factors, including age at diagnosis, gender, cancer type, and so on. Genes positively correlated with ENSG00000206567 displayed relatively higher expression in gynecologic cancers (also known as female reproductive system cancers), including ovarian cancer (OV), uterine corpus endometrial cancer (UCEC), cervical squamous cell carcinoma (CESC), and breast cancer (BRCA), than in other human cancers (heatmap in Fig. 6(a)). However, no such significant correlation was observed for the other four lncRNAs (data not shown here). To further validate this finding, qRT-PCR was conducted using RNAs isolated from breast and cervical tumors and normal tissue pairs. Total RNA was isolated with TRIzol Reagent (Thermo Fisher Scientific; Waltham, MA, USA) following the manufacturer's instructions. cDNA was synthesized from total RNA (2 µg) using a reverse transcription kit (Toyobo Life Science; Osaka, Japan). qPCR was performed in triplicate using 1 µL of cDNA in a standard SYBR Premix Ex Taq (Roche; Pleasanton, CA, USA) on a Real-Time PCR Detection System (480II, Roche). As a result, ENSG00000206567 was found to be differentially expressed in tumor and normal tissue pairs (Fig. 6(b)-(c)).

IV. DISCUSSION AND CONCLUSION
Along with the improvement of sequencing techniques, more meaningful noncoding transcripts have been identified as underlying drivers in human cancers [102]-[104]. LncRNAs, the "dark matter" of the non-coding family, have been gradually identified as cancer prognostic predictors and proposed as therapeutic targets in recent years [11], [105]. In our present work, a combined statistical and machine learning analysis enabled us to identify and validate lncRNA biomarkers from pan-cancer studies. The 5-lncRNA risk score model stratified cancer studies into high- and low-risk groups with significantly distinct survival outcomes. An indexing system was developed to illustrate the relationship between the 5-lncRNA signature and OS of different types of cancer patients. The lncRNA signature demonstrated better predictive performance than clinical factors and achieved more accurate longer-term (10-year) mortality prediction after being combined with gender, age at diagnosis, and tumor stage.
The signature was found to be associated with characteristics and processes fundamental to human cancers. To a large extent, this relationship supports the signature as a pan-cancer prognostic biomarker. For example, extracellular matrix (ECM) receptor interaction [106], cilium organization [107], and T cell activation [108] were related to positively co-expressed genes of our signature, whereas xenobiotic metabolism [109], [110], inflammatory response [111], and cell adhesion regulation [112] were related to negatively co-expressed genes. ECM and its receptors have been demonstrated to be involved in all six hallmarks of human cancers [91], including resistance to cell death, induction of angiogenesis, replicative immortality, invasion and metastasis, loss of growth suppression, and dysregulated proliferation [113]. The 5-lncRNA signature might, therefore, be a molecular reference for understanding the interplay between ECM, cancer cells, and stromal cells, which might be critical for cancer prevention [91]. The correlation with primary immunodeficiency, T cell activation, inflammatory response, and drug metabolism also implicates a role in the immune response to cancer [114]. The prognosis of cancer patients has been linked to the activation status of innate T cells, which are the front line of host defense against cancer [115]. Malaria and Chagas disease have been found to be associated with cancer [116]-[119]. The relation to Malaria and Chagas disease pathways indicates that the lncRNA signature identified from the pan-cancer scenario captures shared characteristics that drive diverse human diseases. This may facilitate drug repurposing; for example, Weyerhäuser et al. repurposed the antimalarial chloroquine for treatment of glioblastoma [119] and Kraus et al. used anti-cancer drugs to treat Chagas disease [118]. Functional and tissue studies are needed, however, to determine the exact role of these lncRNAs in these cellular processes.
There are considerable difficulties in the identification of prognostic biomarkers in the pan-cancer scenario due to cancer-specific context and cancer subtypes in the clinic. However, interdisciplinary research collaborations among computer scientists, biologists, and clinicians expand the capacities and opportunities for finding prognostic predictors that are commonly effective among multiple cancers. With the help of artificial intelligence, bioinformatics analysis, and multi-center genomic databases, recent achievements in this field further support the significance and viability of research on lncRNAs with pan-cancer prognostic significance. For instance, a recent study investigated the unique molecular features common to pan-gynecologic cancers and related them to the prognosis of 2,579 patient studies [120]. Another study related lncRNAs to cancer patient prognosis by identifying survival-related lncRNAs in a multiple-cancer context [121]. The present study demonstrates the value of pan-cancer research for capturing shared characteristics among cancer-specific prognostic markers. Rather than conflicting with cancer-specific prognosis, the investigation of pan-cancer prognosis opens up a new perspective for a better and more complete understanding of cancer mechanisms.
In conclusion, we identified prognostic lncRNAs that are significantly associated with the OS of pan-cancer studies by combining a machine learning model and traditional statistical analysis into a single biomarker identifier. Although it requires intensive computing power, the proposed framework is a new, streamlined, and intuitive pan-cancer analysis approach that can be easily understood and implemented. In addition, the lncRNA signature identified with the framework is robust, independent of clinical factors, and demonstrated significant prognostic value on different pan-cancer cohorts and cancer types. As far as we know, this is the first time that lncRNA expression profiles have been mapped to patient prognosis across multiple cancer types. The lncRNA indexing system and the risk score model may provide insights for developing a "cancer prognosis screening" tool. Meanwhile, the identified lncRNAs were found to be associated with common biological processes fundamental to cancer initiation and progression, and the majority of them have never been reported before; they may serve as novel molecular targets for therapy and show efficacy across a broad spectrum of human cancers.

V. DATA AVAILABILITY AND EXPERIMENT REPRODUCIBILITY
All the data used in our study can be downloaded from the TCGA data portal using the official GDC Data Transfer Tool. The computer code used in this study and supplementary material have been released at: https://github.com/guoqingbao/PanCancerLncRNA.