Logistic Regression Analysis for LncRNA-Disease Association Prediction Based on Random Forest and Clinical Stage Data

An increasing number of studies have found that lncRNA plays an important role in various life processes of the body. Current prediction research on lncRNA-disease associations overlooks correlation analysis of disease prognosis. In this study, a logistic regression prediction model based on tumor clinical stage data and the expression quantity of lncRNA transcripts is constructed. The proposed model is based on unknown human lncRNA-disease associations combined with the clinical stage data. Firstly, the importance of each characteristic variable is calculated by the proposed CVSgC-RF algorithm. Secondly, the 95 lncRNAs most closely related to prostate cancer are selected from 480 candidate lncRNAs by CASO and CVSe-CS-CF. From these 95 lncRNAs, the CSPA-PL algorithm then selects the 22 lncRNAs most closely related to the tumor clinical stage of prostate cancer. Finally, these 22 lncRNAs are used to construct a logistic regression prediction model. Additionally, the method is applied to lung cancer data, where 16 lncRNAs are selected to construct a logistic regression prediction model. Experimental results show that, for both prostate cancer and lung cancer, the proposed method achieves the best ROC area, accuracy and recall rate among the compared methods, which provides a promising basis for subsequent prediction studies of lncRNA-disease associations.


I. INTRODUCTION
Long non-coding RNA (lncRNA) is non-coding RNA more than 200 nucleotides in length [1]. It has very important biological functions and is an important research area in bioinformatics [2], [3]. Studies show that lncRNA is closely correlated with many diseases, such as lung cancer [4], [5], Alzheimer's disease [6], osteosarcoma [7], breast cancer [8], gastric cancer [9], colon cancer [10], prostate cancer [11], and cervical cancer [12]. More and more researchers are working in this area because lncRNA is an important molecular target in the diagnosis and treatment of disease. It is extremely important to study the relationship between lncRNA and the prognosis of cancer patients by utilizing clinical data. Research on lncRNA is still in its initial stages, and little is known about its deeper mechanisms in the occurrence and development of cancer. Therefore, it is important to use bioinformatics combined with clinical data to study the lncRNAs that have a significant impact on the prognosis of cancer patients.
The associate editor coordinating the review of this manuscript and approving it for publication was Vincenzo Conti.
Relevant research results in recent years are broadly divided into three types, as follows.
The first type comprises machine-learning-based methods that use known disease-related lncRNAs. For instance, Yu et al. [13] proposed a new method called CFNBC, based on the Naïve Bayes classifier, to predict lncRNA-disease associations. The novelty of CFNBC lies in the introduction of the item-based collaborative filtering algorithm and the Naïve Bayes classifier, which allow CFNBC to predict potential lncRNA-disease associations efficiently without entirely relying on known miRNA-disease associations. Cui et al. [14] developed a novel model called BLM-NPAI for predicting lncRNA-disease associations. The main advantage of BLM-NPAI was that it could also make predictions using nearest neighbors for lncRNAs and diseases without any known association. Chen and Yan [15] used semi-supervised learning to predict the potential associations between lncRNAs and diseases, and proposed the first lncRNA-disease association prediction model (LRLSLDA) on the premise that lncRNAs with similar functions tend to be associated with similar diseases. However, the model exhibited high computational complexity, and many parameters needed to be selected in the calculation process. Huang et al. [16] improved the calculation of disease similarity within the LRLSLDA framework to further improve the prediction results, and presented a new method, ILNCSIM. This approach kept the general hierarchical structure information of disease DAGs and calculated disease similarity with an edge-based method. The prediction performance was improved to some extent, but some limitations remained. For example, the similarity score in the model needed to be further optimized, the absence of real but unrecorded lncRNA-disease associations had a large impact on the model, and multiple types of data were not integrated.
Chen [17] built a new approach (KATZLDA) by integrating known lncRNA-disease associations, lncRNA expression profiles, lncRNA functional similarity, disease semantic similarity and Gaussian interaction profile kernel similarity to predict potential lncRNA-disease associations. The biggest advantage of KATZLDA was that it could be effectively applied to new diseases and lncRNAs without any known associations. However, the learning network built by KATZLDA was based on known associations, so its predictions were limited by existing knowledge. Zhao et al. [18] constructed a multi-source data set by integrating multidimensional data (genome, regulatory group and transcriptome), and proposed a Bayesian classification method using this multi-source data to predict lncRNA-disease associations. Experimental results showed that this method successfully identified 707 lncRNAs related to human cancer. However, this method was a supervised classification algorithm that required a large number of negative cases, which are difficult to obtain.
The second type is network-based methods. For instance, Li et al. [19] presented a novel network consistency projection approach called NCPLDA for lncRNA-disease association prediction. The network was built by integrating the lncRNA-disease association probability matrix with the integrated disease similarity and lncRNA similarity. Zhou et al. [20] proposed a new method (RWRHLD) that built a heterogeneous network of lncRNA-disease associations, on which a random walk algorithm was executed. However, the incomplete coverage of the lncRNA crosstalk network and of the lncRNA-disease associations could lead to inaccurate predictions. Liu et al. [21] established a bidirectional network of protein-coding genes (PCG) and lncRNA for prostate cancer from protein interaction databases and from lncRNA and PCG expression maps, and further realized lncRNA-disease association prediction based on this network. However, the performance of the method was limited by the incomplete protein interaction database.
The third type is RW-based methods (RW is short for random walk). For instance, Li et al. [22] proposed a prediction model called LRWHLDA for inferring lncRNA-disease associations. LRWHLDA can be applied when known lncRNA-disease associations are lacking by using an improved local random walk method. Yu et al. [23] used multidimensional heterogeneous data to construct networks of lncRNAs with similar functions, and used the disease ontology to construct disease networks. On this basis, BRWLDA was proposed to predict lncRNA-disease associations. BRWLDA improved the random walk model and the prediction performance to some extent.
To summarize, the limitations of current research are described in the review [24] and in the discussion above. Current studies have ignored correlation analysis of clinical prognosis, and the prediction models have been limited to forecasting single lncRNAs. Clinical prognosis information for lncRNA-associated diseases is rarely involved, such as tumor clinical stage, tumor pathological stage, survival time, disease status, and family history of genetic diseases.
In this study, a logistic regression prediction model of lncRNA-disease associations based on tumor clinical stage data was constructed. Three kinds of circular allelism operations (the center, X-axis, and Y-axis allelism operations on a circular subset sub[a,b]) were proposed for the prediction model. A random-forest-based calculation of the significance of characteristic variables was proposed, and a selection algorithm for the characteristic variables was given. Finally, the clinical stage prediction algorithm for cancer-associated lncRNA was implemented using the reduced set of characteristic variables. Experimental results showed that the proposed method achieved higher predictive performance than the compared methods.

A. LNCRNA DATA
The lncRNA expression data for prostate cancer were obtained from the lncRNAtor database [25]. A total of 220 samples were obtained (denoted by $S_{normal \cup tumor} = \{S_1, \dots, S_{220}\}$), including 44 normal samples (denoted by $S_{normal} = \{S_1, \dots, S_{44}\}$) and 176 cancer samples (denoted by $S_{tumor} = \{S_{45}, \dots, S_{220}\}$). Based on the differential expression P-value ($P \le 0.001$) of lncRNA transcripts between $S_{normal}$ and $S_{tumor}$, 480 lncRNA transcripts with significant differences (denoted by $Lr_1, Lr_2, \dots, Lr_{480}$ in ascending order of P-value) were obtained. These 480 lncRNA transcripts are denoted by $Lr = \{Lr_1, Lr_2, \dots, Lr_{480}\}$, $Lr_{sub}$ is a subset of $Lr$ ($Lr_{sub} \subseteq Lr$), and the expression of $Lr$ on sample $S_i$ is denoted by $Lr^{S_i} = \{Lr^{S_i}_1, Lr^{S_i}_2, \dots, Lr^{S_i}_{480}\}$ ($1 \le i \le 220$). Figure 1 shows the details of the top 20 of the 480 lncRNA transcripts, where each row represents one lncRNA transcript. The first column is the rank of the transcript, the second column is the Ensembl gene ID, the third column is the gene name, the fourth column is the Ensembl transcript ID, and the fifth column contains the P-values of the differential expression between normal samples and cancer samples. Each column after the fifth gives the expression quantity of the transcript in one sample. These transcripts showed the most pronounced differences between normal and cancer samples.

B. CLINICAL DATA
Clinical data associated with $S_{normal \cup tumor}$ were obtained from the TCGA database (https://cancergenome.nih.gov). The aliquot barcodes of the clinical data in $S_{normal}$ (size 44) and $S_{tumor}$ (size 176) are presented in Table 1. Each $S_i$ contains 70 clinical reference values, of which the following were retained: the barcode and sample type of $S_{normal \cup tumor}$ (denoted by $P_{normal \cup tumor} = \{P_1, \dots, P_{220}\}$) and the tumor clinical stage of $S_{tumor}$ (denoted by $CT_{tumor} = \{CT_1, \dots, CT_{176}\}$). The barcode was used to link the clinical data with the lncRNA data, the sample type was used to select the characteristic variables of the lncRNA-disease associations, and the tumor clinical stage was used, together with the clinical data, to predict the lncRNAs with a significant impact on the prognosis of cancer patients. The clinical stage (CTNM) was determined using the TNM staging system, where T represents the tumor size, N represents lymph node metastasis, and M represents distant metastasis. The distribution of $CT_{tumor}$ (size 176) associated with $S_{tumor}$ is shown in Table 2.
The following two matrices were constructed by combining the lncRNA data and the clinical data for the prediction study in this paper.
(a) The matrix $M_{CV}$ of the characteristic variables is shown in (1). It contains 480 columns of characteristic variables, 1 categorical variable column, and 220 rows of sample data. After the selection algorithm, the $\lambda$ lncRNAs most closely related to prostate cancer were screened from the 480 characteristic variables.
(b) The matrix $M_{PP}$ of prognosis prediction is shown in (2). It contains $\lambda$ columns of characteristic variables, 1 categorical variable column, and 104 rows of sample data.
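As a rough illustration of how two such matrices could be assembled, the sketch below joins expression rows to clinical fields by barcode. The function names, the dictionary layout, and the toy barcodes and values are assumptions made for this example and do not reflect the lncRNAtor or TCGA file formats. Skipping samples without a recorded stage is likewise an assumption about why the prognosis matrix has fewer rows than there are samples.

```python
def build_m_cv(expression, sample_type):
    """Rows of M_CV: expression values followed by the sample-type label."""
    rows = []
    for barcode in sorted(expression):
        if barcode in sample_type:  # keep only samples with a clinical record
            rows.append(list(expression[barcode]) + [sample_type[barcode]])
    return rows


def build_m_pp(expression, clinical_stage, selected):
    """Rows of M_PP: selected expression columns followed by the clinical stage."""
    rows = []
    for barcode in sorted(expression):
        stage = clinical_stage.get(barcode)
        if stage is None:  # assumption: samples without a stage are skipped
            continue
        rows.append([expression[barcode][i] for i in selected] + [stage])
    return rows


# Toy data: three samples, four transcripts (hypothetical values and barcodes).
expression = {
    "S001": [0.1, 2.3, 0.0, 5.1],
    "S002": [0.4, 1.9, 0.2, 4.8],
    "S003": [3.2, 0.1, 1.5, 0.3],
}
sample_type = {"S001": "normal", "S002": "tumor", "S003": "tumor"}
clinical_stage = {"S002": "T2", "S003": "T3"}

m_cv = build_m_cv(expression, sample_type)
m_pp = build_m_pp(expression, clinical_stage, selected=[0, 3])
```

In this toy case `m_cv` keeps all three samples with their sample type as the last column, while `m_pp` keeps only the two staged tumor samples and the two selected transcript columns.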

C. CASO METHOD
The circular allelism subarea operation (abbreviated to CASO) for characteristic variable selection is described below. The 480 $Lr_i$ in $M_{CV}$ form a circular queue. According to the importance of each $Lr_i$, the $Lr_i$ are distributed evenly and clockwise on the ring in descending order.
The circular queue contains 480 nodes in descending order of importance (denoted by $S_1, \dots, S_{480}$). Here, each node $S_j$ is an $Lr_i$, and the subset of the circular queue running from $S_a$ to $S_b$ is denoted by sub[a,b]. The descending queue formed by arranging $Lr_{sub}$ in descending order of Significance($Lr_i$) is $Q_{Dec}(Lr_{sub})$.
The circular queue is shown in Figure 2. The nodes from $S_1$ to $S_{480}$ are distributed evenly and clockwise on the ring in descending order. The center symmetric points of $S_a$ and $S_b$ are $S_{a-center}$ and $S_{b-center}$. The X-axis symmetric points of $S_a$ and $S_b$ are $S_{a-X}$ and $S_{b-X}$. The Y-axis symmetric points of $S_a$ and $S_b$ are $S_{a-Y}$ and $S_{b-Y}$. In Figure 2, the red line, area and data represent the center allelism operation; the green line, area and data represent the X-axis allelism operation; and the yellow line, area and data represent the Y-axis allelism operation. The circular queue is divided into four areas (the first, second, third, and fourth quartile areas).

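Definition 1 (Center Allelism Operation). sub-center[a,b] is the center allelism area of sub[a,b], that is, the set of center symmetric points of sub[a,b]. This is the red area in Figure 2, which is located in the first quartile area and the third quartile area.
Definition 2 (X-Axis Allelism Operation). sub-X-axis[a,b] is the X-axis allelism area of sub[a,b], that is, the set of X-axis symmetric points of sub[a,b]. This is the green area in Figure 2, which is located in the first quartile area and the second quartile area.
Definition 3 (Y-Axis Allelism Operation). sub-Y-axis[a,b] is the Y-axis allelism area of sub[a,b], that is, the set of Y-axis symmetric points of sub[a,b]. This is the yellow area in Figure 2, which is located in the first quartile area and the fourth quartile area.
Additionally, the three allelism areas (sub-center[a,b], sub-X-axis[a,b], and sub-Y-axis[a,b]) constitute the circular subset required by the next algorithm.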
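The three symmetry operations can be sketched with simple index arithmetic. The sketch assumes a particular geometry that the text does not spell out: nodes are numbered clockwise starting just to the right of the top of the ring, with a half-step offset so that no node lies exactly on an axis, and the quartile areas are the four clockwise quarter arcs. Under that assumption the mirrored areas of a subset in the first quartile fall in the third, second, and fourth quartiles, in line with the quartile placement described for the three allelism areas. The function names are hypothetical.

```python
N = 480  # number of nodes on the circular queue (from the text)


def center_symmetric(j, n=N):
    """Node opposite S_j through the center of the ring (1-based index)."""
    return (j - 1 + n // 2) % n + 1


def x_axis_symmetric(j, n=N):
    """Mirror image of S_j across the horizontal axis (1-based index)."""
    return (n // 2 - j) % n + 1


def y_axis_symmetric(j, n=N):
    """Mirror image of S_j across the vertical axis (1-based index)."""
    return (n - j) % n + 1


def allelism_area(a, b, mirror, n=N):
    """Sorted node indices obtained by mirroring every node of sub[a, b]."""
    return sorted(mirror(j, n) for j in range(a, b + 1))


# Mirror the subset [S_1, S_120] (the first quartile under this geometry).
center_area = allelism_area(1, 120, center_symmetric)  # third quartile
x_area = allelism_area(1, 120, x_axis_symmetric)       # second quartile
y_area = allelism_area(1, 120, y_axis_symmetric)       # fourth quartile
```

With this layout the three mirrored areas of [S_1, S_120] are disjoint and together cover S_121 to S_480. A layout in which nodes sit exactly on the axes would shift the endpoints by one position.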

D. RANDOM FOREST
Random forest (abbreviated to RF) is an ensemble classifier constructed from multiple decision trees. In the process of building a decision tree, the variables must be ordered by importance. Since a random forest has a large number of decision trees, the importance values obtained from the individual trees can be integrated to obtain the final importance rank of the variables. Characteristic variables are then selected according to this rank, which is more stable and reliable than the rank from a single decision tree. In the RF-based selection of $M_{CV}$ characteristic variables, the RF contains $\alpha$ trees (denoted by $T = \{T_1, \dots, T_i, \dots, T_\alpha\}$), the 480 $Lr$ in $M_{CV}$ (denoted by $Lr = \{Lr_1, \dots, Lr_i, \dots, Lr_{480}\}$) are the characteristic variable set, and the 220 $P$ in $M_{CV}$ (denoted by $P = \{P_1, \dots, P_i, \dots, P_{220}\}$) are the classification variable set.
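A minimal sketch of how per-tree importance values could be integrated into one rank is given below. It simply averages the scores over trees and sorts in descending order; the averaging rule, the function name, and the toy scores are assumptions for illustration, and the sketch assumes every tree reports a score for every variable.

```python
def rank_by_mean_importance(per_tree_scores):
    """Average each variable's importance over all trees, then rank descending."""
    totals = {}
    for scores in per_tree_scores:
        for name, value in scores.items():
            totals[name] = totals.get(name, 0.0) + value
    n_trees = len(per_tree_scores)
    means = {name: total / n_trees for name, total in totals.items()}
    return sorted(means, key=lambda name: means[name], reverse=True)


# Toy importance scores from two trees (hypothetical values).
per_tree = [
    {"Lr1": 0.5, "Lr2": 0.1, "Lr3": 0.3},
    {"Lr1": 0.3, "Lr2": 0.2, "Lr3": 0.4},
]
ranking = rank_by_mean_importance(per_tree)
```

Here the two trees disagree on whether Lr1 or Lr3 matters most, and the averaged rank places Lr1 first, which is the stabilizing effect the paragraph describes.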

E. CVSgC-RF ALGORITHM
The significance computation for the characteristic variables based on RF (abbreviated to CVSgC-RF) is given in Algorithm 1. Here, OOB means out-of-bag. Records are drawn from the original data to construct the training set for each decision tree. Because this process uses sampling with replacement, some samples are never drawn; these are termed out-of-bag. On average, about 37% of the data is not selected in each round of sampling with replacement, and this OOB data is often used to validate the constructed decision tree model. If permuting the values of the characteristic variable $Lr_i$ on the OOB data has no effect on the result of the decision tree, then $Lr_i$ is deemed unimportant; if the result changes markedly, then $Lr_i$ is very important. In Algorithm 1, $RF(Lr_{sub}, \tau, \eta)$ is a random forest of $\tau$ decision trees trained on $Lr_{sub}$, where $\tau$ is the number of decision trees in the forest and $\eta$ is the number of random characteristic variables considered at each partition, $\eta = \lfloor \sqrt{RF_{cv-number}(Lr_{sub})} + 0.5 \rfloor$.
VOLUME 8, 2020
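The quantities in this paragraph can be checked with a short sketch. It computes $\eta$ as the square root of the number of variables plus 0.5, rounded down, which is the reading consistent with the values 15 and 11 reported later for 240 and 120 variables. It also computes the expected out-of-bag share and a permutation importance on OOB rows. The stub predictor stands in for a trained decision tree, and all names and toy data are assumptions; this is not Algorithm 1 itself.

```python
import math
import random


def eta(n_variables):
    """Variables tried per split: floor(sqrt(n) + 0.5)."""
    return math.floor(math.sqrt(n_variables) + 0.5)


def oob_fraction(n_samples):
    """Expected share of samples left out of one bootstrap draw."""
    return (1 - 1 / n_samples) ** n_samples


def accuracy(predict, rows, labels):
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)


def permutation_importance(predict, rows, labels, col, rng):
    """Drop in OOB accuracy after shuffling one column across the OOB rows."""
    baseline = accuracy(predict, rows, labels)
    shuffled = [row[col] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(rows, shuffled)]
    return baseline - accuracy(predict, permuted, labels)


def stub_tree(row):
    """Stand-in for a trained decision tree: it only looks at column 0."""
    return row[0]


# Toy OOB data: column 0 decides the label, column 1 is noise (hypothetical).
oob_rows = [[i % 2, (i * 7) % 5] for i in range(40)]
oob_labels = [i % 2 for i in range(40)]

rng = random.Random(0)
importance_signal = permutation_importance(stub_tree, oob_rows, oob_labels, 0, rng)
importance_noise = permutation_importance(stub_tree, oob_rows, oob_labels, 1, rng)
```

Shuffling the column that the stub never reads leaves accuracy unchanged, so its importance is exactly zero, whereas shuffling the decisive column lowers accuracy. For 220 samples, `oob_fraction(220)` is close to 0.37, matching the figure quoted in the text.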

F. DISCUSSION OF CVSe
A good characteristic variable selection (abbreviated to CVSe) algorithm must possess both global selectivity and local stability, and the two constrain each other. The larger the selection range, the more complex the relations among the characteristic variables become, and bidirectional influence relations coexist. Local stability is needed to adjust for this, but it is limited by the selection range, which leads to insufficient coverage of the correlations between characteristic variables. At that point, an increase in global selectivity is required. Balancing global selectivity and local stability is therefore very important and affects the performance of CVSe.
Primary Stage: The 480 $Lr_i$ are ranked in descending order of importance; a preset number of top-ranked $Lr_i$ enter the candidate area (the top 120 in the setting of Figure 3), and the remaining $Lr_i$ enter the observation area. Since the $Lr_i$ entering the candidate area are selected from the global range of 480, global selectivity is obtained. However, the larger the scope, the more complex the relationships among the $Lr_i$ become, and bidirectional influence relationships exist. In the following stable stage, local stability is used to address this problem. The detailed process is described in steps 1-2 of Algorithm 2.
Stable Stage: CASO is performed for each $Lr_i$ that entered the candidate area during the primary stage. The detailed execution process is shown in Figure 3. To facilitate the execution of the algorithm, the endpoints of the three allelism areas are adjusted to $S_{b-center} - 1$, $S_{b-X} - 1$, and $S_{b-Y} - 1$. The center allelism area, X-axis allelism area and Y-axis allelism area are located in the observation area. The candidate area and the three allelism areas of the observation area are rearranged in descending order of importance, using the CVSgC-RF algorithm discussed previously. Each $Lr_i$ that is screened from the candidate area into the observation area is added to the penalty set. The top $Lr_i$ that are screened from the observation area into the candidate area are added to the shock set. Each $Lr_i$ in $Set_{penalty}$ has a potential risk of poor stability. Each $Lr_i$ in $Set_{shock}$ has strong reactivation activity. The detailed process is described in steps 3-24 of Algorithm 2.
Run-Off Stage: $Set_{penalty}$ and $Set_{shock}$ are used to update the $Lr_i$ in the candidate area. The $Lr_i$ in $Set_{penalty}$ are removed from the candidate area, and the $Lr_i$ in $Set_{shock}$ are added to the candidate area. The updated candidate area is the result of characteristic variable selection (denoted by select). The number of characteristic variables in select is $\lambda$. The detailed process is described in steps 25-27 of Algorithm 2.
In Figure 3, the parameter that is set to 60 yields a candidate area of $[S_1, S_{120}]$ and an observation area of $[S_{121}, S_{480}]$. The subarea endpoints $a$ and $b$ are set to 1 and 120, and the remaining parameter is set to 120.

H. CSPA-PL
The CVSe-CS-CF algorithm selects the $\lambda$ lncRNAs most closely related to prostate cancer. Next, these $\lambda$ lncRNAs are correlated with the prognosis data of the tumor clinical stage, and a logistic regression model is adopted to build a clinical stage prediction algorithm for cancer-associated lncRNA (abbreviated to CSPA-PL). According to the operations performed on the selected set (denoted by select), CSPA-PL is divided into an inspection stage and an optimization stage. CSPA-PL is described in Algorithm 3.
Inspection Stage (Steps 1-11 of Algorithm 3): Before being applied to the clinical stage prediction model, the $\lambda$ $Lr_i$ most closely associated with prostate cancer need to be inspected. In the inspection stage, the selected set is divided into groups indexed over $[1,n]$, denoted $Z_{pre}[1,n]$. The set of $Lr_i$ with the smallest AIC value in $Z_{pre}[1,n]$ is put into $\{\Upsilon_{pre}\}$. Steps 5-10 put the set of $Lr_i$ with the smallest AIC values and significance less than 0.01 in $Z_{pre}[1,n]$ into $\{\Upsilon_{rear}\}$. The Cartesian product of $\{\Upsilon_{pre}\}$ and $\{\Upsilon_{rear}\}$ is then taken to form a buffer pool (Buffer-pool).
Optimization Stage (Steps 12-17 of Algorithm 3): A logistic regression model is constructed for each element in Buffer-pool, and the model with the highest accuracy rate is selected as the optimal prediction model (denoted by optimal).
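The two stages can be sketched in a few lines under stated assumptions. The AIC is computed with the standard formula, 2k minus twice the log-likelihood, for a model that outputs class probabilities. The buffer pool is formed as the Cartesian product of the candidate sets, with each pair merged into one variable set. The function names, the variable labels, and the accuracy values are hypothetical, and the sketch does not reproduce Algorithm 3 or fit an actual logistic regression.

```python
import math
from itertools import product


def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of binary labels under predicted probabilities."""
    eps = 1e-12  # guards against log(0)
    return sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, p_pred)
    )


def aic(y_true, p_pred, n_params):
    """Akaike information criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood(y_true, p_pred)


def build_buffer_pool(pre_sets, rear_sets):
    """Cartesian product of the two families; each pair is merged into one set."""
    return [frozenset(pre) | frozenset(rear) for pre, rear in product(pre_sets, rear_sets)]


def pick_optimal(buffer_pool, accuracy_of):
    """Return the pool element whose model has the highest accuracy."""
    return max(buffer_pool, key=accuracy_of)


# Toy example: one 'pre' set and three 'rear' candidates (hypothetical labels).
pool = build_buffer_pool([{"A", "B"}], [{"r1"}, {"r2"}, {"r3"}])
toy_accuracy = {pool[0]: 0.81, pool[1]: 0.88, pool[2]: 0.84}  # made-up scores
best = pick_optimal(pool, toy_accuracy.get)
example_aic = aic([1, 0], [0.5, 0.5], n_params=1)
```

In practice `accuracy_of` would fit and evaluate a logistic regression on each variable set; a lookup table is used here only to keep the sketch self-contained.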

A. PERFORMANCE EVALUATION OF CVSgC-RF
When constructing the RF in the CVSgC-RF algorithm, the number of decision trees $\tau$ in the random forest has a large impact on the performance and efficiency of the algorithm. To determine the optimal value $\tau_{optimal}$, three groups of experiments were carried out under the premise that $\tau \propto RF_{cv-number}(Lr_{sub})$. Each group involved 10 randomized experiments for each value of $\tau$, and the results were compared using the lost count and stability. The calculation of the lost count is shown in (3). The lost count of the j-th $\tau$ is denoted $Lost(\tau_j)$, where $j$ is an integer between 1 and $h$ (denoted by $Z[1,h]$). $T^{lost_i}_{\tau_j}$ is the lost count of the i-th randomized experiment for $\tau_j$.
Stability was investigated from two aspects: internal stability and external stability.
The calculation of internal stability is shown in (4). The internal stability of $\tau_j$ in the top $d$ ranges is denoted $Internal\text{-}stability(\tau_j \mid pre\text{-}d)$ in (4). It is computed from the union of the results for $\tau_j$ over the 10 randomized experiments in the top $d$ ranges.
The calculation of external stability is shown in (5). The external stability of $\tau_j$ in the top $d$ ranges is denoted $External\text{-}stability(\tau_j \mid pre\text{-}d)$ in (5). The union of the results for $\tau_j$ with $j \in [h', h]$ in the top $d$ ranges is denoted $\bigcup_{j=h'}^{h} Lr(\tau_j)$, where $h'$ stands for the lower index of the range, that is, the second value given alongside $h$ in each group of experiments. The occurrence count of $Lr_i$ in this union is denoted $count(Lr_i \mid \bigcup_{j=h'}^{h} Lr(\tau_j))$.
In the first group of experiments, $RF_{cv-number}(Lr_{sub})$ is set to 480, and 18 groups of data are taken in the interval [3000, 11500] ($h = 18$, lower index $h' = 9$). The experimental results are shown in Table 3 and Table 4. Table 3 shows that the lost count gradually approaches 0 from $\tau_9 = 7000$, with two fluctuations of 1 loss at $\tau_{11} = 8000$ and $\tau_{12} = 8500$, and stays at 0 from $\tau_{13} = 9000$. Table 4 shows that, from $\tau_{12} = 8500$, the internal stability of the top 60 and the top 80 both reached 100%, the top 100 reached more than 98.90%, and the top 120 reached more than 93.50%. From $\tau_{12} = 8500$, the external stability of the top 60 and the top 80 both reached 100%, the top 100 reached more than 99.40%, and the top 120 reached more than 95.75%. Figure 5 shows that both the internal stability and the external stability were consistently high from $\tau_{13}$. The left boundary ($\tau_{13} = 9000$) was moved two positions to the right. Finally, $\tau_{optimal}$ was set to $\tau_{15} = 10000$, and $\eta$ was set by the same rule, $\eta = \lfloor \sqrt{480} + 0.5 \rfloor$.
In the second group of experiments, $RF_{cv-number}(Lr_{sub})$ is set to 240, and 13 groups of data are taken in the interval [1000, 7000] ($h = 13$, lower index $h' = 6$). The experimental results are shown in Table 5 and Table 6. Table 5 shows that the lost count gradually approaches 0 from $\tau_6 = 3500$, with one fluctuation of 1 loss at $\tau_7 = 4000$, and stays at 0 from $\tau_8 = 4500$. Table 6 shows that, from $\tau_6 = 3500$, the internal stability of the top 60 and the top 80 both reached 100%, the top 100 almost reached 100% (except for a fluctuation to 99.80% at $\tau_{10} = 5500$), and the top 120 reached more than 95.33%. From $\tau_6 = 3500$, the external stability of the top 60, the top 80 and the top 100 all reached 100%, and the top 120 reached more than 95.56%. Figure 7 shows that both the internal stability and the external stability were consistently high from $\tau_{10}$ (except for a fluctuation in the top 100). The left boundary ($\tau_8 = 4500$) was moved three positions to the right. Finally, $\tau_{optimal}$ was set to $\tau_{11} = 6000$, and $\eta$ was set to $\lfloor \sqrt{240} + 0.5 \rfloor = 15$.
In the third group of experiments, $RF_{cv-number}(Lr_{sub})$ is set to 120, and 9 groups of data are taken in the interval [500, 2500] ($h = 9$, lower index $h' = 6$). The experimental results are shown in Table 7 and Table 8. Table 7 shows that the lost count stays at 0 from $\tau_6 = 1000$. Table 8 shows that, from $\tau_6 = 1000$, the internal stability … 80 both reached 100%, and the top 100 reached more than 94%. Figure 9 shows that both the internal stability and the external stability were consistently high from $\tau_8$. The left boundary ($\tau_6 = 1000$) was moved two positions to the right. Finally, $\tau_{optimal}$ was set to $\tau_8 = 2000$, and $\eta$ was set to $\lfloor \sqrt{120} + 0.5 \rfloor = 11$.
Figure 4, Figure 6 and Figure 8 show that the lost count for $\tau_j$ was high in the early stage, decreased greatly in the middle stage but was unstable, and decreased to 0 in the later stage, where it remained stable. Note: "+" in Tables 4, 6 and 8 represents 100.

B. PERFORMANCE EVALUATION OF CVSe-CS-CF
The parameter settings for the CVSe-CS-CF algorithm are shown in Table 9. The $Set_{penalty}$ obtained by the CVSe-CS-CF algorithm is denoted $Set^{i}_{penalty}$ ($i \in [1, 25]$), as shown in Table 10. The location of $Set^{i}_{penalty}$ in the circular queue is denoted $position(Set^{i}_{penalty})$. The relative position coefficient of $Set^{i}_{penalty}$, $RPC(Set^{i}_{penalty})$, is calculated via (6). The $Set_{shock}$, which contains 30 elements, is shown in Table 11.
RPC is a value between 0 and 1. The smaller the RPC value, the more important the $Lr_i$ is, which indicates that $Set^{i}_{penalty}$ is located toward the front of the circular queue. The larger the RPC value, the less important the $Lr_i$ is, which indicates that $Set^{i}_{penalty}$ is located later in the queue. Elements with a larger $RPC(Set^{i}_{penalty})$ are the ones that are punished, which means that the algorithm protects the $Lr_i$ of higher importance and therefore has better stability. According to Figure 10, 96% of the $RPC(Set^{i}_{penalty})$ values in $Set_{penalty}$ are above 0.49 (the exception being 0.44), with a mean of 0.79. This indicates that the top 60 $Lr_i$ in the circular queue are stably protected and that the algorithm has good stability. Ultimately, the $\lambda$ lncRNAs most closely related to prostate cancer are selected from the 480 lncRNAs by the CVSe-CS-CF algorithm ($\lambda = 95$). The results are shown in Table 12.
… of 21 groups ($Z_{pre}[1,21]$) were obtained. The AIC distribution of $Z_{pre}[1,21]$ is shown in Figure 11. This shows that the value of Z … (2) is obtained. The values of $i(1)$ and $i(2)$ in the above $Lr_{i(1)}$ and $Lr_{i(2)}$ are shown in Table 13. There were three alternative sets in Buffer-pool, which were Buffer-pool$_1$ … The values of $j(1)$, $j(2)$ and $j(3)$ in the above $Lr_{j(1)}$, $Lr_{j(2)}$, and $Lr_{j(3)}$ are shown in Table 14.
The accuracy comparison of the logistic regression models built from the elements of Buffer-pool shows that Buffer-pool$_2$ = $\{\{\Upsilon_{pre}\} \cup Lr_{79}\}$ has the highest accuracy rate. This gives the optimal logistic regression model for tumor clinical stage (denoted by optimal), which contains 22 $Lr_i$ (as shown in Table 15). The CSPA-PL algorithm could thus further select, from the 95 $Lr_i$ most closely related to prostate cancer, the 22 $Lr_i$ most closely related to the tumor clinical stage of prostate cancer.
To verify the generality of the work presented in this paper, we also ran experiments on the lung cancer data set from the lncRNAtor database and the TCGA database. A total of 290 samples were obtained, including 46 normal samples and 245 cancer samples. First, the importance of each characteristic variable was calculated using the CVSgC-RF algorithm. Second, the 120 lncRNAs most closely related to lung cancer were selected from the 480 candidate lncRNAs by CASO and CVSe-CS-CF. From these 120 lncRNAs, the CSPA-PL algorithm then selected the 16 lncRNAs most closely related to the tumor clinical stage of lung cancer. Finally, these 16 lncRNAs were used to construct a logistic regression prediction model.

D. PREDICTION RESULT
Three state-of-the-art methods (MlrLDAcp [26], REPTree [27], NaïveBayes [28]) were selected for comparison with CSPA-PL using 10-fold cross validation. The comparison covered three aspects: ROC area, prediction accuracy and recall rate. The ROC area results for prostate cancer are shown in Figure 12. The mean ROC area of the three compared methods for prostate cancer was 0.673, whereas the ROC area of the CSPA-PL method was 0.857, which was the largest and 1.27 times the mean of the compared methods. The ROC area results for lung cancer are shown in Figure 13. The mean ROC area of the compared methods for lung cancer was 0.666, whereas the ROC area of the CSPA-PL method was 0.842, which was the largest and 1.26 times the mean of the compared methods. Accuracy and recall rate often constrain and offset each other. Therefore, AV-PR, the mean of the prediction accuracy and the recall rate, was used in this paper. The AV-PR results are shown in Figure 14 (prostate cancer in blue, lung cancer in red). The mean AV-PR of the compared methods for prostate cancer was 0.723, and the AV-PR of CSPA-PL was the highest at 0.889, which was about 1.231 times the mean of the compared methods. The mean AV-PR of the compared methods for lung cancer was 0.784, and the AV-PR of CSPA-PL was the highest at 0.896, which was about 1.142 times the mean of the compared methods. These results indicate that CSPA-PL performed well in terms of both AV-PR and ROC area.
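The summary statistics above are simple to recompute. The sketch below defines AV-PR as the mean of accuracy and recall, as stated in the text, and recomputes three of the reported ratios from the reported scores. The function names are hypothetical, and the accuracy and recall passed to `av_pr` in the last line are illustrative values, not results from the paper.

```python
def av_pr(accuracy, recall):
    """AV-PR: the mean of prediction accuracy and recall rate."""
    return (accuracy + recall) / 2


def ratio_to_baseline(score, baseline_mean):
    """How many times larger a score is than the mean of the compared methods."""
    return score / baseline_mean


# Scores reported in the text for CSPA-PL and the mean of the compared methods.
roc_ratio_prostate = ratio_to_baseline(0.857, 0.673)
roc_ratio_lung = ratio_to_baseline(0.842, 0.666)
avpr_ratio_lung = ratio_to_baseline(0.896, 0.784)

# Illustrative input only (not taken from the paper).
example_av_pr = av_pr(0.9, 0.8)
```

Rounded to two decimals, the two ROC ratios come out as 1.27 and 1.26, and the lung cancer AV-PR ratio is close to 1.14.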

IV. CONCLUSION AND DISCUSSION
Although some methods have been applied to lncRNA-disease association prediction, clinical prognostic data were rarely involved. In this study, we constructed a clinical stage prediction algorithm for cancer-associated lncRNA (CSPA-PL), which utilized cancer clinical stage data. CSPA-PL was based on unknown human lncRNA-disease associations combined with the clinical stage data. The core modules of CSPA-PL included CASO, the CVSgC-RF algorithm, and the CVSe-CS-CF algorithm. For CASO, a learning mode was formed in which the first quartile area was a defensive area and the other quartile areas were offensive. Three symmetry operations were adopted: center allelism, X-axis allelism and Y-axis allelism. The CVSgC-RF algorithm used random-forest-based variable selection as its core to calculate the importance of the characteristic variables. This method exhibited good robustness. Experimental results showed that the method proposed in this study has good predictive performance.
The value of this model lies in the following:
(a) It provides a strong research foundation for the prediction of prognosis information for cancer patients by lncRNA-disease association.
(b) The simplified lncRNAs in the model were the closest to predicting the relationship between lncRNA and the disease, which provides a favorable research premise for subsequent studies of this association.