Integrated analysis reveals potential long non-coding RNA biomarkers and their potential biological functions for disease free survival in gastric cancer patients

Background Increasing evidences supported the association between long non-coding RNA (lncRNA) and disease free survival in gastric cancer (GC) patients. The purpose of the current study was to construct and verify a noninvasive preoperative predictive tool for disease free survival in GC patients. Methods There were 265 and 300 GC patients in model dataset and validation dataset respectively. The associations between the lncRNA biomarkers and disease free survival were evaluated by univariate and multivariate Cox regression. Results Thirteen lncRNA biomarkers (GAS5-AS1, AL109615.3, KDM7A-DT, AP000866.2, KCNJ2-AS1, LINC00656, LINC01777, AC046185.3, TTTY14, LINC01526, LINC02523, LINC00592, and C5orf66) were identified as prognostic biomarkers with disease free survival. These thirteen lncRNA biomarkers were combined to construct a prognostic signature for disease free survival. The C-indexes of the current predictive signature in model cohort were 0.849 (95% CI 0.803–0.895), 0.859 (95% CI 0.813–0.905) and 0.888 (95% CI 0.842–0.934) for 1-year, 3-year and 5-year disease free survival respectively. Based on thirteen-lncRNA prognostic signature, patients in model cohort could be stratified into high risk group and low risk group with significant different disease free survival rate (hazard ratio [HR] = 7.355, 95% confidence interval [CI] 4.378–12.356). Good reproducibility of thirteen-lncRNA prognostic signature was confirmed in an external validation cohort (GSE62254) with HR 3.919 and 95% CI 2.817–5.453. Further analysis demonstrated that the prognostic significance of thirteen-lncRNA prognostic signature was independent of other clinical characteristics. Conclusions In conclusion, a simple noninvasive prognostic signature was established for preoperative prediction of disease free survival in GC patients. This prognostic signature might predict the individual mortality risk of disease free survival without pathological information and facilitate individual treatment decision-making. Electronic supplementary material The online version of this article (10.1186/s12935-019-0846-6) contains supplementary material, which is available to authorized users.


Introduction
As a serious challenge to public health, gastric cancer (GC) is the fifth most common cancer and the third leading cause for cancer associated mortality in the world [1]. It was estimated that there were approximately 1,033,701 GC patients occurred and 782,685 GC patients died in 2018 [1]. Despite the improvements of diagnosis and treatments, the prognosis of GC patients remains undesirable [2,3]. The TNM stage system of the American Joint Committee on Cancer (AJCC) was insufficient for prognostic prediction of GC patients [4,5]. Increasing evidences demonstrated that the molecular biomarkers were helpful for improvement of prognostic prediction and early identification of GC patients with high mortality risk [6][7][8]. Thus, it is necessary to develop a valuable preoperative prognostic signature to identify GC patients with high mortality risk and improve the clinical treatment decision.
Long non-coding RNAs (lncRNAs) are RNAs that length range from 200 nucleotides to multiple kilobases but lack of protein-coding function [9]. Emerging evidences have revealed that lncRNAs could provide valuable prognostic information for GC patients [10][11][12]. Several studies have developed lncRNA-based prognostic signatures for disease free survival (DFS) in GC patients [13][14][15]. However, these lncRNA-based prognostic signatures had several limitations for preoperative prediction of disease free survival: Firstly, the calculation formulas of these lncRNA-based prognostic signatures were too complex for clinical application. Secondly, the prognostic significances of absolute scores of these previous prognostic signatures were difficult to understand and interpret for users without medical knowledge. Thirdly, these three prognostic signatures lacked external validation. Fourthly, Tian et al. constructed a lncRNA-based prognostic signature for 3-year DFS by combination of data of gene expression and pathological parameters. However, for patients with advanced GC cancer or who were unwilling to undergo surgery, the pathological parameters were unobtainable for calculation of prognostic signature and thus seriously limited the clinical application of this prognostic signature. Thus, it is necessary to develop and validate a simple noninvasive signature for preoperative prediction of prognosis in GC patients.
Nomogram was a method of displaying calculation results by percentage scale diagram and was used for prognostic prediction in different cancer patients [8,16]. Nomogram could easily attain the predictive percentage of study outcome without complex calculation. Therefore, to construct a simple prognostic signature for disease free survival in GC patients, the current study developed and validated a prognostic signature by using nomogram method. We carried out the current study and reported the results in accordance with the guidelines of Transparent Reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [17].

Protocol approval
The study datasets in the current study were obtained from The Cancer Genome Atlas (TCGA) database (https ://cance rgeno me.nih.gov/) and the Gene Expression Omnibus (GEO) database (https ://www.ncbi.nlm. nih.gov/gds/). The current study downloaded and analyzed the study datasets according to the data policies of TCGA database and GEO database. The study datasets obtained from TCGA database and GEO database were fully anonymous and therefore the ethics approval was not required.

The model dataset
The model dataset was downloaded from TCGA database (https ://tcga-data.nci.nih.gov/docs/publi catio ns/ tcga/). The gene expression values were generated by using the Illumina HiSeq 2000 RNA Sequencing platform. In the current study, the selected lncRNA IDs were defined according to GENCODE Version 29 (https ://www.genco degen es.org/human /). Finally, there were 14,449 lncRNAs obtained from 375 tumor specimens and 32 normal specimens in model dataset. The original gene expression counts were converted to 0 for low expression and 1 for high expression according to the median values of original gene expression values. The clinical information of 443 GC patients in model dataset were downloaded from cBioPortal database (http:// www.cbiop ortal .org/datas ets). The GC patients without disease free survival information were excluded from the current study (n = 120). In order to avoid or reduce the interferences of confounding factors, there were 10 GC patients excluded from the current study because disease free survival time less than 1 month (n = 10). After interaction between gene dataset and clinical dataset, there were 265 GC patients with gene expression information and survival information included in the model cohort (Fig. 1). The maximum DFS time was 122.21 months and the minimum DFS time was 1.02 month in model cohort. The study time was from January 13, 2002 to November 24, 2014. The missing data were coded as "NA" in the model dataset. The mean ± standard deviation (SD) age of GC patients in the model cohort was 64.4 ± 10.6 years. Ninety-eight (37.0%) patients out of 265 GC patients died within the follow-up period (mean ± SD: 618 ± 564 days).

The validation dataset
We explored and identified potential study datasets in GEO database according to the following criteria: (1) gene expression values detected by using the Affymetrix Human Genome U133 Plus 2.0 array; (2) DFS time and DFS status were provided in the corresponding studies; (3) patient number more than 100. Finally, we identified GSE62254 as an independent external validation dataset [18]. The gene expression information and corresponding clinical information were downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/). There were Fig. 1 The flowchart in the current study. TCGA, The Cancer Genome Atlas 300 GC patients with gene expression information and clinical information from GSE62254 (https ://www.ncbi. nlm.nih.gov/geo/query /acc.cgi?acc=GSE62 254).
The gene expression information was detected on Affymetrix Human Genome U133 Plus 2.0 chip platform. The patient selection flow chart was described in Fig. 1. The Affymetrix HG-U133 Plus 2.0 probe set IDs were mapped to the Ensembl gene IDs by using the platform background file of GPL570 (https ://www.ncbi.nlm.nih.gov/geo/query /acc.cgi?acc=GPL57 0).
Based on the lncRNA IDs defined in GENCODE Version 29 (https ://www.genco degen es.org/human /), 1978 lncRNAs with corresponding gene symbols from 300 GC patients were extracted and analyzed for further survival study.

Development and validation of the prognostic nomogram
The prognostic nomogram and the corresponding calibration plots were generated by using "rms" package of R software. Calibration plots were performed to evaluate the predictive performance of the prognostic nomogram. The predicted survival and observed survival were plotted on the x-axis and y-axis respectively. The 45-degree line represented the best predictive curve. Time-dependent ROC curves were conducted to assess the predictive performance of the prognostic nomogram by using the "pROC" package.

The decision curve analysis
The decision curve analysis (DCA) was performed to evaluate the clinical utility of the prognostic nomogram for disease free survival. The DCA is a method for evaluation and comparison of the predictive value between different prediction models [19][20][21]. The x-axis of DCA represented the percentage of threshold probability, and the y-axis represented the net benefit of the predictive model. The net benefit was calculated according to the following formula: Net benefit = (True positives/n) − (False positives/n) * (p t /(1 − p t ).

Functional enrichment analyses
To explore the potential biological functions of lncRNAs included in the prognosis signature, the co-expressed mRNA were obtained through the model dataset according to the thresholds of P value < 0.05 and |spearman correlation coefficient| > 0.5. Functional enrichment analyses of these co-expressed mRNAs were performed by using the Database for Annotation, Visualization and Integrated Discovery (DAVID, https ://david .ncifc rf.gov/) [22].

Statistical analysis
Continuous variables were displayed as mean ± standard deviation (SD). Continuous variables were compared by t test or Mann-Whitney U test as appropriate. Categorical variables were compared by using Chi squared test or Fisher's exact test as appropriate. To explore the potential associations between lncRNAs and DFS, univariate Cox regression analyses and multivariate Cox regression analyses were performed to identify the prognostic biomarkers for development of prognosis signature. The GC patients were divided into two subgroups according to the scores generated by the prognosis signature. Kaplan-Meier analysis was used to compare the difference of DFS between high risk group and low risk group. The predictive performance and clinical utility of prognostic signatures were evaluated by using the Harrell's concordance index and time-dependent receiver operating characteristic (ROC) curve. The statistical analyses in the present study were conducted by SPSS Statistics 19.0 (SPSS Inc., an IBM Company) and R software (version 3.4.5). The R packages, including "pROC", "plyr", "rms", "survival", "tim-eROC " and "glmnet ", were used as needed in the current study. P value < 0.05 was defined as statistically significance in the current study.

Study cohort
The flow chart of patient selection in the current study was showed in Fig. 1. There were 265 GC patients in the model group (Additional file 1) and 300 GC patients in the validation group (Additional file 2). There were 98 (37.0%) patients out of 265 patients died within the follow-up period in the model group, whereas there were 161 (53.7%) patients out of 300 patients died within the follow-up period in the validation group. The basic clinical characteristics of GC patients in the model group and validation group were presented in Table 1.

Development of prognostic signature
The univariate Cox proportional regression analyses were carried out to explore the potential lncRNA predictors for disease free survival in GC patients. There were 249 lncRNAs that were significantly related with DFS in the model group. According to the multivariate Cox regression analyses, there were 13 lncR-NAs identified as independent biomarkers for disease free survival in GC patients. The relevant estimated regression coefficients of these 13 prognostic lncR-NAs were showed in Table 2. Therefore, a prognostic signature was conducted according to the following formula: This nomogram could be used to predict the individual mortality risk. The value of each lncRNA was calculated according to the corresponding scale. The total score was obtained by adding the values of these thirteen lncRNA. The total score was projected to the probability of DFS in 1 year, 3 year, and 5 year respectively.

Distribution characteristics of prognostic signature score
For displaying the distribution characteristics of prognostic signature score, the violin plot (Fig. 3a), density plot (Fig. 3b), scatter plot (Fig. 4a), the interaction distribution scatter plot among DFS time, DFS status, and predictive value (Fig. 4b) were presented in Figs. 3 and 4.

Clinical utility of prognostic signature
According to the median value of prognostic signature score, GC patients in model cohort (n = 265) were stratified into high risk group (n = 133) and low risk group (n = 132). The disease free survival rate (Fig. 5a) in high risk group was significantly poorer than that in low risk group (P < 0.001). The cumulative proportion surviving at 1-year, 3-year, and 5-year were 94.3%, 81.7%, and 77.1% in low risk group, whereas it were 59.2%, 27.8% and 11.3% in high risk group respectively (all P < 0.001). The Harrell's concordance indexes (C-indexes) of prognostic signature for disease free survival in the model group were 0.849 (95% CI 0.803-0.895), 0.859 (95% CI 0.813-0.905) and 0.888 (95% CI 0.842-0.934) for 1-year disease free survival, 3-year disease free survival and 5-year disease free survival respectively (Fig. 5b). The calibration curves  for 1-year, 3-year, and 5-year disease free survival were presented in Fig. 6.

External validation of prognostic signature
We validated the clinical utility of prognostic signature by using GSE62254 dataset. The prognostic signature scores for GC patients in validation dataset were calculated according to the previous prognosis signature formula in model dataset. Then the GC patients in validation set (n = 300) were divided into low risk group (n = 150) and high risk group (n = 150). The cumulative proportion surviving at 1-year, 3-year, and 5-year were 89.3%, 70.1%, and 62.3% in low risk group, whereas it were 42.4%, 24.1% and 21.4% in high risk group respectively (all P < 0.001). Kaplan-Meier analysis (Fig. 7a) indicated that there was significant difference in term of DFS between low risk group and high risk group in validation set (P < 0.001). The C-index of prognostic signature for disease free survival in the validation group were 0.870 (95% CI 0.834-0.906), 0.822 (95% CI 0.786-0.858) and 0.816 (95% CI 0.780-0.894) for 1-year, 3-year, and 5-year disease free survival respectively (Fig. 7b). The calibration curves for 1-year, 3-year, and 5-year disease free survival were presented in Fig. 8 for validation cohort.

Independence assessment for prognostic significance of prognostic signature
We carried out multivariate Cox regression analyses to explore whether prognostic signature was independent to other clinical parameters for DFS in GC patients. The pathological diagnosis was performed according to the suggestions of American Joint Committee on Cancer (AJCC). Table 3 indicated that prognostic signature was an independent risk factor for DFS after adjustment for confounding effects of gender, age, and pathological stage in the model group. In the validation group, multivariate Cox regression analyses demonstrated that prognostic signature, age, and AJCC stage were independent risk factors for DFS.

Functional enrichment analyses
According to a threshold of P value < 0.05 and |spearman correlation coefficient| > 0.5, there were 1280 mRNA genes that significantly co-expressed with the prognostic lncRNAs in the prognostic signature. Functional enrichment analyses were carried out through the Database for Annotation, Visualization, and Integrated Discovery (DAVID) Bioinformatics Resources (https ://david .ncifc rf.gov/). The gene ontology (GO) biological process enrichment analyses and the Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathways were presented in Fig. 9. Functional enrichment analyses of the 1280 co-expressed mRNA genes demonstrated that the co-expressed mRNA genes were mainly enriched in regulation of transcription, RNA splicing, protein ubiquitination, cellular response to DNA damage stimulus, cilium assembly, cilium morphogenesis, centrosome organization, G2/M transition of mitotic cell cycle, regulation of transcription from RNA polymerase II promoter.

The decision curve analysis (DCA)
As shown in Fig. 10, the prognostic signature (red line) had the higher net benefit than the pathological stage (green line). The decision curve analyses indicated that prognostic signature could gain more benefit than either the dead-all-patients scheme or the dead-nonepatients scheme for prediction of 1-year DFS (Fig. 10a), 3-year DFS (Fig. 10b), and 5-year DFS (Fig. 10c). Clinical impact curve (Fig. 10d) depicted the prediction of risk stratification of 1000 patients by using resample bootstrap method. "Number high risk" indicated the number of patients classified as positive (high risk) by prognostic signature according to various threshold probabilities. "Number high risk with event" was the true positive patient number according to various threshold probabilities.

Ten-group risk stratification chart
To explore the predictive performance of prognostic signature for DFS, a 10-group risk stratification chart was presented in Fig. 11. For model cohort, the discriminative ability of prognostic signature for 1 year, 2 year, and 3 year DFS were showed in Fig. 11a-c. For validation cohort, the discriminative ability of prognostic signature for 1 year, 2 year, and 3 year DFS were showed in Fig. 11d-f. Figure 11 demonstrated that patients in lower risk groups had higher survival probability and patients in higher risk groups had lower survival probability.

Discussion
Accurate and reliable prognostic prediction is of critical importance for individualized treatment decisionmaking of GC patients. The current study developed and validated a thirteen-lncRNA prognosis signature for prognostic prediction of GC patients. The thirteen-lncRNA prognostic signature was proved to be helpful for individual mortality risk prediction and risk stratification of GC patients in an independent external dataset.
In the current study, the thirteen-lncRNA prognosis signature scores were calculated for prediction of DFS in GC patients in both model set and validation set. Poorer DFS were significantly related with high prognosis signature scores in both model set and validation set, demonstrating that the clinical performance of prognosis signature was stable and reliable for prognostic prediction of GC patients. Multivariate Cox regression analyses demonstrated that prognostic signature was an independent risk factor for DFS in both model set and validation set. Thus the thirteen-lncRNA prognosis signature was helpful to identify the patients with high mortality risk and improve the individualized clinical decisionmaking of GC patients. The current prognostic signature has a good prospect for clinical application. All parameters in the current prognostic signature were generated by gene detection method, indicating that this prognostic model is a noninvasive method and can be used before surgery. Through this prognostic model consisting of thirteen prognostic lncRNAs, doctors and patients can pre-estimate the risk of death in the next 5 years. This prognostic information is valuable for patients to decide whether to receive surgical treatment or not. Ten-group risk stratification chart in the current study demonstrated that patients in lower risk groups had higher survival probability and patients in higher risk groups had lower survival probability.
GAS5-AS1 was an independent prognostic factor for Hepatocellular carcinoma patients and could be considered as a potential prognostic biomarker [23].
The reduced GAS5-AS1 was significantly correlated with larger tumor, higher TNM stage, and lymph node metastasis for non-small cell lung cancer patients [24]. TTTY14 was significantly correlated with overall survival for GC patients and the prognostic value of TTTY14 was independent to other clinical features [25]. LINC00592 was a potential cancer related lncRNA in cervical cancer and might activate the cancer progression through the regulation of transcription or structural integrity [26]. LincRNA C5orf66-AS1 hypomethylation was significantly associated with overall survival in patients with the squamous cell cancer in the head and neck region [27]. The aberrant hypermethylation-mediated downregulation of C5orf66-AS1 might play an important role in gastric cardia adenocarcinoma tumorigenesis and might serve as a potential prognostic biomarker in predicting gastric cardia adenocarcinoma patients' survival [28].
The previous three prognostic signatures calculated the risk scores by using original gene expression values generated on different gene detection platforms and different standardized methods. The different gene detection platforms and standardized methods reduced the repeatability of research results and hindered the clinical application of prognostic signatures in other population.

Table 3 Univariate and multivariable Cox regression analyses for independence assessment of prognostic signature
The median of prognostic signature scores was used as the cut-off value to stratify gastric cancer patients into high risk group and low risk group Therefore the current study couldn't carry out the previous three prognostic signatures due to the different gene detection platforms and standardized methods. To improve the clinical application of the current prognostic signature in other study population, the thirteen-lncRNA prognostic signature scores in the current study were calculated based on the converted dichotomous values. This dichotomous conversion was helpful to eliminate the obstacles of different detection platforms and standardization methods. Therefore, the thirteen-lncRNA prognostic signature was more suitable for clinical prognostic prediction than the previous three prognostic signatures.

Advantages of the current study
Firstly, this prognostic nomogram can directly provide the individual mortality percentage forGC patients, which is important for improvement of individualized treatment decision-making. Secondly, the thirteen-lncRNA prognostic signature is easy to calculate and understand by users without medical knowledge and professional calculation tool. Thirdly, for patients without pathological diagnosis or unwilling to undergo surgery, the thirteen-lncRNA prognostic signature provides a simple non-invasive preoperative predictive method for prognosis of GC patients, which is of clinical significance for individualized clinical decision-making before surgery.

Limitations of the current study
As a clinical study by using study datasets downloaded from public databases, the model dataset and validation dataset did not contain detailed study information of drug regimen and other postoperative treatments, which might influence the therapeutic effect and prognosis. Additionally, the results in the current study depended on gene mining approach and lacked evidences from clinical experimental researches. Therefore, it is necessary to carry out large prospective studies to elucidate the relationship between the prognostic lncRNA biomarkers and DFS in GC patients.

Conclusion
Taken together, a simple noninvasive prognostic signature was established for preoperative prediction of disease free survival in GC patients. This prognostic signature might predict the individual mortality risk of disease free survival without pathological information and facilitate individualized treatment decision-making.