Comprehensive bioinformatics analysis reveals potential lncRNA biomarkers for overall survival in patients with hepatocellular carcinoma: an on-line individual risk calculator based on TCGA cohort

Background Accumulating evidence has demonstrated that long non-coding RNAs (lncRNAs) are correlated with the prognosis of patients with hepatocellular carcinoma. The current study aimed to develop and validate a prognostic lncRNA signature to improve the prediction of overall survival in hepatocellular carcinoma patients. Methods The study cohort involved 348 hepatocellular carcinoma patients with lncRNA expression information and overall survival information. Through a gene mining approach, the current study established a prognostic lncRNA signature (named LncRNA risk prediction score) for predicting the overall survival of hepatocellular carcinoma patients. Results The current study built a predictive nomogram based on ten prognostic lncRNA predictors through Cox regression analysis. In the model group, the Harrell’s concordance indexes of LncRNA risk prediction score were 0.811 (95% CI 0.769–0.853) for 1-year overall survival, 0.814 (95% CI 0.772–0.856) for 3-year overall survival and 0.796 (95% CI 0.754–0.838) for 5-year overall survival, respectively. In the validation cohort, the Harrell’s concordance indexes of LncRNA risk prediction score were 0.779 (95% CI 0.737–0.821), 0.828 (95% CI 0.786–0.870) and 0.796 (95% CI 0.754–0.838) for 1-year, 3-year and 5-year survival, respectively. LncRNA risk prediction score could stratify hepatocellular carcinoma patients into a low risk group and a high risk group. Further survival curve analysis demonstrated that the overall survival rate of high risk patients was significantly poorer than that of low risk patients (P < 0.001). Conclusions The current study developed and validated a prognostic signature to predict the individual mortality risk of hepatocellular carcinoma patients. LncRNA risk prediction score is helpful for identifying patients with high mortality risk and optimizing individualized treatment decisions.
The web calculator is available at the following URL: https://zhangzhiqiao2.shinyapps.io/Smart_cancer_predictive_system_HCC_3/. Electronic supplementary material The online version of this article (10.1186/s12935-019-0890-2) contains supplementary material, which is available to authorized users.


Background
Hepatocellular carcinoma (HCC), a serious public health problem, is the sixth most common malignant tumor and the second leading cause of cancer-related death [1]. Since HCC patients at an early stage usually have no obvious symptoms, most HCC patients are diagnosed at an advanced stage. Despite great advances in early diagnosis and clinical therapy, the overall survival (OS) of HCC patients remains unsatisfactory [2]. A meta-analysis of 4197 HCC patients reported that the actual 10-year survival rate after surgical resection was merely 7.2% [3]. Therefore, a reliable prognostic signature is needed to identify HCC patients with poor prognosis and subsequently optimize clinical treatment decisions.
Long non-coding RNAs (lncRNAs), a class of RNAs > 200 nucleotides in length, may play important roles in biological processes [4,5]. Several lncRNAs have been reported to be correlated with the survival of HCC patients [6,7]. Recently, several prognostic signatures based on lncRNA expression data have been built to predict the prognosis of HCC patients [8-10]. However, there were several limitations to the clinical application of these previous prognostic signatures. Firstly, these prognostic signatures provided only simple scores of overall survival rather than percentages of individual mortality risk. Secondly, the risk scores of these complicated prognostic signatures are difficult to calculate. Meanwhile, the differences between gene detection platforms and between transformation methods applied to the original gene expression values should be taken into account in the clinical application of these prognostic signatures.
Therefore, the present study aimed to build and validate a prognostic model to predict the prognosis of HCC patients using lncRNA expression data downloaded from The Cancer Genome Atlas (TCGA) database. The present study was carried out in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement [11].

Protocol approval
The present study downloaded the original study dataset from The Cancer Genome Atlas (TCGA) database. The download and analysis of the study dataset strictly adhered to the relevant data policies of TCGA database.

The gene expression dataset
The gene expression dataset was downloaded from TCGA database (January 28, 2018, https://tcga-data.nci.nih.gov/docs/publications/tcga/). The original gene expression data were generated on the Illumina HiSeq 2000 RNA Sequencing platform. The downloaded gene expression dataset involved 371 hepatocellular carcinoma samples and 50 normal samples with 60,488 original gene expression values. The lncRNAs described in the GENCODE Resource database (release 27, mapped to GRCh37, https://www.gencodegenes.org/) were selected for further study. There were 14,449 lncRNAs included in the present study for further analysis.

Differential expression analysis
LncRNAs with original expression values < 1 were filtered out from the present study. The lncRNA expression values were then standardized through the trimmed mean of M-values (TMM) method [12]. The criteria for differential gene selection were P value < 0.05 and |log2 fold change| > 2.
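The selection criteria above can be illustrated with a minimal Python sketch. It assumes a differential expression table has already been produced elsewhere; the field names (`lncrna`, `log2_fc`, `p_value`) and the toy rows are hypothetical and are not data from the study.

```python
def select_differential_lncrnas(results, p_cutoff=0.05, lfc_cutoff=2.0):
    """Return the IDs that satisfy P < p_cutoff and |log2 fold change| > lfc_cutoff."""
    selected = []
    for row in results:
        if row["p_value"] < p_cutoff and abs(row["log2_fc"]) > lfc_cutoff:
            selected.append(row["lncrna"])
    return selected


# Hypothetical toy results table (illustrative values only).
toy_results = [
    {"lncrna": "lnc_A", "log2_fc": 3.1, "p_value": 0.001},   # passes both criteria
    {"lncrna": "lnc_B", "log2_fc": -2.4, "p_value": 0.020},  # passes (down-regulated)
    {"lncrna": "lnc_C", "log2_fc": 1.2, "p_value": 0.001},   # fold change too small
    {"lncrna": "lnc_D", "log2_fc": 2.8, "p_value": 0.200},   # not significant
]

print(select_differential_lncrnas(toy_results))  # ['lnc_A', 'lnc_B']
```

Both thresholds are strict inequalities, as stated in the criteria, and the absolute value keeps up-regulated and down-regulated lncRNAs alike.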

Clinical dataset
There were 376 HCC patients in the clinical dataset from TCGA database. The study endpoint in the current study was overall survival. To avoid the effects of unrelated confounding factors, 20 HCC patients with overall survival of less than 1 month were excluded from the present study. Eight patients without lncRNA expression information were also excluded. Finally, 348 HCC patients were enrolled in the final survival analysis (Fig. 1). The study period of The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) cohort was from 2010 to 2015. The overall survival time ranged from 1.0 month to 120.7 months. Missing data were recorded as "NA" in the present study. The mean ± standard deviation of age of HCC patients was 59.5 ± 13.4 years in the model group. The mean ± standard deviation of the follow-up period was 840 ± 701 days. Of the 348 HCC patients, 130 (37.4%) died during the follow-up period.

Internal validation
We carried out an internal validation to assess the predictive performance of the present prognostic model. The validation dataset was constructed by drawing 348 HCC patients using the bootstrap resampling method, which is recommended for internal validation of prognostic models [13,14].
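As a rough illustration of the resampling step, the following Python sketch draws a bootstrap cohort of the same size as the model cohort by sampling with replacement. The patient identifiers and the seed are hypothetical; the study itself reports using R, so this is not the authors' code.

```python
import random


def bootstrap_cohort(patient_ids, seed=0):
    """Draw len(patient_ids) patients with replacement (one bootstrap sample)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = len(patient_ids)
    return [patient_ids[rng.randrange(n)] for _ in range(n)]


# Hypothetical identifiers standing in for the 348 patients of the model cohort.
model_cohort = [f"patient_{i:03d}" for i in range(348)]
validation_cohort = bootstrap_cohort(model_cohort, seed=2019)

# Same size as the model cohort, but some patients appear more than once
# and others are left out of the sample.
print(len(validation_cohort), len(set(validation_cohort)))
```

Because the draw is with replacement, the validation cohort overlaps heavily with the model cohort, which is why bootstrap validation is considered internal rather than external validation.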

Statistical analysis
Continuous variables in the present study were presented as mean ± standard deviation (SD). The t-test or Mann-Whitney U test was used to compare continuous variables as appropriate. The Chi-squared test or Fisher's exact test was used to compare categorical variables as appropriate. Time-dependent receiver operating characteristic (ROC) curves and Harrell's concordance index (C-index) were used to assess the predictive accuracy of prognostic models. The statistical analyses were carried out using SPSS Statistics 19.0 (SPSS Inc., an IBM Company) and R software (version 3.4.4). The R packages "pROC", "plyr", "rms", "survival", "timeROC" and "glmnet" were used as appropriate in the present study. P < 0.05 was defined as the criterion of statistical significance.
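To make the C-index concrete, here is a minimal Python sketch of Harrell's concordance index for right-censored data. It is an illustrative re-implementation with hypothetical toy values, not the R routine used in the study; pairs with tied survival times are simply skipped, which is one of several common conventions.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable patient pairs in which the patient who died
    earlier also has the higher risk score (ties in score count as 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a patient with an observed death
        for j in range(n):
            if times[i] < times[j]:  # patient i died before patient j's last follow-up
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: survival time (months), death indicator, risk score.
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.7, 0.4, 0.1]
print(harrell_c_index(toy_times, toy_events, toy_scores))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ranking of patients by risk.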

Study group
Three hundred and forty-eight HCC patients were eventually included in the final survival analysis. The average age of the 348 HCC patients was 59.5 ± 13.4 years and their average overall survival time was 28.0 ± 23.7 months in the current study. One hundred and thirty (37.4%) of the 348 HCC patients died within the follow-up period in the model group. The comparisons of basic characteristics between the model group (Additional file 1) and the validation cohort were presented in Table 1. There were no significant differences in basic characteristics between the model group and the validation cohort.

Differential expression analysis
The differential expression analysis between 371 cancer samples and 50 normal samples was performed using the "edgeR" package. Through this analysis, 1005 lncRNAs were identified for further survival analysis. The heat map was presented in Additional file 3: Figure S1 and the volcano map was presented in Additional file 4: Figure S2.

Construction of prognostic nomogram
Univariate Cox regression analyses were conducted to screen the potential lncRNA predictors for overall survival of HCC patients. Based on the potential lncRNA candidates identified by univariate Cox regression analyses, ten lncRNA predictors for overall survival were finally ascertained through multivariate Cox regression analysis. The relevant model information of the ten lncRNA candidates was presented in Table 2. The median lncRNA expression values were used as cut-off values to transform the original lncRNA expression values into "1" (high expression) and "0" (low expression).
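The dichotomization step, and one way a risk score could then be formed, can be sketched as follows. Several points are assumptions not stated in the text: that "high" means strictly above the median, and that the score is the sum of Cox coefficients multiplied by the 0/1 indicators. The lncRNA names, expression values and coefficients are hypothetical placeholders, not the values in Table 2.

```python
from statistics import median


def dichotomize(values):
    """Code expression as 1 (above the median, assumed 'high') or 0 (otherwise)."""
    cutoff = median(values)
    return [1 if v > cutoff else 0 for v in values]


def risk_scores(expression, coefficients):
    """Assumed linear predictor: sum of coefficient * binary indicator per patient."""
    binary = {name: dichotomize(vals) for name, vals in expression.items()}
    n_patients = len(next(iter(expression.values())))
    return [
        sum(coefficients[name] * binary[name][i] for name in coefficients)
        for i in range(n_patients)
    ]


# Hypothetical expression values for four patients and two placeholder lncRNAs.
toy_expression = {
    "lnc_1": [0.5, 2.0, 3.5, 8.0],
    "lnc_2": [4.0, 1.0, 6.0, 0.2],
}
# Hypothetical Cox coefficients (log hazard ratios).
toy_coefficients = {"lnc_1": 0.6, "lnc_2": -0.4}

print(dichotomize(toy_expression["lnc_1"]))
print(risk_scores(toy_expression, toy_coefficients))
```

Using binary indicators rather than raw expression values makes the score less sensitive to the detection platform and normalization method, at the cost of discarding information about how far a value lies from the cutoff.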

Predictive performance of LncRNA risk prediction score
Using the median value of LncRNA risk prediction score as the cutoff, the 348 patients in the model group were stratified into a low risk group (n = 174) and a high risk group (n = 174). As shown in Fig. 3a, the overall survival rate of low risk patients was significantly higher than that of high risk patients (P < 0.001). The distribution of LncRNA risk prediction score was presented in Fig. 3b. The overall survival status and overall survival time were presented in Fig. 3c. The Harrell's concordance index (C-index) of LncRNA risk prediction score was 0.761 (95% CI 0.719-0.803) for overall survival in the model group.
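The comparison of survival between two risk groups can be illustrated with a small Python sketch of the two-group log-rank test. The text does not name the test behind the reported P value, so the log-rank test is an assumption here, chosen because it is the usual companion of Kaplan-Meier curves. The toy survival data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, P value) with 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, group A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, group B
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        d_b = sum(1 for ti, e, g in data if ti == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical toy groups: survival in months, 1 = death observed.
low_times, low_events = [30, 40, 50, 60, 70], [1, 1, 1, 1, 1]
high_times, high_events = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
chi2, p = logrank_test(low_times, low_events, high_times, high_events)
print(round(chi2, 2), p < 0.01)
```

The function would fail with a division by zero if no event time has patients at risk in both groups, so a production version would need to guard against a zero variance.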

Internal validation of LncRNA risk prediction score
An internal validation cohort (n = 348) was drawn by random sampling with replacement from the model cohort (n = 348). The LncRNA risk prediction scores for patients in the validation cohort were calculated with the same formula as in the model cohort. The 348 HCC patients in the validation cohort were then stratified into a low risk group (n = 174) and a high risk group (n = 174) using the cutoff value from the model cohort. The survival curve analysis (Fig. 5a) indicated that the overall survival rate in the high risk group was significantly poorer than that in the low risk group (P < 0.001). The distribution of LncRNA risk prediction score was presented in Fig. 5b. The survival status and survival time were presented in Fig. 5c. The C-index of LncRNA risk prediction score was 0.745 (95% CI 0.703-0.787) for OS in the validation cohort.

Independence assessment of LncRNA risk prediction score
Multivariate Cox regression analyses were carried out to explore the independence of LncRNA risk prediction score for OS of HCC patients. The pathological diagnosis was carried out in accordance with the suggestions of the American Joint Committee on Cancer (AJCC). After adjusting for the confounding effects of pathological parameters, gender and age, multivariate Cox regression analyses indicated that LncRNA risk prediction score was an independent prognostic factor for OS of HCC patients (Table 3).

Survival curve analysis of ten lncRNAs in LncRNA risk prediction score
The survival curve analysis of the lncRNAs in LncRNA risk prediction score was presented in Fig. 7. As shown in Fig. 7, OS differed significantly according to the expression of each of the ten lncRNAs in LncRNA risk prediction score (P < 0.001).

Pathological stage subgroup analysis
Pathological stage is an important prognostic factor for overall survival of HCC patients. As shown in Fig. 8, OS in the high risk group was significantly poorer than that in the low risk group across pathological stages, indicating that the predictive performance of LncRNA risk prediction score for OS was stable and reliable in different pathological stage subgroups.

Functional enrichment analysis
According to the criteria of P value < 0.05 and |Spearman correlation coefficient| > 0.7, 162 mRNA genes were significantly co-expressed with the prognostic lncRNAs included in LncRNA risk prediction score. Functional enrichment analysis was performed through the Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david.ncifcrf.gov/). Gene ontology (GO) biological process enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) signaling pathway analysis were presented in Fig. 9. Functional enrichment analysis indicated that the co-expressed genes were mainly enriched in mitotic nuclear division, cell division, DNA replication, DNA repair, regulation of cell cycle, DNA-dependent ATPase activity, and ATPase activity.
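The co-expression criterion can be sketched in Python as below. The sketch computes Spearman's correlation as the Pearson correlation of average ranks and applies only the |coefficient| > 0.7 threshold; the P value filter is omitted for brevity. Gene names and expression profiles are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def coexpressed_genes(lncrna_profile, mrna_profiles, threshold=0.7):
    """Return mRNA names whose |Spearman rho| with the lncRNA exceeds threshold."""
    return [
        name for name, profile in mrna_profiles.items()
        if abs(spearman(lncrna_profile, profile)) > threshold
    ]


# Hypothetical expression profiles across six samples.
toy_lncrna = [1, 2, 3, 4, 5, 6]
toy_mrnas = {
    "mrna_X": [2, 3, 5, 7, 11, 13],  # monotonically increasing with the lncRNA
    "mrna_Y": [6, 5, 4, 3, 2, 1],    # monotonically decreasing
    "mrna_Z": [4, 1, 6, 2, 5, 3],    # weakly related
}
print(coexpressed_genes(toy_lncrna, toy_mrnas))
```

Because the threshold is applied to the absolute value, both positively and negatively correlated genes are retained.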

Ten-group risk stratification chart
To explore the predictive performance of LncRNA risk prediction score for OS, a 10-group risk stratification chart was presented in Fig. 10 for the model cohort. The discriminative ability of LncRNA risk prediction score for 1-year, 2-year, and 3-year OS was shown in Fig. 10a-c.
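One way to form ten risk groups is sketched below in Python. The text does not state how the ten groups were defined, so splitting patients into ten roughly equal-sized groups by rank of risk score is an assumption made purely for illustration, and the scores are randomly generated placeholders.

```python
import random
from collections import Counter


def assign_risk_groups(scores, n_groups=10):
    """Assign each patient a group from 1 (lowest scores) to n_groups (highest),
    with group sizes as equal as possible (assumed rank-based definition)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    groups = [0] * len(scores)
    for rank, idx in enumerate(order):
        groups[idx] = rank * n_groups // len(scores) + 1
    return groups


# Hypothetical risk scores for a cohort of 348 patients.
rng = random.Random(42)
toy_scores = [rng.gauss(0.0, 1.0) for _ in range(348)]
toy_groups = assign_risk_groups(toy_scores)

# Number of patients per group; each group holds 34 or 35 patients.
print(sorted(Counter(toy_groups).items()))
```

Observed survival at a given time point could then be summarized within each group to check whether mortality rises steadily from group 1 to group 10.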

Discussion
The current study developed and validated a prognostic model named LncRNA risk prediction score, which is helpful for predicting individual mortality risk and identifying patients with high mortality risk. LncRNA risk prediction score could help optimize individualized clinical decisions for HCC patients with high mortality risk.
LncRNA risk prediction score, as a prognostic nomogram, provides a noninvasive preoperative method for predicting the overall survival of HCC patients. Nomograms have been used as predictive tools for prognosis in different cancers [15,16]. The present study constructed LncRNA risk prediction score for OS based on the following considerations. First, clinical practice urgently needs a preoperative method to forecast the overall survival of HCC patients before surgery. HCC patients identified by prognostic models as having high mortality risk would be more willing to accept active treatment such as surgical treatment. Second, for HCC patients without pathological diagnosis information, LncRNA risk prediction score could provide an alternative noninvasive predictive method for overall survival.
The previous prognostic models [8-10] were not evaluated in the current study for the following reasons. First, these prognostic models were developed based on lncRNA expression values generated on different gene detection platforms. Due to the differences between gene detection platforms, these prognostic models could not be calculated directly in the current study. Second, the previous studies standardized the original lncRNA expression counts using different standardization methods. The standardization methods in these previous studies reduced the repeatability and clinical applicability of these prognostic models.
The current study has the following advantages in predicting the overall survival of HCC patients. First, LncRNA risk prediction score, as a simple predictive nomogram, is easy for patients to calculate and understand. Second, the individual mortality risk is presented as a percentage, which makes the clinical significance of the predictive result easy to interpret for patients without medical knowledge. Third, since this prognostic nomogram does not contain pathological parameters, LncRNA risk prediction score is a noninvasive predictive method and is therefore more suitable for preoperative prediction of OS.
There were several shortcomings in the current study. First, LncRNA risk prediction score has not been validated in an external dataset. Therefore, it is necessary to validate the predictive performance of LncRNA risk prediction score in different external study populations. Second, the sample size of the current study was relatively small, and large prospective multicenter