Identification of a competing endogenous RNA network associated with prognosis of pancreatic adenocarcinoma

Background Emerging evidence suggests that competing endogenous RNAs plays a crucial role in the development and progress of pancreatic adenocarcinoma (PAAD). The objective was to identify a new lncRNA-miRNA-mRNA network as prognostic markers, and develop and validate a multi-mRNAs-based classifier for predicting overall survival (OS) in PAAD. Methods Data on pancreatic RNA expression and clinical information of 445 PAAD patients and 328 normal subjects were downloaded from The Cancer Genome Atlas (TCGA), International Cancer Genome Consortium (ICGC) and Genotype-Tissue Expression (GTEx). The weighted correlation network analysis (WGCNA) was used to analyze long non-coding RNA (lncRNA) and mRNA, clustering genes with similar expression patterns. MiRcode was used to predict the sponge microRNAs (miRNAs) corresponding to lncRNAs. The downstream targeted mRNAs of miRNAs were identified by starBase, miRDB, miRTarBase and Targetscan. A multi-mRNAs-based classifier was develop using least absolute shrinkage and selection operator method (LASSO) COX regression model, which was tested in an independent validation cohort. Results A lncRNA-miRNA-mRNA co-expression network which consisted of 60 lncRNAs, 3 miRNAs and 3 mRNAs associated with the prognosis of patients with PAAD was established. In addition, we constructed a 14-mRNAs-based classifier based on a training cohort composed of 178 PAAD patients, of which the area under receiver operating characteristic (AUC) in predicting 1-year, 3-year, and 5-year OS was 0.719, 0.806 and 0.794, respectively. The classifier also shown good prediction function in independent verification cohorts, with the AUC of 0.604, 0.639 and 0.607, respectively. Conclusions A novel competitive endogenous RNA (ceRNA) network associated with progression of PAAD could be used as a reference for future molecular biology research.


Introduction
Pancreatic adenocarcinoma (PAAD), a common digestive tract tumor, is the eighth leading cause of cancer death in the world. The 5-year survival rate for pancreatic cancer is only 5%, due to its rapid metastasis, high degree of malignancy and difficulty in early diagnosis [1]. Many studies have shown that chronic pancreatitis, gallstones, alcohol, and smoking were the most common risk factors for PAAD [2][3][4][5]. Serum carbohydrate antigen (CA  is the most common tumor marker of PAAD. But due to its low sensitivity and accuracy in clinical practice, many patients cannot be diagnosed in the early stage of cancers [6,7]. Although much progress has been made in early diagnosis, including new diagnostic markers, such as: sTAR, miR-25, etc., but the overall survival rate for PAAD has not improved in recent years [8,9]. Therefore, it is urgent to find more biomarkers as early diagnostic indicators for PAAD, guiding the treatment of the disease. During the occurrence and development of diseases, multiple interacting RNAs could be changed. So figuring out the mechanism of RNA interactions would help us to understand pathogenesis of diseases [10]. In recent years, many papers have studied and reported the interaction between non-coding RNA and mRNA based on the ceRNA hypothesis. In general, a variety of RNAs are involved in ceRNA regulatory networks, including lncRNA, miRNA, circular RNA and mRNA. MiRNA can bind to the 3′ untranslated region of mRNA in order to inhibit the translation of mRNA [11]. The ceRNA hypothesis suggests that lncRNA competitively binds with microRNAs through microRNA response elements (MREs) and inhibits the down-regulation of mRNA translation by miRNAs [12]. Moreover, the role of lncRNA-miRNA-mRNA ceRNA regulatory networks has been revealed in different diseases, such as chronic myeloid leukemia, cardiovascular diseases, diabetic cataract, and myocardial infarction [13][14][15][16], which might impact the prognosis of the diseases. However, on the basis of high-throughput sequencing and large samples, only a few literatures have reported the role of ceRNA in PAAD.
In the current study, therefore, we aimed to identify a new lncRNA-miRNA-mRNA network as prognostic markers for PAAD. The RNA-Seq data of 178 and 267 cases of PAAD tissues were obtained from TCGA and ICGC, and 328 cases of normal pancreas tissues from GTEx, respectively. Then, mRNAs and lncRNAs between the normal samples and PAAD patients were applied to WGCNA to enrich modules which were most related to PAAD. The target genes of miRNAs were predicted by the staBase, miRDB, miRTarBase and Targetscan. A multi-mRNAs-based classifier and a lncRNA-miRNA-mRNA ceRNA network containing 60 lncRNAs, 3 miR-NAs and 3 mRNAs influencing the prognosis of PAAD were developed (Fig. 1).

Different genes expression between the normal samples and PAAD tissues
We collected RNA-seq data of pancreatic tissue from 178 PAAD patients and 328 normal samples from TCGA and GTEx. EdgeR was used to normalized the gene read counts to the trimmed mean of M values (TMM). As shown in volcano map, with − log10 (false discovery rate, FDR) as the y-coordinate and log2 (fold change, FC) as the x-coordinate, 822 mRNAs were up regulated and 2362 mRNAs were down regulated (Fig. 2a). We applied these up-regulated mRNAs with Gene Ontology (GO) to investigate their potential function. In biological processes (BP), these up-regulated mRNAs were enriched in the muscle system process, divalent inorganic cation homeostasis, cellular metal ion homeostasis, cellular divalent inorganic cation homeostasis (Fig. 2c). As shown in Fig. 2d, e, the cellular component (CC) and molecular function (MF) analysis results also showed specifically up-regulated mRNAs enrichment. Gene symbols of up-regulated mRNAs and their interaction network in BP were shown in Fig. 2b. To investigate the role of these genes in various biological pathways, they were analyzed using Kyoto Encyclopedia of Genes and Genomes (KEGG)-Gene Set Enrichment Analysis (GSEA). The results showed that up-regulated genes were enriched in glycine, serine and threonine metabolism and starch and sucrose metabolism, while down-regulated genes were enriched in ascorbate and aldarate metabolism, pentose and glucuronate interconversions, retinol metabolism and thiamine metabolism (Fig. 2f, g).

MRNAs modules
As shown in Fig. 3a, we used WGCNA and set the thresholds to soft power 12-30 and module size cut-25. Therefore, we divided the 3184 differentially expressed mRNAs into different modules according to the expression pattern, and a total of 13 co-expressed gene modules were obtained (Fig. 3b). Figure 3c showed the corresponding heatmap plot of topological overlap matrix (TOM). Among the 13 gene co-expression modules, green Keywords: Pancreatic adenocarcinoma, Competing endogenous RNA network, Time-dependent receiver operating characteristic module was the most relevant to the clinical characteristics of PAAD, containing 981 genes (Fig. 3c). Then we performed GO-GSEA analysis of the genes in this gene module (Fig. 3d, e), indicating that the genes were significantly related to extracellular stimulus, nutrient levels and peptide metabolic process. As shown in Fig. 3f, KEGG analysis showed that genes of green module were enriched in pancreatic secretion.

LnRNAs modules
The volcano map of lncRNA showed that 427 lncRNA were up-regulated and 2261lncRNA were down-regulated (Fig. 4a). WGCNA was performed to analyze the co-expression of the up-regulated lncRNA, and a total of 12 modules were obtained (Fig. 4b). Among them, grey module which contains 423 genes, has the strongest correlation with PAAD patients (Fig. 4c). Then we used miRcode to predict the sponge miRNAs corresponding to the 423 genes, and overlapped with the 400 highest expressed miRNAs in the patient miRNA-seq downloaded from TCGA. There were overlapped 325 miRNAs. Through starBase, miRDB, miRTarBase and Targetscan, 4122 downstream targeted mRNAs of the 325 overlapping miRNAs were obtained. Taken together, we obtained four sets of mRNAs, including: 822 significantly up-regulated mRNAs and 2362 significantly down-regulated mRNAs by edgeR analysis, 981 mRNAs in the WGCNA-green module, and 4122 predicted target mRNAs. After overlapping the four groups of mRNAs, we got 98 up-regulated mRNAs and 63 down-regulated mRNAs (Fig. 4d, e). Expressions of 161 mRNAs in 178 PAAD patients and 328 normal samples were displayed in heatmap (Fig. 4f ).

Multi-mRNAs-based classifier
In order to construct a multi-mRNAs-based classifier for predicting OS in PAAD, 178 PAAD patients from TCGA were included in the training cohort, and 267 PAAD patients from ICGC were included in the validation cohort. LASSO COX regression method was used to identifying the mRNAs for predicting OS in the training cohort, and finally 14 mRNAs were selected (Fig. 5a, b). As shown in Fig. 5c, ADAM9, ARHGAP42, DDX60, EFNB2, ERAP2, GMNN, KYNU, OAS1 and SERTAD4 were up-regulated while DMD, DTNA, ING5, MTCP1 and TRIM52 were down-regulated in PAAD (Fig. 5d). The correlation between these 14 genes was shown in Fig in the training cohort and validation cohort were stratified into high and low risk groups according to the 14-mRANs-based classifier (Fig. 6a, c). In addition, we used KEGG-GSEA to analyze gene enrichment in the high-risk group of the training cohort, the results show that the main enrichment of genes associated with tumor pathway, such as adheres junction, P53 signaling pathway, pancreatic cancer, TGF-β signaling pathway and so on (Additional file 1: Figure S1).
According to the survival curve, patients in the lowrisk group have a more satisfactory OS than the high-risk group (P < 0.0001, Fig. 6b, d). In the training cohort, there was no significant difference in the distribution of clinical data between the low-risk group and the high-risk  (Table 1). Time dependent receiver operating characteristic (tdROC) was used to analyze the accuracy of the classifier in predicting 1-year, 3-year, and 5-year OS in training cohort and validation cohort. The AUC in the training volume was 0.719, 0.806, and 0.794, and the AUC in the validatuin cohort was 0.604, 0.639, and 0.607 (Fig. 6e, f ) (Table 2).

Multi-lncRNAs-mRNAs-based classifier
To assess whether the classifier composed of multiple type RNA has better prediction capability, we tried to use lncRNA and mRNA to establish a multi-RNA type classifier. The LASSO COX regression method was applied to 60 lncRNAs in the ceRNA network and 14 mRNAs in the mRNAs-based classifier, and finally screened out 16 RNAs. Then a 16-genes-based classifier was constructed (Additional file 2: Figure S2A Figure S2C). In the time-dependent ROC curve, the classifier could effectively predict the 1-year, 3-year and 5-year survival rates of patients, with the AUC of 0.778, 0.857 and 0.819, respectively (Additional file 2: Figure S2D). Calculated by the TIMEROC software package and the survival software package, the differences in AUC of 1 year, 3 years, and 5 years of the two classifiers were not statistically significant, and both the P-value and adjusted P-value were greater than 0.05 (Table 3).

Patients data
We first downloaded RNA-seq, miRNA-seq and clinical information data of pancreatic tissue of PAAD patients from TCGA data repository (up to September 6, 2019, https ://porta l.gdc.cance r.gov/), which are publicly available resources. After excluding samples without sufficient data, we obtained a total of 178 tumors samples as a training cohort. We also obtained RNA-seq data of 267 PAAD patients from ICGC data base (up to September 6, 2019, https ://icgc.org/) as a validation cohort. RNAseq data of 328 normal subjects were downloaded from GTEx database (up to September 6, 2019, https ://www. gtexp ortal .org/home/index .html). The genes were reannotated by the rtracklayer package of The R software (Version3.6.1). The gene annotation file "Homo_sapiens. GRCh38.91.CRH.gtf " was downloaded from the Ensembl Genomes website (http://asia.ensem bl.org/index .html).

Identification of differentially expressed genes
To distinguish differentially expressed genes in mRNA and lncRNA data sets, the edgR package was performed through R software version 3.6.1. All q values use FDR to correct the statistical significance of the multiple test.
In the result, if the gene of|log2FC| > 2 and FDR < 0.05, the gene will be retained for the following analysis. The ggplot2 package in R platform was used to generate the volcano plot, so as to more intuitively show the difference in mRNA and lncRNA expressions.

Functional enrichment analysis
We used GO to analyze the enrichment of differentially expressed mRNA from biological process (BP), cellular component (CC) and molecular function (MF) [17]. KEGG-GSEA was used to look for gene enrichment in metabolic pathways, and the significance level was P < 0.05 [18].

Weighted gene co-expression network analysis (WGCNA)
WGCNA, a system biology method, can identify highly synergistic gene sets and can analyze the genes most relevant to disease based on the interlinkage of gene sets and the association between gene sets and phenotypes. In this paper, WGCNA was used to cluster genes with similar expression patterns to obtain the modules most relevant to the clinical phenotype of PAAD patients.

Construction of the multi-RNAs-based classifier
In order to select the best lncRNA and mRNA from the high-dimensional data to construct the classifier, we chose LASSO COX regression method. LASSO, a method of filtering variables, sets a penalty function to make certain coefficients zero, and some coefficients are also compressed. Thus, achieving the purpose of shrinking subsets [19]. The RNAs related to OS were identified by LASSO COX regression model. The regression coefficients (β) are used to form the multi-RNAs-based classifier = ∑ EXP(RNA) * β. After the risk scores of all patients were calculated by LASSO cox, the patients were divided into high-risk group and low-risk group with the median as the critical value. Time dependent receiver operating characteristic and the area under receiver operating characteristic curve were used to evaluate the predictive ability of the model, both of which were realized by the RMS package.

Discussion
Among pancreatic tumors, PAAD has the highest incidence, accounting for about 85% of all cases, while other subtypes such as pancreatic neuroendocrine tumors account for less than 5% [20]. Because PAAD patient do not show special clinical symptoms at an early stage and cannot be detected and treated early, the 5-year survival rate is the lowest among all pancreatic tumor types [21]. It is of great importance to identify the potential molecular markers and therapeutic targets for PAAD. Therefore, in this study, we conducted a bioinformatics analysis of the data downloaded from the GCTA, ICGC and GTEx databases, and finally established a specific ceRNA network and a 14-mRNAs-based classifier for predicting the OS of PAAD patients. A large number of studies have shown that non-coding RNAs, such as lncRNAs and miRNAs, played an important regulatory role in promoting or inhibiting the occurrence and development of cancer [22,23]. In 2011, the ceRNA hypothesis proposed by Salmena et al. showed that lncRNA, miRNA and mRNA formed a regulatory network [12]. According to the ceRNA hypothesis, many researchers have demonstrated that lncRNA, miRNA and mRNA interact with each other as corresponding ceRNA networks during disease processes. Moreover, Wang et al. constructed a pancreatic cancer mRNA-miRNA-lncRNA sub-network by mining the GSE16515 and GSE15471 datasets, filling the gap in this area [24]. However, there were still some shortcomings in his research: First, there were only 72 PAAD and 52 normal samples included in the study. Second, only two databases were used to predict miRNA and lncRNA upstream of the mRNA. In our study, in order to make the study more credible, we used PAAD patient data from two databases, TCGA and ICGC, to establish the training cohort and the validation (See figure on next page.) Fig. 6 Multi-mRNAs-based classifier. a, c PAAD patients in the training cohort and the validation cohort were classified into predicted low and high-risk groups according to the multi-mRNAs-based classifier. b, d Kaplan-Meier survival analysis of the multi-mRNAs-based classifier was performed. e, g Time-dependent ROC curve and calibration curves of the multi-mRNAs-based classifier in the training cohort. f Time-dependent ROC curve of the multi-mRNAs-based classifier in the validation cohort cohort, which respectively contained 178 and 267 samples. Moreover, data from four databases was used to predict target genes of miRNA. In addition, LASSO is a frequently used statistical method for multicollinearity problems. We used LASSO COX regression models for the screened RNAs and obtained the more refined prediction model through compression estimation to predict the OS of PAAD. In this article, we have established two different classifiers to predict the OS of patients with PAAD. Clinical RNA data from tumor tissues of PAAD patients could be combined with our established classifier to predict the prognosis of patients. Therefore, clinicians could use the predicted results as reliable reference when formulating treatment strategies for patients. The classifier based on 14 mRNAs had a high accuracy in both the training cohort and validation cohort. Furthermore, starting with these 14 mRNAs, we obtained 60 lncRNAs, 3 miRNAs and 3 mRNAs, thus constructing a specific ceRNA network. After that, we screened lncRNA in the ceRNA network and 14 mRNA of the original classifier to build a new 8-lncRNAs-8-mRNAs-based classifier. Comparing the AUC of the two classifiers, the predictive function of the two classifiers was similar, which proved that the mRNA signature alone was robust enough to discriminate patients' prognosis without the integration of non-coding RNA. In addition, our GSEA analysis of the high-risk group showed that genes were mainly enriched in p53 signaling pathway, TGF-β signaling pathway and pancreatic tumors. In the study of Bailey P et al., the gene network of the four subtypes of pancreatic tumors all involved in TGF-β signaling pathway, among which the squamous subtype with significant P53 gene mutation has the worst prognosis [25]. This result was consistent with the low overall survival rate of the high-risk group, which also verified the accuracy of the 14-mRNAs-based classifier in predicting the OS of PAAD patients.
In recent years, some genes included in our ceRNA regulatory network have been reported in literature to be associated with regulating the development of tumors. ADAM9 is highly expressed in pancreatic ductal adenocarcinoma and is closely related to vascular invasion of cancer cells [26]. Zhu et al. reported that EFNB2 promoted the differentiation, migration and invasion of cancer cells in pancreatic ductal adenocarcinoma [27]. FEN1, a direct target of miR-140-5p, is related to the epithelial-mesenchymal transition of hepatocellular  carcinoma cells [28]. MiR-125a-5p has been reported to inhibit the expression of GAB2 and Bax proteins, and ultimately inhibited the proliferation and invasion of breast cancer cells [29]. MiR-20b-5p has been extensively studied in many cancers, such as lung cancer and colorectal cancer [30,31]. However, the role and mechanism of RNAs such as DTNA, miR-125-5p, miR-20b-5p in the occurrence and development of PAAD still need further investigation. Furthermore, GO-GESA for green module obtained by WGCNA analysis was performed to identified metabolic pathways enriched by the genes within the module. The result showed that those mRNAs were enriched in endoplasmic reticulum (ER) stress. Study had shown that both ER stress and fold-protein response are up-regulated in pancreatic neuroendocrine tumors, which can be used as therapeutic targets for pancreatic tumors [32,33]. Moreover, these mRNAs were also enriched in the response to the nutrient levels pathway and peptide metabolic process pathway. Abnormalities in the processes of energy metabolism, protein metabolism and fat metabolism of tumor tissues had been the focus of research in recent years [34,35]. Analysis of the KEGG pathway revealed that these genes had significantly enrichment in the pancreatic secretion pathway. The above results indicated that these specific enriched pathways are closely related to the pathological mechanism of PAAD and could be further studied.
There were still some limitations to our study. First, although we used independent samples to validate the 14-mRNAs classifier, we did not validate the 8-lncRNAs-8mRNAs-classifier due to the lack of sufficient lncRNA data by ICGC. Second, the mechanisms of those lncR-NAs, miRNAs and mRANs, included in the ceRNA interaction network, were still unclear and required further research. Third, prospective studies of larger samples in different regions need to be used to verify the accuracy of the models.

Conclusions
We developed a lncRNA-miRNA-mRNA co-expression network from a multi-omics perspective. In addition, a multi-mRNAs-based classifier with high accuracy was built to predict the OS of PAAD patients. The lncRNA-miRNA-mRNA ceRNA network. The green hexagon represents lncRNA, the red diamond represents mirRNA, and the blue oval represents mRNA Table 3 Comparison of AUC of 14-mRNAs-based classifier and 16-genes-based classifier