A robust eleven-genes prognostic model can predict overall survival in bladder cancer patients based on five cohorts

Jiaxing Lin China Medical University First Hospital https://orcid.org/0000-0003-3193-5522 Jieping Yang China Medical University First Hospital Xiao Xu Shengjing Hospital of China Medical University Yutao Wang China Medical University First Hospital Meng Yu China Medical University Yuyan Zhu (  yyzhu@cmu.edu.cn ) Department of Urology, The First Hospital of China Medical University https://orcid.org/0000-00030479-3151


Background
Bladder cancer is the tenth most common cancer in the world. It is more common in men than in women, and the morbidity and mortality rate in men is four times higher than that in women [1]. A signi cant risk factor for bladder cancer is smoking, with half of all cases are linked to smoking [2,3]. About 75% of patients with non-muscular invasive bladder cancer are treated by radical tumor resection, followed by intravesical instillation of Bacille Calmette-Guérin vaccine. Approximately 25% of patients have muscular invasive or metastatic bladder cancer, and are treated with radical cystectomy and neoadjuvant chemotherapy [4]. Bladder cancer is a complex disease. Although many clinical factors and molecular markers have been identi ed that can predict prognosis [5], these have low accuracy, and it does not have universal applicability.
With the continued development of gene sequencing technology and expansion of public databases, it is possible to take advantage of biological information to mine sequencing data and identify biomarkers.
This method can utilize large sample sizes with less investment, making it an important new direction to screen disease biomarkers. Of available databases, the Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) database is an authoritative oncology database, and the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database stores curated gene expression datasets. Many studies have constructed a multi-queue veri cation model based on these two databases, such as non-small cell lung cancer [6], and ovarian cancer [7]. Good models provide effective guidance for doctors and patients to make optimal treatment decisions. However, in the study of the bladder cancer model, many models can only be veri ed in two or three cohorts [8,9] and do not have clinical extensibility.
In this study, gene expression and clinical data related to bladder cancer were obtained from TCGA and GEO databases, and common prognostic genes were screened by univariate Cox proportional hazard regression. This prognostic model of bladder cancer was constructed by Lasso Cox regression and then veri ed using ve cohorts. This robust model can help patients with bladder cancer to achieve individualized treatment.

Data obtaining and processing
To reduce the error of the data, we searched the TCGA and GEO databases for bladder cancer cohorts with a sample size of more than 100, and these cohorts need to include survival status and survival time.
In R Programming Language software, the R package "edge" [14] was used to standardize the raw RNA expression matrix and obtain the corresponding log values.

Construction of prognostic model
The Cox proportional hazard regression model was applied to perform univariate Cox proportional hazard analysis of gene expression in each cohort. The hazard ratio (HR) from univariate Cox regression analysis was used to select the genes that were positively or negatively related to prognosis. A gene with HR > 1 was considered a risk gene, and a gene with HR < 1 was considered a protective gene; statistical signi cance was de ned as p < 0.05. The genes with HR > 1 and p < 0.05 were selected from the four cohorts, and then risk genes were obtained by overlapping four groups of genes. Similarly, genes with HR < 1 and p < 0.05 were selected for the four cohorts and combined to obtain the set of protective genes. A Venn diagram was constructed using the online tool Bioinformatics and Evolutionary Genomics (http://bioinformatics.psb.ugent.be/webtools/Venn/). The identi ed risk and protective genes make up a set of prognostic genes.

Page 4/25
The data from TCGA-BLCA as a training set was used to construct a prognostic model. To simplify the model, the genes were selected by univariate Cox regression analysis with a p-value less than 0.01. The R package "glmnet" [15] and "survival" were used to do Least absolute shrinkage and selection operator (Lasso) regression to further screen genes and construct a Cox module. First, the function "glmnet" was randomly simulated 1000 times to construct the model and establish the relationship between Lambda (punishment coe cient) and coef. A higher value of Lambda corresponds to greater punishment. With the increase of Lambda, some gene coef become zero, indicating that the expression of the gene will not affect the model, so this gene can be removed from the model. Then the function "cv.glmnet" was randomly simulated 1000 times for cross-validation (CV). CV is usually divided into hold-out, k-fold and leave-one-out CV. The function used k-fold CV, and k took the default parameter 10. In 10-fold cross validation, the data set is divided into 10 equal parts, and then nine part are tested as training sets and one is used as the validation set. The deviance of the 10 tests were used to estimate the accuracy of the model. When the deviance is minimum, the model is the best, and the coef of the model can then be obtained by using the corresponding Lambda value. Therefore, the model obtained by using Lasso regression is a model with excellent performance but the smallest number of independent variables. Finally, we obtained the genes and the corresponding coef to build the model. The prognostic model was de ned as: Risk score =∑ n i (exp i • coef i ) (where n is the number of genes, exp i is the expression of the ith gene, and coef i is the regression coe cient of the ith gene). Using the obtained model, we calculated the risk score of each patient in the four cohorts.
Kaplan-Meier analysis R packages "survival" and "survminer" were used for Kaplan-Meier analysis, and the function "res.cat" was used to nd the best cut-off value of factors. The cut-off was used to divide the sample into a high-risk group and a low-risk group to construct the Kaplan-Meier curve with the smallest p value. The risk score distribution, gene expression, and patient survival status data were plotted using the R package "pheatmap".
The set of mRNA genes in the TCGA-BLCA cohort with univariate Cox analysis values less than 0.05 were selected to construct a bladder cancer co-expression network by Weighted Gene Co-expression Network Analysis (WGCNA). The R package "WGCNA" was used to construct the co-expression network [16]. This method takes advantage of similarities of gene expression and groups the genes with similar expression patterns into the same module, with the idea that genes in the same module may share physiological function. We then explored the relationship between the clinical-factor/risk-score and module, and applied Pearson correlation to determine the module that was most related to the risk score. The key genes were selected by the calculated correlation between genes (module-membership > 0. 8), and the correlation between genes and clinical traits (Gene-signi cance > 0. 5).

Pathway and process enrichment analysis
Identi ed genes were entered into the Metascape database (http://metascape.org) [17] for pathway and process enrichment analysis. The enrichment analysis included "KEGG Pathway, GO Biological Processes, Reactome Gene Sets, Canonical Pathways, and CORUM" to evaluate the potential biological functions and pathways of the selected genes.
CIBERSORT and ESTIMATE algorithm CIBERSORT (Cell-type Identi cation By Estimating Relative Subsets Of RNA Transcripts) is a bioinformatics algorithm to calculate cell composition from gene expression pro les of complex tissues [18]. The combination of CIBERSORT and LM22 (leukocyte signature matrix) can be used to calculate the content of 22 kinds of human leukocyte subsets. We used the R package "CIBERSORT" to calculate the number of immune cells in each sample of the TCGA-BLCA cohort. ESTIMATE ("Estimation of STtromal and Immune cells in MAlignant Tumours using Expression data") is a tool that uses gene expression trends to infer the fraction of stromal and immune cells in tumor samples [19]. The immune score of each patient was calculated by the algorithm. The R package "estimate" was used to directly calculate the score of each sample in TCGA-BLCA cohort. Immune score represents the content of immune cells, and the higher the score, the higher the cell content.

Statistical Analysis
All the statistical analyses were carried out by using R Programming Language software (Rx64 3.5.1). All R packages were obtained from CRAN (https://cran.r-project.org) or BioConductor (http://www.bioconductor.org). The two groups were compared by the Wilcoxon test, and comparison between multiple groups was performed by Kruskal-Wallis test. Statistical signi cance was de ned as p < 0.05. Difference scatter plots were constructed using the R package "beeswarm". We used the R package "vioplot" to draw violin pictures.

Data processing and research design
We obtained the raw RNA sequencing and clinical data of TCGA-BLCA (n=412), GSE13507 (n=165), GSE32548 (n=146) and GSE32894 (n=308). R package "edge" was used to standardize the raw RNA expression matrix of the four cohorts. We utilized data only from patients associated with RNA sequencing data, survival time, survival status, and primary tumor for further analysis. The basic clinical information of the remaining patients is summarized in Table 1, the sample sizes of these four cohorts are all greater than 100. The grade of bladder cancer is closely related to recurrence and invasive behavior. Two grading methods were used in these four cohorts. TCGA-BLCA and GSE13507 used the WHO grading standard of 2004, which was divided into PUNLMP (Papillary urothelial neoplasms of low malignant potential), low grade, and high grade. GSE32548 and GSE32894 used the WHO grading standard of 1999, which was divided into grade 1 (G1), grade 2 (G2), and grade 3 (G3).
The genes related to the prognosis of bladder cancer were screened from each dataset, and the identi ed genes were combined to obtain a set of genes that were common to the four datasets. We used some of these genes to construct a prognostic model in the TCGA-BLCA cohort, and then tested the model with ve cohorts for veri cation. We also used the co-expression network, ESTIMATE, and CIBERSORT algorithm to explore the relationship between the model and tumor microenvironment. The research process is shown in Fig. 1.

Construction of prognostic model
Univariate Cox proportional hazard analysis was carried out in four cohorts, and genes affecting prognosis were obtained. There were 3301 genes in TCGA-BLCA, 440 genes in GSE13507, 2404 genes in GSE32548, and 2768 genes in GSE32894 that met the criteria of hazard ratio (HR > 1 and p < 0.05). Combining the four datasets allowed identi cation of twenty-four risk genes as shown in the Venn diagram ( Fig. 1). There were 4743 genes in TCGA-BLCA, 414 genes in GSE13507, 1717 genes in GSE32548, and 3248 genes in GSE32894 that met the criteria of (HR < 1 and p < 0.05). Combining the four datasets allowed identi cation of ten protective genes (Fig. 1). Together, these 24 risk genes and 10 protective genes constitute the prognostic gene set. Because of the large sample size of TCGA-BLCA, we used this cohort to build the prognostic model. First, genes with univariate Cox p-values less than 0.01 in TCGA-BLCA were selected. The 24 genes were analyzed by Lasso regression analysis (Fig. 2a), when the number of genes in the model was 11, the deviance was the smallest (Fig. 2b). According to the Lambda value, the corresponding regression coe cients (coef) of the selected 11 genes could be determined. The prognostic model could then be constructed by using the corresponding coef of the 11 genes. The model has good predictive performance and includes a small number of genes with the Lasso regression analysis. Finally, we successfully constructed a prognostic module: Risk score = SERPINE2*0.02 + PRR11*0.13 + FABP6*(-0.000318) + C16orf74*(-0.0564) + DSEL*0.107 + DNM1*0.0142 + COMP*0.0223 + TNK1*(-0.0972) + ELOVL4*0.00152 + RTKN*0.126 + MAPK12*0.0304. The basic information and coef values of the 11 genes are listed in (Additional e 1: Table S1). The results of univariate regression analysis of these 11 genes in 4 cohorts are shown in (Additional e 2: Table S2).
Kaplan-Meier analysis of 11 genes Eleven genes were taken Kaplan-Meier analysis in 4 cohorts. Using the heat map to show the results of the study (Fig. 2c), except for DSEL in GSE32894 and C16orf74 in GSE32548, the other analyses were statistically signi cant (p < 0.05). SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, DNM1, and COMP showed that the prognosis of patients with high expression was worse, and the analysis of ELOVL4 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2d). TNK1, FABP6, and C16orf74 showed that the prognosis of the low expression group was worse, and the analysis of FABP6 in TCGA-BLCA was taken as an example (p < 0.001, Fig. 2e).
Veri cation of the prognostic model The prognostic model was used to calculate the risk scores of each patient in the training set (TCGA-BLCA) and three test sets (GSE13507, GSE32548, and GSE32894). The function "res.cat" of R package "survminer" was used to nd the best cut-off value for Kaplan-Meier analysis. We identi ed the best cutoff value with a risk score of 2.40 for TCGA-BLCA. Using this method, the cut-off values of GSE13507 / GSE32548 / GSE32894 were 2.56 / 1.96 / 1.92. To observe the predictive ability of risk scores in their respective cohorts, we divided the samples into high-risk and low-risk groups according to the cut-off value of each cohort, and constructed the Kaplan-Meier curves. The results showed that the prognosis of patients with high-risk was signi cantly worse than that of patients with low-risk in the four cohorts (p<0.001, Fig. 3a-d). The Receiver Operating Characteristic (ROC) curves of the four cohorts were drawn: the 1 / 3 / 5 year Area Under the Curve (AUC) values for the TCGA-BLCA cohort were 0.686, 0.665, and 0.666, respectively (Fig. 3e); those for the GSE13507 cohort were 0.800, 0.742, and 0.697, respectively (Fig. 3f); those for the GSE32548 cohort were 0.826, 0.792, and 0.763, respectively (Fig. 3g); and those for the GSE32894 group were 0.781, 0.831 and 0.839 (Fig. 3h). Additional e 3: Figure S1 shows the risk score distribution, gene expression values, and survival status of patients in both the high-risk group and the low-risk group.
The clinical factors and risk scores of the four cohorts were analyzed by univariate Cox and multivariate Cox regression analysis ( Table 2). The results of univariate analysis showed that T stage was more effective in predicting prognosis among the clinical factors, and three cohorts had statistical signi cance. The risk scores were statistically signi cant in all four cohorts, and the p value of three cohorts was lower than that of the T stage. In multivariate Cox analysis, risk scores were statistically signi cant in three cohorts, indicating that the three cohorts were independent of other clinical factors in predicting prognosis. In this analysis, only two cohorts of T stage had statistical signi cance, so it is obvious that T stage is not as strong as risk score to predict the prognosis. Finally, we compared the risk scores for different grades and T stages in the four cohorts, and found that the risk scores increased with the increase of grade and T stage (p < 0.001, Additional e 4: Figure S2a, b). In the GSE32548 cohort, we compared the risk scores of FGFR3, PIK3CA, and TP53 (or with the MDM2 alteration) for wild type and mutant type, and found a lower risk score of mutant type than that of wild type for the FGFR3 and PIK3CA groups (p < 0.001, Additional e 4: Figure S2c). In TP53 (or with the MDM2 alteration), the score of mutant type was higher than that of wild type (P < 0.001, Additional e 4: Figure S2c). Seven genes and model were successfully veri ed in GSE48705 In order to further verify the robustness of this model, RNA sequencing and clinical data of GSE48075 were obtained from the GEO database. We selected the samples (n = 73) with transcriptional data, survival time, and survival status data for the following analysis. Firstly, the survival prediction ability of 11 genes was veri ed in the GSE48075 cohort. We used the R package "survminer" to nd the best cut-off value for 11 genes, and the samples were divided into two groups for Kaplan-Meier analysis according to the best cut-off values. The results showed that SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, and COMP were statistically signi cant (p < 0.05, Additional e 5: Figure S3), and the prognosis was worse in the high expression group which was consistent with the analysis result of the previous four cohorts. The risk score of patients was calculated according to the model, and the risk score was analyzed by Kaplan-Meier analysis. The prognosis of the high risk-score group was terrible, and the difference was statistically signi cant (p = 0.0028, Fig. 4a). We drew the ROC curves of the risk score, and the AUC value of 1/3/5 year was 0.676/0.630/0.755 (Fig. 4b). Fig. 4c showed the risk score distribution, gene expression values, and survival status of patients between high and low-risk groups.

Weighted co-expression network analysis and enrichment analysis
A co-expression network was constructed with the TCGA-BLCA data. The co-expression network was constructed with 3844 coding genes (with Cox regression analysis p-values less than 0.05) and 403 samples. First, the expression matrix was transformed into a topological overlap matrix according to β = 4. Then, the genes were divided into different modules (Fig. 5a) using the dynamic pruning tree method. Next, the association analysis of clinical traits and modules (Fig. 5b) showed a high correlation between the turquoise module and risk score (cor = 0.76, p = 2E-74). There was also a high correlation between the turquoise module and survival status (cor = 0.25, p = 9E-07) / grade (cor = 0.3, p = 1E-09) / stage (cor = 0.32, p = 1E-10). We selected genes in the turquoise module for further analysis and selected 128 key genes (Fig. 5c) according to gene-signi cance > 0.5 and module-membership > 0.8. To explore the potential function of these key genes, pathway and process enrichment analysis of these key genes were performed, as shown in Fig. 5d. The three most highly signi cantly enriched terms were extracellular matrix organization, collagen bril organization, and ECM proteoglycans, all related to the tumor microenvironment (TME).

Immune cells can be combined with risk scores for prognostic analysis
We used CIBERSORT to calculate the in ltration ratio of immune cells in 22 TCGA-BLCA samples and used a bar chart to show the in ltration of high and low-risk groups (Fig. 6a). Then, the Wilcoxon test was used to compare the difference between high and low-risk groups. The results showed that B cells naive, Macrophages M0, and Macrophages M1 showed high in ltration in the high-risk group; B cells memory, Dendritic cells resting, and Dendritic cells activated showed high in ltration in the low-risk group (P < 0.001, Fig. 6b). Furthermore, we took the risk score and the in ltration degree of these six kinds of immune cells for joint prognostic analysis. The samples were divided into four clusters for Kaplan-Meier analysis according to the median value of the risk score and immune cell in ltration degree. The results showed that these groups could also be used for prognostic analysis (p < 0.05, Fig. 6c-h). Among them, the prognostic ability of B cells memory is the best. When the degree of B cells memory in ltration is low, and the risk score is high, the prognosis of this cluster is signi cantly worse than that of other clusters. We used ESTIMATE to calculate the immune-score of the samples, and a high immune-score indicates a high degree of immune cell in ltration. We still used the immune score and risk score for prognostic analysis. The results showed that the cluster with low immune-score and high risk-score had the worst prognosis (Additional e 6: Figure S4).

Discussion
Bladder cancer is a heterogeneous disease with a high incidence and recurrence rate, but there is no robust predictive tool to guide clinical treatment [5]. In this study, prognostic genes were screened from four cohorts with the whole transcriptome, and the common prognostic genes were selected to construct the model. The model successfully predicted the overall survival of ve cohorts about 1000 bladder cancer patients, and it is the research with the largest cohort size in the same type of research.
A variety of regional source cohorts are used to jointly develop the model, which makes the model have higher credibility and broader applicability. In our study, all genes in all cohorts were then analyzed by univariate Cox proportional hazard analysis to screen common prognostic genes in four cohorts. After further screening, a prognostic model was constructed using the data from the TCGA-BLCA cohort.
Instead of using the genes obtained by analysis of a single cohort to construct a prognostic model, the prognostic genes common to multiple cohorts were used to make the model more stable and reliable. The patients in the TCGA-BLCA cohort were from North America, GSE13507 was from Asia, and GSE32548 and GSE32894 were from Europe. It is concluded that this model has a wide range of applicability.
The main nding of this study is that the 11-gene model we developed has a robust prognostic ability and successfully predicted the prognosis of ve cohorts. Kaplan-Meier analysis showed that the prognosis of the high-risk group was poor in all the four cohorts (p < 0.001). The 1-year AUC values of the TCGA-BLCA, GSE13507, GSE32548, and GSE32894 cohorts were 0.686, 0.800, 0.826 and 0.781 respectively, indicating that the risk score has a good ability to predict prognosis. Univariate and multivariate Cox analysis of clinical factors and risk scores showed that the ability of risk scores to predict prognosis was better than other clinical factors. These results clearly indicate the good prognostic ability of this 11-gene model in bladder cancer. We also analyzed the relationship between risk score and different clinical status, and found increased risk score with the increase of bladder cancer T stage and grade (p < 0.001). There are also signi cant differences in risk scores between wild type and mutant types of different genes. We analyzed the GSE32548 mutation data and found lower risk score both in the group of FGFR3 and PIK3CA mutation. In contrast, in the presence of a TP53 mutation (or with MDM2 alteration), the risk score was higher. According to previous reports, mutations in FGFR3 [20] and PIK3CA [21] are associated with good prognosis, but TP53 mutation (or with MDM2 alteration) is associated with poor prognosis [21,22]. These conclusions indirectly verify the prognostic ability of the risk model. Finally, the 11-gene model was successfully veri ed in independent cohort GSE48075. According to the Kaplan-Meier analysis of risk score, the prognosis of the high risk-score group was worse, and there was a signi cant difference. The model is veri ed by four internal cohorts and one external cohort, which shows that the model has the potential to be used in the clinic.
Eleven genes are potential prognostic markers and therapeutic targets for bladder cancer. These 11 genes have a stable prognostic ability in TCGA-BLCA, GSE13507, GSE32548, and GSE32894 cohorts. And Kaplan-Meier analysis showed that SERPINE2, RTKN, PRR11, MAPK12, ELOVL4, DSEL, and COMP was successfully veri ed in GSE48075. Among them, only C16orf74 and RTKN were previously reported to be associated with bladder cancer, while other genes were not reported and are worthy of future research for bladder cancer. SERPINE2 can enhance the tumor-promoting effect of ERK signal transduction in intestinal epithelial cells and is a potential therapeutic target for colorectal cancer [23]. This gene can also drive distant metastasis of breast cancer [24]. PRR11 is overexpressed in ovarian cancer [25], and has the potential to be used as a molecular marker. FABP6 is overexpressed in colon cancer and may play an important role in early carcinogenesis [26]. Decreased expression of C16orf74 is closely related to the progression of non-muscular invasive bladder cancer [27], and it may also be a potential therapeutic target for pancreatic cancer [28]. Most of the studies of DSEL are studies of congenital diseases, such as diaphragmatic defect [29] and Ehlers-Danlos syndrome [30], but there have been very few studies related to cancer. DNM1 is a kinetin-related protein that plays an important role in mitochondrial division [31].
COMP is a cartilage biomarker [32], and COMP mutation can cause pseudochondrodysplasia [33]. TNK1 is a tumor suppressor that can down-regulate the activity of Ras [34]. Studies have shown that RTKN is highly expressed in bladder cancer [35], and some experiments have shown that some miRNA can inhibit tumor growth by targeting RTKN [36]. MAPK12, one of four types of p38 MAPK, is a potential therapeutic target for colon cancer [37]. ELOVL4 is a member of the fatty acid elongation enzyme ELOVL family and is highly methylated in cancers such as hepatocellular carcinoma [38]. These genes may be involved in the essential biological process of bladder cancer and have great research value.
The combination of risk score and B cell memory can be used to analyze the prognosis of patients with bladder cancer. In the present study, the key genes positively related to the risk score were identi ed by WGCNA. The enrichment analysis of these genes showed that these genes were related to TME, indicating that the patients' risk score was closely related to TME. To further explore the relationship between risk score and TME, we calculated the in ltration degree of 22 kinds of immune cells in the sample. We found that there were differences in the in ltration degree of many immune cells between high and low risk. B cells memory stands out in the evaluation of combined immune cell and risk prognostic analysis, and the prognosis of patients is the worst in the case of low in ltration and high risk. We also used the ESTIMATE algorithm to calculate the overall immune score of the samples. The prognosis is poor in the case of low immune-score and high risk-score. Tumor-in ltrating lymphocytes have been reported as a useful prognostic factor for patients with bladder cancer [39], and B cells are a signi cant component of in ltration in these cells. B cell is a good prognostic factor in many cancers (such as high grade serous ovarian cancer) [40]. CD20 B cells have also been reported to be associated with longer survival in bladder cancer [41]. Bladder cancer has a good response to immunotherapy, but there is a lack of unique immune prognostic biomarker to guide treatment [42]. We combine B cell memory and risk score for prognostic analysis, which has good prognostic ability and potential for immunotherapy.
Although this 11-gene risk prognostic model can well predict the prognosis of bladder cancer, there are still several limitations to our conclusions. We used pre-existing data from available databases and did not verify the model with additional data. We did not nd the general cut-off value of the model, so when the model is applied to the clinic, it needs to conduct a large local sample study to nd the best cut-off for the cohort.

Conclusions
In short, we constructed a robust prognostic model of bladder cancer, which successfully predicted prognosis in 5 bladder cancers. The genes in this model also have the same prognosis in multiple cohorts, which is of great signi cance for research. When B cell memory and risk score are combined to group the samples, the cluster with obviously poor prognosis can be found. The model has strong stability and applicability, which can be used in the individualized treatment of bladder cancer to improve the prognosis. Flow chart of analysis.