A panel of Transcription factors identified by data mining can predict the prognosis of head and neck squamous cell carcinoma

Background Transcription factors (TFs) are responsible for the regulation of various activities related to cancer like cell proliferation, invasion, and migration. It is thought that, the measurement of TFs levels could assist in developing strategies for diagnosis and prognosis of cancer detection. However, due to lack of effective genome-wide tests, this cannot be carried out in clinical settings. Methods A complete assessment of RNA-seq data in samples of a head and neck squamous cell carcinoma (HNSCC) cohort in The Cancer Genome Atlas (TCGA) database was carried out. From the expression data of six TFs, a risk score model was developed and further validated in the GSE41613 and GSE65858 series. Potential functional roles were identified for the six TFs via gene set enrichment analysis. Results Based on our multi-TF signature, patients are stratified into high- and low-risk groups with significant variations in overall survival (OS) (median survival 2.416 vs. 5.934 years, log-rank test P < 0.001). The sensitivity and specificity evaluation of our multi-TF for 3-year OS in TCGA, GSE41613 and GSE65858 was 0.707, 0.679 and 0.605, respectively, demonstrating good reproducibility and robustness for predicting overall survival of HNSCC patients. Through multivariate Cox regression analyses (MCRA) and stratified analyses, we confirmed that the predictive capability of this risk score (RS) was not dependent on any of other factors like clinicopathological parameters. Conclusions With the help of a RS obtained from a panel of TFs expression signatures, effective OS prediction and stratification of HNSCC patients can be carried out.


Background
Head and neck squamous cell carcinoma (HNSCC) is a solid malignancy that is the sixth most common human cancer, with an annual incidence of more than 600,000 [1]. A combination of chemotherapy, radiotherapy, and adequate surgical resection has transformed HNSCC from a universally deadly disease to a potentially curable one; nevertheless, fewer than half of all patients are saved, with a 5-year survival rate < 50% [2]. Traditional stratification schemes based on multiple clinicopathological parameters such as the American Joint Committee on Cancer (AJCC) TNM staging system have been recognized as the primary criteria providing prognostic guidance for the management of patients with HNSCC [3,4]. Despite the ease of its implementation and its wide application, TNM staging is insufficient for forecasting prognosis and estimation for subsets of HNSCC patients, and individual variation of survival times within the same stage is considerable [5,6]. Risk scores (RS) that capture such individual variation might guide better therapeutic strategies. An increasing body of evidence suggests that molecular risk assignments could be used to promote prognostic assessment and identification of potential high-risk HNSCC patients [6][7][8][9].

Open Access
Cancer Cell International Proteins that bind to specific DNA sequences and control the transcription rate of genetic information from DNA to mRNA, are called Transcription factors (TFs) [7]. Their role is to regulate genes (turn on and off ) and ensure expression in the required cells at the appropriate time and at required quantities. Increasing amounts of evidence suggest that deregulation of TFs characterizes the majority of human cancers, and some have been associated with cancer diagnosis and prognosis [8,9]. For example, p53 is a tumor suppressor protein, and mutations of this gene can be detected in more than half of all human cancers [10]; c-Myc is another important oncogene that is overexpressed in some malignant cancer cells and has been associated with tumor progression and poor clinical outcome [11]. Because of the significance of TFs in many biological processes and their aberrant activity in human cancer, we hypothesized that expression patterns of TFs may act as potential prognostic biomarkers of cancer.
The current cancer sample datasets which can be accessed via the TCGA and other similar resources, are an abundant data source which can assist in the identification of biomarker signatures and predict disease outcomes [12,13]. In our study, an extensive evaluation of the RNA-seq data across a 502 HNSCC patient cohort was carried out with the help of available TCGA datasets. Using a univariate survival analysis (USA) and multivariate Cox stepwise regression (MCSR) algorithm, we identified six prognosis-related TFs. Based on their expression in the TCGA series, a prognostic model was built and validated in another independent series (GSE41613 and GSE65858). Further MCRA and stratified analysis was used to confirm if the multi-TF signature was an independent indicator of HNSCC. Our investigation will put forward new insights in methods of overall survival (OS) prediction in patients suffering from HNSCC.

Patient data extraction
Gene expression data for HNSCC were download from the TCGA (https ://cance rgeno me.nih.gov/) database. The HNSCC cohort comprised 502 tumor tissues and 44 adjacent normal tissues. The probe IDs were converted to gene symbols in these datasets based on their Ensembl gene IDs, generating a dataset including the expression values for each gene. Corresponding patient clinical data which includes the gender, age, alcohol consumption, histologic tumor grade, lymph node dissection, HPV status, TNM stage, PNI, ENE, and LVI are displayed in Additional file 1. The GSE41613 and GSE65858 data set was download from the GEO database as an external validation series. The microarray data of GSE41613 and GSE65858 were based on the Affymetrix Human Genome U133 Plus 2.0 Array platform and Illumina HumanHT-12 V4.0 expression beadchip, respectively. Probes were matched to the gene symbols with a manufacturer-provided annotation file.

Identification of predictive TFs
The RNA-seq data of HNSCC covered 18,101 coding genes containing 1639 TFs. The DESeq package in Bioconductor was used to screen the differentially expressed TFs in HNSCC (P adj < 0.05 and absolute log2FC > 1). TF expression values were transformed as the log2(x + 1) of normalized expression values for further analysis. After excluding patients without clinical survival information, 498 patients were chosen for the USA. TFs with a P-value of < 0.01 were selected for USA using the R survival package. TFs that passed this filter criterion were further analyzed with a multivariate Cox stepwise regression (MCSR) algorithm, as described previously [14,15]. At each stage in the process, the deletion of each variable was tested with the help of a chosen model fit criterion. Based on whether the loss of a variable gave statistically insignificant deterioration of the model fit (F test), the variables were deleted till a statistically significant loss of fit was seen. Based on the estimated regression coefficients in the MCRA and the selected TFs, a risk score was then developed to combine the expression levels of six TFs (HOXA1, ZNF662, LHX1, ZBTB32, MEIS1 and HOXB8) in HNSCC specimens. In this study, the six-TF signature was defined as a multi-TF signature.

Statistical analyses
According to the MSCR algorithm, the RS of individual patients were estimated, and they were split into highand low-risk subgroups based on the median RS cut off. This RS formula was further confirmed by the GSE41613 and GSE65858 dataset. Univariate Cox proportional hazards regression analyses was used to determine the predictive value of our multi-TF signature and other traditionally evaluated clinically relevant parameters, defining the hazard ratios and 95% confidence intervals. Multivariable Cox regression analyses were used to determine if the RS values were independent predictors in HNSCC patients. In stratified analysis, the prognosis power of our multi-TF signature in various clinical subtypes was determined by Kaplan-Meier analysis via log rank tests. The sensitivity and specificity of the RS was analyzed using receiver operating characteristic (ROC) analyses. For the log-rank tests, univariate survival analyses and multivariable Cox regression analyses. P < 0.05 was considered as statistically significant. All statistical analyses were performed with SPSS 24.0 (IBM, Armonk, NY, USA) and R 3.5.1.

GSEA analysis
A Java program (http://softw are.broad insti tute.org/ gsea/index .jsp) was used to perform GSEA with the MSigDB C2 CP: Canonical pathways gene set collection. Cytoscape (version 3.6.0) was employed to visualize this GSEA. Using this, we could investigate the relationship between particular gene sets and risk scores for all genes, and identify the most positively and negatively associated ones with such enrichment scores. Totally, 1000 random sample permutations were carried out, with a significance threshold of FDR < 0.1 and P < 0.05.

Results
To identify potential prognosis biomarkers, we analyzed gene expression profiles of HNSCC downloads from the TCGA database. Among the expression data for 18,101 mRNAs, expression values for 1639 TFs were extracted and calculated with R (DESeq). We compared gene expression levels between normal and HNSCC tissues and screened 258 dysregulated TFs in tumor tissues. Among these TFs, 110 were down-regulated and 148 were up-regulated relative to control tissues (Fig. 1a, b).

Development of a multi-TF predictive model in the TCGA series
To identify the TFs involved in HNSCC outcomes, gene expression profiles of 498 patients with available survival information were subjected to USA. We screened a panel of 24 TFs that were associated with OS (P < 0.01). Those 24 TFs were further assessed using an MCSR algorithm aimed at constructing a multi-TF signature that is predictive of survival time in the TCGA series. In this way, six TFs were screened out as a candidate signature. The detail information of these six TFs is shown in Table 1. Based on these six TFs, RS values were assigned to each patient as follows: RS = 0.18495* HOXA1 − 0.30561* ZNF662 + 0.16043* LHX1 − 0.26993* ZBTB32 − 0.25013* MEIS1 + 0.3754* HOXB8. Patients on either side of the median RS cut-off were split into high-and low-risk groups i.e., high risk-patients had a higher probability of dying earlier than the low risk ones (Fig. 2b). From a survival heatmap, we were able to see that MEIS1, ZNF662 and ZBTB32 were protective TFs that had increased expression in low-risk groups, while HOXA1, LHX1 and HOXB8 were risk-associated TFs that had increased expression in the high-risk individuals (Fig. 2c). Kaplan-Meier curves for high-and low-risk groups are shown in Fig. 2b. The median OS in low-risk patients was 5.934 years longer than the 2.416 years median OS of high-risk patients ( Fig. 2d; log-rank test P < 0.001). Sensitivity and specificity evaluation of our multi-TF model was carried out by time-dependent ROC analysis. In 3-year ROC curves, the area under the curve was 0.707, suggesting good predictive performance for 3-year OS (Fig. 2e).

Validation of the multi-TF signature
The predictive value of our 6-TF signature was validated in another independent HNSCC series obtained from GEO (GSE41613 and GSE65858) to confirm its   . 2 TFs risk score analysis of HNSCC patients in TCGA dataset. a The low and high score group for the TFs signature in patients; b the survival status and duration of HNSCC cases; c heatmap of the six prognostic-related TFs expression in HNSCC. The color from blue to red shows a trend from low expression to high expression; d the Kaplan-Meier curve for overall survival of two patient groups with high and low-risk groups in the TCGA set. The differences between the two curves were evaluated by the two-side log-rank test; e ROC analysis of the risk scores for overall survival prediction in the TCGA set. The AUC was calculated for ROC curves, and sensitivity and specificity were calculated to assess score performance reproducibility. The same prognostic RS model obtained from the TCGA series was used to calculate the RS for 97 patients in the GSE41613 dataset and 270 patients in the GSE65858 dataset. Depending on the median cut-off of RS, individuals were categorized into low-risk group and high-risk groups in two datasets. In GSE41613 datasets, the median OS in low-risk group was 2.334 years, while median OS in patients in high-risk group was 6.472 years (Fig. 3a). In GSE65858 cohort, the median OS in low and high-risk group was 3.479 years and 5.312 years, respectively (Fig. 3b). Moreover, the sensitivity and specificity evaluation of our multi-TF for 3-year OS in GSE41613 and GSE65858 was 0.679 and 0.605, respectively (Fig. 3c,  d). This result consistent with the findings described above, demonstrated that our multi-TF signature had good reproducibility in HNSCC.

Determination of independent predictive activity of the multi-TF signature
To further investigate whether the multi-TF signature was an independent predictor of HNSCC prognosis not tied to underlying clinicopathological parameters, we performed univariate and MCRA. In univariate Cox regression, we found we that the multi-TF signature (95% CI HR 2.119-3.668, P < 0.001), TNM staging (95% CI HR 1.186-1.701, P < 0.001), lymphovascular invasion (LVI, 95% CI HR 1.186-1.701, P = 0.002), perineural invasion (PNI, 95% CI HR 1.186-1.701, P < 0.001), and extranodal   (Table 2). Therefore, a data stratification analysis was carried out for patients based on PNI and ENE status. Our multi-TF signature showed effective prognostic power in the ENE (±) and PNI (±) subgroups ( Fig. 4a-d). We also tested the prognostic value of our multi-TF in various clinicopathological statuses that were not defined as independent prognosis factors, including TNM stage and LVI (Additional file 2: Figure S1a-d). Therefore, these results prove that our model can act as an independent predictor for the outcomes of HNSCC patients. More importantly, the results of the ROC analysis revealed that our multi-TF signature could efficiently predict OS, suggesting that this multi-TF signature is a superior predictive model when compared to the existing TNM staging (Additional file 3: Figure S2).

Identification of 6-TF signature correlated with biological pathways and processes
Using a GSEA, we explored those processes and signaling pathways that were associated with our 6-TF signature, using RS for classification. Cytoscape was used to visualize significant gene sets (FDR < 0.10, P < 0.05, Fig. 5, Additional file 4). We identified clusters of related genes associated with high-risk scores, including genes relating to regulation of cell cycle, regulation of metabolic process, apoptotic process, response to stimulus, immune system processes and mRNA catabolic processes. Therefore, we predict that the six prognostic TFs have an important functional role in the progression on tumors.

Discussion
Currently, one of the main parameters to help clinicians determine patient outcomes and plan treatments, is the TNM staging; nevertheless, variation in outcomes suggests that clinical features cannot fully account for phenotypes of different potential subtypes [3,4,16]. Oncogenesis is characterized by several stages that need modifications in gene expression programs [17]. TFs play important roles in controlling this. Therefore their dysregulation is a reason for the acquisition of tumor-associated properties [18]. Previous studies [19,20] reported that the expression patterns of TFs may be an effective means of grading tumor subtypes. However, to date, expression profiles based on TFs in HNSCC have not been clarified.
Our study was aimed at identifying a TF expression signature that could predict outcomes for HNSCC patients at individual levels. To this end, we evaluated the prognostic significance of all differentially expressed TFs in HNSCC that were chosen on the basis of USA of the RNA-seq data retrieved from TCGA. Unfortunately, the requirement to measure a number of genes, reduces the efficiency of prognostic biomarkers in clinical applications [21]. Therefore, using an MCSR algorithm, a multi-TF signature was identified. This was more effective than individual TFs as predictive potential was maximized while the number of predictors were reduced [14,15,21,22]. The results of MCSR suggested to us that we should construct a model consisting of six TFs that forecast the survival time of HNSCC patients. Among these TFs, HOXA1 was previously reported as an oncogene in HNSCC. Upregulation of HOXA1 promoted the migration and invasion of HNSCC cells via the EMT pathway. More importantly, high levels of HOXA1 were discovered to be linked with poor prognosis of HNSCC [23]. This finding accorded with our results. Another candidate HOXB8, similarly to HOXA1, was a member of HOX family that was found to be significantly linked with tumor metastasis and shorter overall survival in many human cancers [24][25][26]. Further investigation revealed that HOXB8 was a predictor of the effects of FOLFOX4 chemotherapy in metastatic colorectal cancer [27]. Therefore, we hypothesized that HOXB8 may act as an oncogene in HNSCC progression; further investigation of this hypothesis is needed. Aberrant expression of ZNF662 caused by epigenetic changes via DNA hypermethylation was a valuable biomarker of tumorigenesis and advanced HNSCC [28]. In our study, ZNF662 was expressed at low levels in HNSCC and was associated with shortened survival. Down-regulation of MEIS1 modulated the leukemic cell response to chemotherapeutic-induced apoptosis [29]. Additionally, LHX1 was reported as a driver gene of clear cell renal cell carcinoma proliferation, apoptosis, and promoting tumor growth [30]. In the present study, the up regulation of LHX1 was an indicator of poor prognosis of HNSCC. This suggests that MEIS1 may participate in the regulation of chemoresistance in HNSCC and may be potential targets for anti-HNSCC drugs in the future. A recent study showed that ZBTB32 facilitated transcriptional repressor Zpo2 targeting to the GATA3 promoter to downregulate GATA3 expression and activity. Modulation of GATA3 by ZBTB32 in turn caused the development of aggressive breast cancers [31]. In our study, loss of ZBTB32 was associated with shortened survival time in HNSCC.
Taken together, the Kaplan-Meier analyses and ROC analyses demonstrated that expression of these TFs was Previous simulations have shown that the prognostic models which are significantly linked with survival times in the training data set can also be developed when using entirely independent dataset [32]. In this study, the usefulness of this multi-TF signature was validated in the non-overlapping cohort in GSE41613 and GSE65858, indicating good reproducibility of this multi-TF signature in HNSCC.
Multivariate analysis showed that PNI and ENE were independent clinicopathological factors for predicting the risk of HNSCC. Perineural growth is an unusual means of tumor cells growth that is not least resistance; it indicated high risk of postoperative recurrence and was an important poor prognosis factor in HNSCC [33]. ENE was defined as tumor cells infiltrating extranodal tissues beyond the capsule of affected lymph nodes. It was a characteristic of more aggressive cancer and was associated with shortened survival [34]. In stratified analysis, we found that the multi-TF signature remained a powerful forecaster of prognosis within these subsets, suggesting that our multi-TF was independent of these important clinicopathological parameters. This result implied that our multi-TF signature has the potential ability to enhance clinical prognostic tests. This will assist in improving patient stratification and treatment planning accordingly in future trials.
As with all research, our study also has its limitations. For one, due limited data, out of the thousands of known and predicted TFs, we could only obtain 1639 gene expression profiles. In addition, some clinical information was incomplete, which made our study susceptible to the inherent biases. Finally, while GSEA was used to investigate biological processes associated