Expression profile analysis identifies key genes as prognostic markers for metastasis of osteosarcoma

Background OS is the most common malignant tumor of bone which was featured with osteoid or immature bone produced by the malignant cells, and biomarkers are urgently needed to identify patients with this aggressive disease. Methods We downloaded gene expression profiles from GEO and TARGET datasets for OS, respectively, and performed WGCNA to identify the key module. Whereafter, functional annotation and GSEA demonstrated the relationships between target genes and OS. Results In this study, we discovered four key genes—ALOX5AP, HLA-DMB, HLA-DRA and SPINT2 as new prognostic markers and confirmed their relationship with OS metastasis in the validation set. Conclusions In conclusion, ALOX5AP, HLA-DMB, HLA-DRA and SPINT2 were identified by bioinformatics analysis as possible prognostic markers for OS metastasis.

Background Osteosarcoma (OS) is the most common type of cancer that arises in bones and most people diagnosed with OS are under the age of 25 [1]. The incidence of OS in the general population is 2-3/million/year and the peak at the age of 15-19 is 8-11/million/year [2,3].
OS is characterized by early metastasis, poor prognosis without treatment [4], and more than 90% of patients die from lung metastasis before multiple chemotherapy. OS is currently undergoing multidisciplinary treatment, with approximately 15-20% of patients showing signs of metastasis at diagnosis, most in the lungs. Metastasis remains the leading cause of death in patients with OS, compared with 70% of patients with localized disease, and only about 20% becoming long-term survivors.
Previous studies have investigated mutational alterations or gene factors in an attempt to identify candidate OS driver oncogenes or tumors suppressors [5][6][7][8]. So far, for patients with metastatic OS, neither prognostic factors nor optimal treatment methods have been well established. Therefore, more attention must be paid to more precise risk assessment, not only for patient consultation, but also for determining treatment options based on reliable stratified criteria. In order to detect pulmonary metastasis OS early and improve poor survivorship, it is important to further explore more effective prognostic biomarkers and therapeutic targets.
Although research on biomarkers for metastasis within OS has recently expanded [9][10][11], the targets after any OS diagnosis within the clinic and suitable for various sequencing platforms remain sparse. Recent development of gene chips and high-throughput sequencing technology, have enabled the identification of key genes related to tumor progression and prognosis based on big data integration and bioinformatics. Weighted gene co-expression network analysis (WGCNA) is a systematic biological method that could identify highly synergistically altered gene sets and screen out therapeutic targets or candidate biomarkers based on the inherent characteristics of the gene sets and the correlation between gene sets and phenotypes.
Aiming at identifying and validating key genes in OS metastasis, the present study firstly identified associated module by WGCNA according to the gene expression profiles from GEO datasets and determined the differentially expressed genes between metastatic OS samples and non-metastatic samples. Subsequently, GO and KEGG pathway analyses were performed to determine the most significant pathways associated with OS metastasis.
Additionally, we constructed KM-curves and receiver operating characteristic (ROC) curves and screened the key genes related to OS prognosis. Moreover, univariate and multivariate Cox regression analysis were conducted to evaluate the predictive effect of the gene signature. Finally, we validated the gene signature using an external RNA-Seq expression data obtained from TARGET. The results may reveal the prognostic value of the gene signature for OS (Fig. 1).

Data Sources and Data Preprocessing
We downloaded standardized matrix profile (*series matrix.txt) of GSE21257 (a microarray dataset) and obtained patient information is a microarray from the Gene Expression Omnibus (GEO) database (Table 1) [12]. The platform of the dataset is the GPL10295 Illumina human-6 v2.0 expression beadchip. We removed probes not mapping to the Gene symbol using platform annotation file. For different probes corresponding to the same gene, their median expression values were taken as the final gene expression value.
Differentially expressed genes (DEGs) between OS samples with metastasis and those without metastasis were identified using the "limma" (linear models for microarray data) R package (False discovery rate (FDR) < 0.05 and absolute of log2fold change (FC) > 1) [13].

Constructing Dynamic Weighted Gene Co-Expression Network
We chose the 3000 most-varying genes for network construction and module detection.
Specifically, the median absolute deviation (MAD) was used as a robust measure of variability. The network was built based on the protocols of R package WGCNA [14,15].
We firstly clustered the samples to detect outliers. It appeared there was one outlier and we removed it by hand ( Supplementary Fig. 1A). Use PickSoftThreshold function to select β = 7 (scale-free R2 = 0.89) to build an adjacency matrix to make our gene distribution conform to scale-free networks based on connectivity for training set (Supplementary Based on criteria of MM > 0.8 and GS > 0.2, hub genes in the blue module were screended.
Cytoscape version 3.7.2 was used for network visualization. The above analysis is implemented using the R package "WGCNA".
Functional Annotation and Gene Set Enrichment Analysis (GSEA) R package clusterProfiler was used to conduct Gene Ontology (GO) [16] and Kyoto Encyclopedia of Genes and Genomes (KEGG) biological pathway over representation analysis for interesting module genes [17]. GO terms and KEGG pathways with adjust p < 0.05 were considered statistically significant pathways. The enrichment analysis was implemented in command line of GSEA [18,19]. An expression dataset and phenotype labels in the GSE21257 dataset were used to conduct GSEA analysis according to metastasis status (metastasis vs. non-metastasis). The data was then interrogated against reactome gene sets (1499 gene-sets) from MSigDB version 6.2 [18,20,21]. We set the cut-off criteria as gene set size > 15, Number of enriched gene sets that are significant, as indicated by a FDR of less than 25%.

Cox-regression based survival analysis
Univariate cox regression analysis was firstly performed to screen survival related genes.
Furthermore, the receiver operating characteristic (ROC) analysis was performed to evaluate the predicting efficiency of the gene risk score and the area under curve (AUC) was calculated. The genes with p-value < 0.05 as well as AUC > 0.85 were screened as candidate genes for next analysis. These candidate genes were further selected for predictive signature construction. Risk scores were calculated and included in multivariate regression analysis in a Cox proportional hazard regression model for survival analysis.
The Kaplan-Meier curve was used to visualize the survival probability for each group and p-value was calculated by the log-rank test. The survival analysis was implemented with package survival and survminer. The ROC analysis was performed using pROC package.

Statistical Analysis
Our study used a Wilcoxon rank sum test to compare continuous data between two groups.
a chi-square test or Fisher's exact test to test the difference between categorical variables. A P-value < 0.05 or a adjusted P-value < 0.05 was considered statistically significant. The Kaplan-Meier method and log-rank test was used to evaluate the correlation between gene expression and overall survival. The WGCNA method was analyzed by Pearson correlation analysis. All of these processes were conducted by R software (version 3.5.1 (x64)).

Identification of key modules associated with OS metastasis
After data preprocessing and quality evaluation, an expression matrix with 3000 most varying genes and 52 OS samples with clinical information in GSE21257 was used for gene co-expression network construction ( Fig. 2A). After merging similar modules, we were able to identify a total of six modules and each module was designated by distinct colors to distinguish between modules (Fig. 2B).

Functional annotation and GSEA
We conducted GO function and KEGG pathway analyses to examine the potential functional significance of the genes within blue module. Biological process of GO analysis showed that blue module was mainly enriched with cell migration, cell proliferation, cell cycle and immune response related pathway (Fig. 3A). Figure 3B  were negatively associated with the overall survival of OS patients (Figs. 4C). Moreover, the expression levels of these 4 genes were significantly higher in OS patients with metastasis, compared with non-metastasis patients (Fig. 4D). In addition, the diagnostic performance of these 4 genes was evaluated by ROC curves. The AUC showed that ALOX5AP, HLA-DMB, HLA-DRA and SPINT2 indicated excellent diagnostic efficiency for patients with metastasis and those with non-metastasis (Fig. 4E). Figure 4F showed that ALOX5AP, HLA-DMB, HLA-DRA and SPINT2 were highly connected in the network and demonstrated that the 4 genes play an important role in the development of OS.

Evaluation and Validation of 4-gene signature for survival prediction
To investigate whether the 4-gene signature could provide an accurate prediction of overall survival in OS patients, the 4-gene signature risk score were calculated for each patient in the training set according to the expression of these 4 genes for OS prediction.
Then patients were divided into high-and low-risk groups using the median risk score as the cutoff. As expected, risk model might be a diagnostic marker for OS with an AUC of 0.861 (Fig. 5A) and patients with high-risk scores had a poor prognosis than those with low-risk scores (p = 0.0088) (Fig. 5B). As such, the 4-gene signature was validated using OS data from TARGET, and we achieved consistent results. Kaplan-Meier curves revealed that the high-risk scores of 4-gene signature were significantly associated with shorter overall survival time of OS patients (p = 0.043) (Fig. 5C), which were similar to those observed in the training series. In order to further evaluate whether the expression levels of these four genes can provide good prognostic value, a multivariate Cox regression analysis was performed. The results can be seen in Table 2. It was evident that risk scores calculated from these four gene signature remained a strong independent prognostic factor for patients with OS (p = 0.02). year survival rate has risen to 55%-70%. Preoperative adjuvant chemotherapy followed by radical resection is still the most effective treatment. If surgical resection is not possible, radiotherapy may be beneficial for controlling local tumors. As a generality, metastasis is the most adverse factors at diagnosis among known prognostic factors [22]. There are large differences in survival between patients with metastatic OS (10-20%) and nonmetastatic OS (50-78%) [26,27]. Moreover, metastatic OS are still very difficult to control and there are few effective therapeutic targets. Kinase targets, immune checkpoint inhibitors and cell surface marker GD2 have been actively investigated in multiple current clinical trials, but are inadequately evaluated. Therefore, further studies on early diagnosis or prediction of metastasis are warranted.
In our study, multiple bioinformatics analysis tools were used to identify 4 key genes related to metastasis and prognosis of OS patients, thus we constructed a risk score model which may benefit the treatment and prognosis evaluation of OS.
Using GO, KEGG and GSEA, we annotated the function of genes in the key module most related with metastasis, and clarified the underlying mechanism of metastasis in OS. Our results revealed that these genes were found to be enriched in cell cycle, cell proliferation, cell migration and immune response. Some researchers had demonstrated the functional link between cell cycle disorder and cancer cell invasion and metastasis [13,23,24]. Several small pilot studies have reported that expression of molecules of tumor cell immune response, particularly HLA class II, can induce anti-tumor T cell responses, which may affect tumor progression and survival time of patients [25][26][27].
Hence, we suggested that genes in blue module probably involved in the development and metastasis of OS through cell cycle pathway and immune response pathway.
After screening and filtering, we obtained four genes that may predict OS metastasis and

Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable.

Competing interests
The authors report no conflicts of interest in this work.

Funding
This work was supported by grants from the National Natural Science Foundation of China   The eigengenes were mainly clustered into two clusters, containing 2 modules (modules green and blue) and 3 modules (modules brown, turquoise and yellow), respectively. Figure 1 Flow chart of study design.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to