An individualized immune prognostic signature in lung adenocarcinoma

Background Tumor immune infiltration is closely associated with clinical outcome in lung cancer. We aimed to develop an immune signature to improve the prognostic predictions of lung adenocarcinoma (LUAD). Methods We applied “Cell type Identification by Estimating Relative Subsets of RNA Transcripts” method to quantify the fraction of 22 leukocyte cells from six public microarray datasets. Four datasets from GPL570 were treated as the training cohort and two datasets from GPL96 and GPL10379 as the validation cohorts. An immune risk score (IRS) based on leukocyte cell fraction was established by least absolute shrinkage and selection operator cox regression model. Results IRS consisting of 6 types of leukocytes was constructed in the training dataset. In the training cohort (520 patients), the IRS stratified patients into high-IRS group (215 patients) and low-IRS group (305 patients) with significant differences in overall survival (OS) (HR: 2.77, 95% CI 2.08–3.06). Multivariate analysis including age, gender, stage, IRS and tumor purity revealed the IRS to be an independent prognostic factor in all datasets (training: HR: 10.71, 95% CI 5.72–20.07; validation-1: HR 2.68, 95% CI 1.15–6.27; validation-2: HR 3.71, 95% CI 1.33–10.33); all p < 0.05). IRS was significantly positively correlated to the expression levels of PD1, PDL1, CTLA and LAG3 (all p < 0.001). When integrated with clinical characteristics including stage and age, the composite immune and clinical signature presented with improved prognostic accuracy than IRS (mean C-index 0.66 vs. 0.60). Conclusions The proposed immune-clinical signature could predict OS in patients with LUAD effectively.


Background
Non-small cell lung cancer accounts for 85% of all lung cancers, the most common cancer and cause of cancerrelated mortality world widely [1]. Lung adenocarcinoma (LUAD) is the most diagnosed histological subtype of non-small cell lung cancer [2,3]. Due to the presence of metastatic disease at an early stage, the prognosis for patients with LUAD is generally poor, with average 5-year survival rates of < 20% [4]. Conventionally, clinical decisions regarding cancer treatment and prognosis are based primarily on the AJCC staging system [5].
However, increasing evidence has revealed the clinical importance of tumor-infiltrating immune cells in lung cancer [6][7][8][9][10], combining the survival impact of immune cells with the AJCC staging system could enable clinicians to predict patient survival outcomes more accurately. Therefore, understanding the immune components by gene expression-based algorithms may be helpful for promoting studies of immune response in LUAD. The availability of public genomic datasets provides an ideal resource for large-scale gene expression analysis to identify reliable lung cancer biomarkers [11].
High resolving power is a key benefit of "Cell type Identification by Estimating Relative Subsets of RNA

Open Access
Cancer Cell International Transcripts" (CIBERSORT), which applies LM22 signature matrix to quantify the relative proportions of 22 immune cell types [12]. Because of the superiority of CIBERSORT algorithm over other methods regarding noise, closely related cell subsets and unknown cell types, it has received increasing attention and has been successfully applied to quantify the composition of immune cells in colon, breast, liver cancer and LUAD [13][14][15][16][17].
Therefore, we used the estimated proportions of 22 leukocytes derived from microarray gene expression data to construct and validate an IRS for patients with LUAD. To combine the complementary value of IRS for overall survival (OS) with clinical characteristics, we integrated the IRS with clinical factors to develop a composite prognostic signature, which showed improved prediction of LUAD prognosis.

Re-analysis of microarray data
Six GEO datasets were downloaded as probe-level CEL files. Then, the microarray data were normalized using Robust multiarray average (RMA) method with the affy and simpleaffy packages. The datasets used in the training cohort were quantile normalized after adjusting for batch effects using "combat" function (sva package, R 3.5.3) [24].

Estimation of immune cell type fractions
Gene expression data were subsequently analyzed using the LM22 gene signature and CIBERSORT method to estimate the fractions of 22 tumor infiltrating leukocytes subsets [12]. The CIBERSORT algorithm is well developed and has been verified by fluorescence-activated cell sorting [12]. CIBERSORT derives a p value for the deconvolution of each sample using Monte Carlo sampling, providing a measure of confidence for the results. Patients with a CIBERSORT output of p < 0.05 indicated that the results of the estimated fractions of immune cell populations can be considered accurate [14]. For each tumor sample, the final CIBERSORT output estimates were normalized and the sum of all estimates of immune cell type fractions yields to one.

Study population and clinical variables
Samples with CIBERSORT p value ≥ 0.05 were excluded, as were those with normal and non LUAD samples and patients for whom survival information or relevant clinical information was unknown. Clinical information including age, gender and TNM stage was collected. In this study, tumors were staged following the seventh edition of the AJCC staging system [25]. "Estimation of STromal and Immune cells in Malignant Tumours using Expression data" (ESTIMATE) algorithm was applied to calculate the stromal and immune scores of each sample and tumor purity can be evaluated using the formula reported before [26].

Construction of IRS
The survminer package [27] was applied to determine the optimal cut-off values for each immune cell fraction in the training dataset. Then, the leukocyte fraction level was scored as 0 or 1; a leukocyte fraction level of 1 was assigned when the fraction of one type of leukocyte was more than the corresponding cut-off value; otherwise the fraction level was 0. Thus, the 22 leukocytes were then analyzed as binary variables. To minimize the risk of overfitting, a cox proportional hazards regression model combined with the least absolute shrinkage and selection operator (LASSO) [28] was applied to identify the most important prognostic immune cells, and the optimal values of the penalty parameter λ were determined by tenfold cross-validations at 1 SE beyond the minimum partial likelihood deviance in the training dataset [29]. An IRS model was constructed based on the selected immune cells using lasso cox regression coefficients derived from the training cohort. To separate patients into low-or high-IRS groups, the optimal IRS cutoff was also generated based on the association between IRS and OS using the survminer package.

Validation of the IRS
The predictive value of IRS for OS was evaluated in all patients and in subgroups stratified by age, gender, TNMstage and tumor purity in the training dataset, validation dataset-1 and validation dataset-2 with univariate cox analysis. We also combined IRS with other available variables in multivariate analysis (MVA).

Establishing and validation of immune clinical score
According to the results of MVA in the training dataset, IRS, age and stage were significantly associated with OS. Thus, we integrated IRS, age, and stage to composite an immune clinical score (ICS) using cox proportional hazards regression in the training dataset. Stage was treated as continuous variable: stage I was assigned as 1; II, as 2; III, as 3; and IV, as 4. The prognostic performance of continuous ICS was compared with that of the IRS in terms of C-index. Meanwhile, the sensitivity and specificity of the OS prediction based on the IRS and ICS were evaluated using a time-dependent receiver operating characteristic (ROC) curve [30]. Similar to the aforementioned method for determining the optimal cutoff of IRS, the cutoff value for ICS was also generated using the survminer package. Restricted mean survival time (RMST) represents the life expectancy at 10 years for training dataset and validation dataset-1 and at 4 years for validation dataset-2 because of shorter follow-up time. The performance of binary IRS and ICS was evaluated in terms of the RMST ratio between low-and high-risk groups [31]. Accordingly, a higher RMST ratio indicates a larger prognostic difference.

Statistical analysis
All statistical analysis was conducted using R software (version 3.5.3) and SPSS software (version 25.0). The correlations between the IRS and mRNA expression level of corresponding genes were analyzed using Pearson's correlation test. Gene set enrichment analysis (GSEA) was used to identify the pathways that were significantly enriched in high-IRS and low-IRS groups [32]. Kaplan-Meier method was used to generate survival curves and significance of differences was compared using the log-rank test. Hazard ratios for univariate analysis were calculated using univariate cox proportional hazards regression model. The RMST ratio was estimated with survRM2 package [33]. All statistical tests were twosided and P values less than 0.05 were considered statistically significant.

Patient characteristics
The patient selection criteria and workflow chart are shown in Fig. 1. After applying data filter scheme, 1337 LUADs were used for further analysis. The mean age at diagnosis was 65.20 years and 657 (49.14%) patients were male. Most patients (88.33%) were early stage (stage I or II) diseases and the mean tumor purity was 0.50. Detailed patient characteristics are listed in Table 1.

Derivation of the IRS
The optimal cut-off values were generated for 22 leukocytes in the training cohort (Additional file 1: Table S1). LASSO cox regression analysis was used to build an IRS model (Fig. 2a). Six leukocyte subsets were identified to calculate the IRS as following: Patients in the training cohort were then assigned into a high-IRS group (215 patients) and low-IRS group (305 patients) by the cut-off value (− 0.1652727). The Kaplan-Meier curve showed the patients in the high-IRS group presented with a significantly worse OS in the training dataset (HR 2.77, 95% CI 2.78-3.67, p < 0.01) (Fig. 2b). The median OS was 11.19 years in low-IRS group vs. 4.47 years in the high-IRS group (p < 0.01). The association between the IRS and OS was further investigated in the multivariable Cox regression model (HR: 10.71, 95% CI 5.72-20.07) ( Table 2).

Validation of IRS for predicting overall survival in the validation dataset-1 and validation dataset-2
To ensure that the constructed IRS possessed predictive value for OS in different cohorts, the same formula derived from the training cohort was applied to the validation dataset-1 and validation dataset-2. Patients were assigned to high-or low-IRS group by the cut-off values acquired from the corresponding cohort (validation dataset-1, 0.04828553; validation dataset-2, − 0.03803774). In the validation dataset-1, 183 patients were assigned into low-IRS group and 253 patients were assigned into high-IRS group. As for validation dataset-2, 109 patients were assigned into low-IRS group and 272 patients were assigned into high-IRS group. Consistent with the findings in the training cohort, patients in the high-IRS group presented with a significantly worse OS than those in the low-IRS group in the validation dataset-1 (HR 1.56, 95% CI 1.20-2.02) (Fig. 2c) and the validation dataset-2 (HR 1.83, 95% CI 1.24-2.78) (Fig. 2d). The median OS was 7.81 years in low-IRS group vs. 6.16 years in the high-IRS group in the validation dataset-1 and 4.30 years in low-IRS group vs. 3.48 years in the high-IRS group in the validation dataset-2 (both p < 0.01). The IRS remained as an independent prognostic factor in MVA, after adjusting for clinical characteristics such as age, gender, TNM stage and purity in validation dataset-1 (HR 2.68, 95% CI 1.15-6.27) and the validation dataset-2 (HR 3.71, 95% CI 1.33-10.33) (both p < 0.05) ( Table 2).

The IRS was associated with OS in early stage patients
To further investigate the impacts of clinical characteristics on the prognostic values of the IRS, we conducted stratified analysis according to the baseline characteristics. As shown in Table 3, LUADs were stratified by available baseline characteristics (including age, gender, TNM stage and tumor purity). According to the results of stratified analysis, the IRS discriminated patients with early-stage (I and II) LUAD into significantly different prognostic groups in training dataset, validation dataset-1 and validation dataset-2 (all p < 0.01) ( Table 3). When considering LUADs with stage I disease only, the IRS remained highly prognostic for the meta-overall dataset (combined HR: 2.01, 95% CI 1.26-3.22; p < 0.01) (Additional file 1: Fig. S1).

Biological phenotypes associated with the IRS model
Gene expression data were analyzed to investigate the potential biological phenotypes associated with the IRS model in the training dataset. Firstly, we specially focused on some immune check points and the correlation plot depicted in Fig. 3a showed that the IRS was significantly positively correlated to the expression levels  of PD1, PDL1, CTLA and LAG3 (all p < 0.001). Secondly, as for some immune-activated related transcripts such as GZMA, GZMB, CXCL10 and IFNG, IRS was also significantly positively correlated to the expression levels of them (all p < 0.001) (Fig. 3b).Finally, we performed GSEA to illuminate the biological functions of the IRS model. The results showed that in the high-IRS group genes were significantly enriched in multiple biological processes such as cell cycle pathway and p53 signaling pathway, while in the low-IRS group genes were associated with the metabolism-related gene set, including fatty acid metabolism and propanoate metabolism (Fig. 3c).

Integrated prognostic score combining the IRS with clinical factors
In MVA (  Table S2). The prognostic accuracy of the ICS as a continuous variable was also evaluated by time-dependent ROC analysis (Fig. 4a). An optimal cutoff of 2.135404 for stratifying patients was determined in the training dataset. Similar results were observed in binary form of the ICS compared with the IRS (RMST ratio, 1.56 vs. 1.47 in the training dataset) ( Table 4 and Fig. 5a).

Discussion
In this study, we developed an immune prognostic signature based on the 6 leukocytes and validated it in two independent datasets from different platforms. The results showed a significantly discriminative ability of OS between patients with high-and low-IRS. In addition, IRS can further stratify clinically defined groups of patients (especially early-stage) into subgroups with different survival outcomes. The IRS was significantly positively correlated to the expression levels of some immune  We further investigated the complementary value of IRS and clinical characteristics and found that integrating both could give a more accurate estimation of OS for patients with LUAD. In recent years, immune profiling studies have taken up a research focus in cancer study [9]. In LUAD, several studies have explored the association between tumorinfiltrating lymphocytes and patients' survival. High CD4 + T cell in stroma correlated with longer OS [34] and disease-specific survival [35] and plasma cell infiltration was related to worse prognosis in LUAD patients [36]. It has been argued that macrophages may have a potential role in lung cancer by supporting both host-defense and tumor progression [37]. Mast cells were regarded a double-edged sword in cancer immunity: a higher density of mast cells was reported to correlate with improved survival in patients with LUAD [38] but activated mast cells presented with potential to exert immunosuppressive effects [39] and Takanami et al. [40] found increased mast cell infiltration in LUAD was associated with worse prognosis. Neutrophils represented a significant portion of infiltrating inflammatory cells and high neutrophil density was associated with a higher risk of relapse [41] and was a negative prognostic factor in LUAD [42]. To explore the potential role of tumor-infiltrating lymphocytes may require investigation comprehensively in tumor microenvironment.
Several models [43][44][45] based on immune cells have already been reported to present with strong ability for predicting prognosis in various types of tumors. Immunohistochemistry (IHC) is an important means of investigating tumor immune micro-environment [46] in these studies. But IHC suffers from limitations in available phenotypic markers [47] and provides only a snap shot of the tumor IME assayed on the slide [17]. In addition, a standardized measurement criterion of the intensity of protein staining, and subsequently quantitation of protein expression, was also difficult for IHC in nature [48].
As an alternative, continuously accumulating public genomic data provided an ideal resource for large-scale analysis of the immune landscape, and multiple computer-based algorithms have already been developed to perform such analysis [49]. The candidate immune cells used to construct the IRS were quantified based on a high-throughput gene expression data using bioinformatics tool CIBERSORT. By applying this computational method to public genomic data, it was possible to overcome some technical limitations of IHC and give an expanded insight into the immune profile in tumor. With further use of LASSO Cox regression model, functioning as a statistical method for screening prognosis-related immune cells to construct the IRS model, the predictive ability could be enhanced significantly [50][51][52]. We compared our results with reports from Yang et al. and found that our IRS model using much less immune cell The correlation between IRS and immune-activated related transcripts and y axis represents the expression levels of certain genes. c Gene set enrichment analysis delineates biological pathways between high-and low-IRS groups. IRS immune risk score types showed better predictive ability than IRS model from Yang [53] (mean 5-year AUC of 0.66 vs. 0.62). In addition, we included tumor purity into our analysis to adjust IRS, enabling our results more reliable. The value of the IRS was confirmed in two non-overlapping validation cohorts, indicating its excellent reproducibility for LUAD.
Patients with early-stage lung cancer are also at substantial risk for recurrence and death [54], even after complete surgical treatment and the use of adjuvant therapy in early-stage lung cancer remains controversial. An important finding in our analysis was that the IRS was significantly associated with OS in stage I/II LUAD patients. The prognostic role of IRS was also confirmed in patients with stage I for meta-overall dataset, implying IRS may provide a powerful prognostic indicator for selecting potential patients benefiting from additional therapy. FDA has approved IHC PDL1 expression as predictive biomarker for response to anti-PD1 therapy for patients with NSCLC [55,56]. Since our study revealed obvious enrichment of multiple immune checkpoint markers and immune activation transcripts, especially PDL1, in the high-IRS group, it is reasonable to speculate that immunotherapy might also be a preferable choice for patients in this group. Although patients in high-IRS group presented with poor OS, the application of immunotherapy may bring potential survival benefit. Further studies are warranted to explore whether the IRS model can predict the response of patients with LUAD to   immunotherapy. The ICS integrating the IRS and baseline characteristics not only helped clinicians predict patient outcomes more precisely but boosted its procedure for translation into clinic utility. This study has some limitations. First, as all patients in this study were selected retrospectively, the potential bias relating to unbalanced clinical features with treatment heterogeneity cannot be avoided. Secondly, the gene expression profiles used here were all derived from a core sample of tumor tissue, making it impossible for the location of the immune cell to be taken into consideration when establishing the prognostic IRS model. Thirdly, only signatures validated in independent cohorts of patients with full clinical annotation available could be applied clinically, and thus further investigations should focus on clinical validation for IRS, which may provide more evidence for its translation into clinical practice.

Conclusion
In conclusion, our study demonstrates the utility of consideration of tumor infiltrating leukocytes in the prognosis prediction of LUAD and may provide additional information and strategies for immunotherapy. Prospective studies are needed to further test its analytical accuracy for estimating prognosis and to validate its clinical utility in individualized management of LUAD patients.