A multi-parametric prognostic model based on clinical features and serological markers predicts overall survival in non-small cell lung cancer patients with chronic hepatitis B viral infection

Background To establish and validate a multi-parametric prognostic model based on clinical features and serological markers to estimate overall survival (OS) in non-small cell lung cancer (NSCLC) patients with chronic hepatitis B viral (HBV) infection. Methods The prognostic model was established using Lasso regression analysis in the training cohort. The incremental predictive value of the model over traditional TNM staging and clinical treatment for individualized survival estimation was evaluated by the concordance index (C-index), time-dependent ROC (tdROC) curves, and decision curve analysis (DCA). A nomogram for OS was built by combining the prognostic model risk score with TNM staging and clinical treatment. Patients were divided into high-risk and low-risk subgroups according to the model risk score. The difference in survival between subgroups was analyzed using Kaplan–Meier survival analysis, and correlations between the prognostic model, TNM staging, and clinical treatment were analyzed. Results The C-index of the model for OS is 0.769 in the training cohort and 0.676 in the validation cohort, which is higher than that of TNM staging and clinical treatment. The tdROC curves and DCA show that the model has good predictive accuracy and discriminatory power compared with TNM staging and clinical treatment. The nomogram based on the prognostic model risk score shows some net clinical benefit. According to the model risk score, patients are divided into low-risk and high-risk subgroups, and the difference in OS rates between the subgroups is significant. Furthermore, the model shows a positive correlation with TNM staging and clinical treatment. Conclusions The prognostic model showed good performance compared with traditional TNM staging and clinical treatment for estimating OS in NSCLC (HBV+) patients.


Background
At present, lung cancer is the leading cause of cancer morbidity and mortality worldwide [1]. Non-small cell lung cancer (NSCLC) accounts for 75-80% of all lung malignancies [2]. The 5-year survival of NSCLC patients is generally poor because of late diagnosis, frequent relapse, and the lack of effective systemic therapy [3].
Hepatitis B virus (HBV) infection is one of the most prevalent and most serious types of viral hepatitis, and the prevalence of HBV in China is high [4]. Therefore, it is reasonable to hypothesize that HBV infection may be an important comorbidity in NSCLC patients in China. Previous studies have shown that HBV is associated with several extra-hepatic cancers [5][6][7]. In addition, diffuse large B-cell lymphoma [8] and multiple myeloma [9] patients with HBV infection have poorer survival outcomes than non-infected patients. Together, these results imply that NSCLC patients with HBV infection should be distinguished from uninfected patients because they may have different clinical characteristics, outcomes, and prognostic factors. This may aid in the development of a distinct prognostic model for NSCLC patients with HBV infection.
Currently, the TNM (tumor, lymph node, metastasis) stage is a widely used staging system for predicting the outcome of NSCLC patients [10]. However, patients within the same TNM stage show different genetic, cellular, and clinicopathological characteristics, and exhibit a wide spectrum of clinical survival outcomes. This indicates the need for additional prognostic factors to complement TNM staging and better predict the outcome of NSCLC patients [11][12][13]. Accordingly, many studies have reported prognostic factors that might improve the prediction of survival in NSCLC patients [14][15][16]. Together, these findings could help identify patients who would benefit from novel therapeutic strategies or, alternatively, determine whether additional treatment methods need to be pursued.
Thus, the present retrospective study aimed to develop and validate a multi-parametric prognostic model based on clinical features and serological markers to estimate the overall survival (OS) in NSCLC HBV (+) patients and assess its incremental value to the traditional staging system and clinical treatment for the estimation of OS.

Patient selection and data collection
Newly diagnosed NSCLC (HBV+) patients who were treated at the Sun Yat-sen University Cancer Center (Guangzhou, China) between January 2008 and December 2010 were retrospectively enrolled in this study. This study was approved by the Hospital Ethics Committee of Sun Yat-sen University Cancer Center (Guangzhou, China). The inclusion criteria were as follows: (a) pathological evidence of NSCLC; (b) positive for hepatitis B surface antigen (HBsAg); (c) no co-infection with other types of hepatitis viruses; (d) complete baseline clinical information, laboratory, and follow-up data. Patients without a pathological diagnosis or with previous or concomitant malignancies were excluded.
The following clinical and serological data were collected for each enrolled patient at the time of diagnosis and before any treatment: age, gender, family history, body mass index (BMI), tumor size, clinical treatment, tumor-node-metastasis (TNM) stage [17], white blood cells (WBC), neutrophils (N), lymphocytes (L), platelets (PLT), hepatitis B surface antigen (HBsAg), hepatitis B surface antibody (HBsAb), hepatitis B envelope antigen (HBeAg), hepatitis B envelope antibody (HBeAb), hepatitis B core antibody (HBcAb), hepatitis B core antigen (HBcAb), albumin (ALB), alkaline phosphatase (ALP), apolipoprotein AI (APOA), apolipoprotein B (APOB), C-reactive protein (CRP), lactate dehydrogenase (LDH), gamma-glutamyl transpeptidase (GGT), total bilirubin (TBIL), and direct bilirubin (DBIL). The NLR was the ratio of neutrophils to lymphocytes [18]; the PLR was the ratio of platelets to lymphocytes [18]; the SLR was the ratio of aspartate aminotransferase (AST) to alanine transaminase (ALT) [19]; the ABR was the ratio of APOA to APOB [20]; and the CAR was the ratio of CRP to ALB [21]. The prognostic index (PI) was scored as follows: patients with CRP of 10 mg/L or less and a WBC count of 11 × 10⁹/L or less were allocated a score of 0, patients with only one of these values elevated were allocated a score of 1, and patients with both values elevated were allocated a score of 2 [22]. The prognostic nutritional index (PNI) was calculated as ALB (g/L) + 5 × lymphocyte count (10⁹/L), with a score of 0 for PNI > 45 and a score of 1 for PNI ≤ 45 [23]. The Glasgow prognostic score (GPS) was classified as follows: patients with serum CRP > 10 mg/L and albumin < 35 g/L were classified as GPS 2; patients with only one of these abnormalities (CRP > 10 mg/L or albumin < 35 g/L) were classified as GPS 1; and patients with serum CRP ≤ 10 mg/L and albumin ≥ 35 g/L were classified as GPS 0 [24].
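The derived ratios and scores defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the dictionary keys, and the example laboratory values are all hypothetical, and the thresholds are those quoted in the text (CRP 10 mg/L, WBC 11 × 10⁹/L, PNI 45, albumin 35 g/L).

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, refusing a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


def prognostic_index(crp_mg_l, wbc_1e9_l):
    """PI: one point each for CRP > 10 mg/L and WBC > 11 x 10^9/L."""
    return int(crp_mg_l > 10) + int(wbc_1e9_l > 11)


def pni_score(alb_g_l, lymphocytes_1e9_l):
    """PNI = ALB + 5 x lymphocyte count; score 0 if PNI > 45, else 1."""
    pni = alb_g_l + 5 * lymphocytes_1e9_l
    return 0 if pni > 45 else 1


def glasgow_prognostic_score(crp_mg_l, alb_g_l):
    """GPS: one point each for CRP > 10 mg/L and albumin < 35 g/L."""
    return int(crp_mg_l > 10) + int(alb_g_l < 35)


def derive_indices(labs):
    """Compute the derived ratios and scores from baseline values."""
    return {
        "NLR": ratio(labs["neutrophils"], labs["lymphocytes"]),
        "PLR": ratio(labs["platelets"], labs["lymphocytes"]),
        "SLR": ratio(labs["ast"], labs["alt"]),
        "ABR": ratio(labs["apoa"], labs["apob"]),
        "CAR": ratio(labs["crp"], labs["alb"]),
        "PI": prognostic_index(labs["crp"], labs["wbc"]),
        "PNI_score": pni_score(labs["alb"], labs["lymphocytes"]),
        "GPS": glasgow_prognostic_score(labs["crp"], labs["alb"]),
    }


# Hypothetical baseline values for one patient (illustration only).
example = {
    "neutrophils": 4.5, "lymphocytes": 1.5, "platelets": 240.0,
    "ast": 30.0, "alt": 20.0, "apoa": 1.2, "apob": 0.8,
    "crp": 12.0, "alb": 40.0, "wbc": 9.0,
}
indices = derive_indices(example)
```

Expressing PI and GPS as sums of two threshold indicators keeps the three-level scoring in one line each and makes the cut-offs easy to audit.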

Patient follow-up
Survival data were obtained from medical records, email, and direct telephone contact. All patients were followed up until death or January 2016. The endpoint of this study was overall survival (OS), which was defined as the time interval from diagnosis to the date of the patient's death, or censored at the date of the last follow-up.
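The OS definition above translates into a follow-up time and an event indicator per patient. The sketch below is illustrative only: the function name is hypothetical, the exact day of the January 2016 cut-off is an assumption (the text gives only the month), and reporting time in months of 30.4375 days is a convention chosen here, not one stated in the study.

```python
from datetime import date

# Assumption: the text says only "January 2016"; the day is chosen here.
STUDY_END = date(2016, 1, 31)
DAYS_PER_MONTH = 30.4375  # average month length, an illustrative convention


def overall_survival(diagnosis, death=None, last_follow_up=STUDY_END):
    """Return (months, event): event is 1 for death, 0 for censoring."""
    if death is not None and death <= last_follow_up:
        end, event = death, 1
    else:
        # No recorded death before the cut-off: censor at last follow-up.
        end, event = last_follow_up, 0
    days = (end - diagnosis).days
    if days < 0:
        raise ValueError("end date precedes diagnosis date")
    return days / DAYS_PER_MONTH, event


# Hypothetical patients (dates invented for illustration).
died = overall_survival(date(2009, 3, 1), death=date(2011, 3, 1))
censored = overall_survival(date(2010, 6, 15))
```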

Statistical analyses
Statistical analyses were performed using IBM SPSS Statistics software version 19.0 (IBM Corp., Chicago, IL, USA) and R version 3.6.0 (http://www.R-project.org). Categorical variables were classified based on clinical findings, and continuous variables were transformed into categorical variables based on cut-off values determined with the R packages "survival" [25] and "survminer". Differences in distribution between patients in the training cohort and validation cohort were analyzed by the Chi-square test. Lasso regression analysis was used to select the most useful prognostic variables in the training cohort. Depending on the regularization weight λ, Lasso shrinks all regression coefficients towards zero and sets the coefficients of many irrelevant features exactly to zero. The optimal value of the penalty parameter λ was determined by tenfold cross-validation with the 1 standard error of the minimum criterion (the 1-SE criterion), that is, the largest λ whose cross-validation error lies within one standard error of the minimum. Retained features with nonzero coefficients were used for regression model fitting [26,27]. Next, a prognostic model risk score was computed for each patient as a linear combination of the selected variables weighted by their respective coefficients. The R package "glmnet" was used for Lasso regression analysis. The incremental predictive value of the prognostic model over traditional TNM staging and clinical treatment for individualized survival estimation was evaluated by Harrell's concordance index (C-index), time-dependent ROC (tdROC) curves, and decision curve analysis [28]. The area under the curve (AUC) was calculated using the "survivalROC" package [29], and the C-index was computed and compared using the "survcomp" package [30]. A nomogram was developed with the R package "rms" using the prognostic model risk score, TNM staging, and clinical treatment.
The performance of the nomogram was assessed by calibration curves in internal validation with bootstrapping (1000 bootstrap resamples) [31]. For subsequent comparison, patients were divided into high- and low-risk groups based on the optimal cut-off value of the prognostic model risk score, and Kaplan-Meier survival analyses and log-rank tests were used to assess differences in OS between patients in the predicted high- and low-risk groups. The correlation between the prognostic model and TNM staging or clinical treatment was evaluated by Pearson's correlation coefficient [32]. Results with two-sided p values of < 0.05 were considered statistically significant.
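To make the choice of λ concrete, the sketch below shows how the two usual candidates are read off a cross-validation summary: the λ with the minimum mean error, and the largest λ whose mean error is within one standard error of that minimum (the conventional reading of the 1-SE rule; in "glmnet" these are reported as `lambda.min` and `lambda.1se`). The function name and the numbers are hypothetical and are not taken from the study.

```python
def select_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas  -- candidate penalty values
    cv_mean  -- mean cross-validated error for each lambda
    cv_se    -- standard error of that mean for each lambda
    """
    if not lambdas or not (len(lambdas) == len(cv_mean) == len(cv_se)):
        raise ValueError("inputs must be non-empty and of equal length")
    # Index of the lambda with the smallest mean cross-validated error.
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # 1-SE rule: accept any lambda whose error is within one SE of the best.
    threshold = cv_mean[best] + cv_se[best]
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean) if err <= threshold
    )
    return lambdas[best], lambda_1se


# Hypothetical cross-validation results (illustration only).
lambdas = [0.20, 0.10, 0.05, 0.02, 0.01]
cv_mean = [12.0, 11.2, 10.9, 10.8, 11.0]
cv_se = [0.4, 0.4, 0.3, 0.3, 0.3]
lambda_min, lambda_1se = select_lambda(lambdas, cv_mean, cv_se)
```

Because a larger λ means stronger shrinkage, the 1-SE choice yields a sparser model than the minimum-error choice at a statistically comparable cross-validation error.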

Patient characteristics
In this study, a total of 201 eligible patients are analyzed (Table 1). No clinical or serological parameters, except for ALB, PLR, HBeAg, HBeAb, and HBcAb, have a significantly different distribution between the training cohort and the validation cohort.

Construction of the multi-parametric prognostic model based on clinical and serological markers
To select prognostic clinical and serological markers, Lasso regression analysis is performed based on OS in the training cohort. Figure 1a shows the coefficient trajectory of each factor analyzed. Moreover, tenfold cross-validation is used for model establishment, and the confidence interval under each λ is presented in Fig. 1b. The optimal value of λ is 0.046 in the Lasso regression analysis. Thus, this value is selected for the final model, which includes 10 predictors from the 34 markers (Fig. 1c). Subsequently, a multi-parametric prognostic model based on clinical and serological markers is constructed using the coefficients derived from the Lasso regression analysis. Next, a prognostic model risk score is calculated from the personalized levels of the 10 predictors, using a formula in which each predictor is weighted by its Lasso coefficient. In this formula, each variable level is valued as 0 or 1; a value of 0 is assigned when the marker is less than or equal to the corresponding cut-off value, otherwise a value of 1 is assigned.
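The scoring rule described above (dichotomize each predictor at its cut-off, then sum the coefficient-weighted indicators) can be sketched as follows. The marker names, cut-offs, and coefficients below are hypothetical placeholders, since the study's formula is not reproduced in this text. The grouping threshold of −0.12 is the cut-off reported later in the paper, and treating scores above it as high-risk is an assumption made for illustration.

```python
def dichotomize(value, cutoff):
    """Return 0 if value <= cutoff, otherwise 1 (the rule in the text)."""
    return 0 if value <= cutoff else 1


def risk_score(values, cutoffs, coefficients):
    """Sum of coefficient x indicator over the selected predictors."""
    missing = set(coefficients) - (set(values) & set(cutoffs))
    if missing:
        raise KeyError(f"missing value or cut-off for: {sorted(missing)}")
    return sum(
        coef * dichotomize(values[name], cutoffs[name])
        for name, coef in coefficients.items()
    )


def risk_group(score, cutoff=-0.12):
    """Assumption: scores above the cut-off are labelled high-risk."""
    return "high" if score > cutoff else "low"


# HYPOTHETICAL coefficients and cut-offs, not the study's fitted values.
coefficients = {"marker_a": 0.5, "marker_b": -0.25, "marker_c": 0.125}
cutoffs = {"marker_a": 60.0, "marker_b": 1.0, "marker_c": 200.0}
patient = {"marker_a": 65.0, "marker_b": 1.5, "marker_c": 150.0}

score = risk_score(patient, cutoffs, coefficients)
group = risk_group(score)
```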

Assessment of performance of prognostic model and verification
The C-index is used to compare the discrimination performance of the prognostic model with that of TNM staging or clinical treatment. The results are presented in Table 2. In the training cohort, the C-index for the prognostic model is 0.769 (95% confidence interval (CI) 0.721-0.817), which is higher than that of TNM staging (0.710, 95% CI 0.661-0.758, P = 0.079) and clinical treatment (0.694, 95% CI 0.643-0.746, P = 0.017). Moreover, when compared with either TNM staging or clinical treatment, the prognostic model shows a better discrimination capability in the validation cohort, with higher C-indexes. The prognostic accuracy of the prognostic model and TNM staging or clinical treatment in these cohorts is also assessed using tdROC analysis (Fig. 2). In addition, decision curve analysis (Fig. 3) shows that the prognostic model has a higher overall net benefit compared with traditional TNM staging and clinical treatment across the majority of the range of reasonable threshold probabilities in the training cohort and validation cohort.
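For readers unfamiliar with the C-index, the sketch below computes a simplified version of Harrell's statistic: among comparable patient pairs (the patient with the shorter follow-up time had an observed death), it is the fraction in which that patient also has the higher risk score, with tied scores counted as one half. This is an illustrative pure-Python version, not the "survcomp" implementation used in the study. Pairs with tied times are simply skipped here, and the data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Simplified Harrell's C-index for right-censored survival data.

    times       -- follow-up time for each patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk
    """
    if not (len(times) == len(events) == len(risk_scores)):
        raise ValueError("inputs must have equal length")
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i died, and j was followed for longer.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up data (months, event flag, model risk score).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
scores = [0.9, 0.6, 0.4, 0.1]
c_index = harrell_c_index(times, events, scores)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which gives a scale for reading the reported values of 0.769 and 0.676.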

Construction of the prognostic model risk score based nomogram
In this study, we built a nomogram that consists of the prognostic model risk score, TNM staging, and clinical treatment to predict 1-, 3-, and 5-year OS in the training cohort and validation cohort (Fig. 4a). Within each variable, each level is assigned a number of points. For example, to use the nomogram, locate the patient's model risk score and draw a line straight upward to the "Points" axis to determine how many points are associated with that model risk score. The process is repeated for each variable, the points obtained for each covariate are summed, and the sum is located on the "Total Points" axis. Finally, a line is drawn straight down to identify the patient's probability of OS at 1, 3, and 5 years. The calibration plots for the probability of survival at 1, 3, and 5 years show good agreement between the nomogram predictions and the actual observations (Fig. 4b-d).

Performance of the prognostic model risk score in stratifying patient risk
The optimal cut-off value of the model risk score is −0.12 (Fig. 5). Next, patients are divided into 2 subgroups (Table 3). Subsequently, Kaplan-Meier survival analysis is performed according to the stratified subgroups (Fig. 6a). The Kaplan-Meier curves show significant differences in survival distributions between the stratified subgroups in the training cohort. Similar results are observed in the validation cohort. Furthermore, stratified analyses of NSCLC HBV (+) patients with stage I/II and stage III/IV disease are performed (Fig. 6b, c). In the training cohort, stratification by the prognostic model risk score results in significant differences in Kaplan-Meier OS curves for patients in each stage group. Furthermore, for the

Discussion
In the present study, we first analyzed individual clinical features and serological markers based on the survival analysis approach. Then, a multi-parametric prognostic model was generated by using the Lasso regression model for predicting the OS in NSCLC HBV (+) patients.
Our prognostic model showed better predictive accuracy and discriminative ability compared to traditional TNM staging and clinical treatment. The prognostic model signature successfully stratified those patients into high-risk and low-risk subgroups with significant differences in OS.
According to the results of Lasso regression analysis, the present prognostic model consisted of 10 prognostic factors: age, BMI, tumor size, PLT, PLR, ALT, GGT, LDH, TBIL, and APOA. All 10 factors had been reported to be associated with OS in lung cancer patients [33][34][35][36][37][38][39][40][41][42][43]. These findings suggested that our results had credible prognostic value. We next compared the predictive accuracy of the prognostic model with that of traditional TNM staging and clinical treatment. The data showed that the C-index of the prognostic model was higher than that of TNM staging and clinical treatment in the training cohort. tdROC curve analysis showed that, compared with traditional TNM staging and clinical treatment, our prognostic model exhibited good accuracy in predicting 1-year survival (AUC = 0.857), 3-year survival (AUC = 0.845), and 5-year survival (AUC = 0.879) of NSCLC HBV (+) patients in the training cohort. Furthermore, the decision curve analysis showed that the prognostic model performed well in prognosis prediction compared with TNM staging and clinical treatment in the training cohort. Similar results were observed in the validation cohort.
To complement the shortcomings of current TNM staging in the prognostic assessment of NSCLC HBV (+) patients, the prognostic model risk score of patients was calculated, and prediction and verification were carried out. The results showed that the prognostic model risk score successfully classified patients into high-risk and low-risk subgroups within stages I/II and III/IV, and that high-risk patients had poor survival outcomes. Therefore, even among patients in the same stage, high-risk patients may need more intensive treatment. These results implied that the prognostic model could reinforce the prognostic ability of TNM staging, and the improved prediction of individual outcomes would be useful for counselling patients, personalizing treatment, and scheduling patients' follow-up. Of note, significant positive correlations were observed among the prognostic model, TNM staging, and clinical treatment, suggesting that the prognostic model could be useful in predicting the outcomes of NSCLC HBV (+) patients and might be useful in treatment decisions.
Compared to previous studies [44,45], this study had the following advantages: (1) To increase prognostic accuracy, many potential prognostic factors were assessed, more than in previous studies. (2) We developed the prognostic model using Lasso regression analysis as the statistical method for screening variables, which adjusts for overfitting of the model and avoids extreme predictions. Thus, the predictive accuracy could be improved, and this approach has been applied in many studies [27,46,47]. (3) The prognostic model differed from those presented in previous studies because it did not include TNM staging. Therefore, whether it can be used for patients with TNM staging is unclear. Moreover, the C-index of the prognostic model was approximately equivalent to or even higher than that of previously reported models. (4) For further research, continuous variables need to be transformed into categorical variables based on cut-off values. There are limitations in choosing cut-off values for continuous variables, because the cut-off values are determined by the analyzed data, and different data sets yield different cut-off values. To overcome this limitation, in this study, the continuous variables did not need to be transformed into categorical variables. Thus, the model was convenient for application at other centers.
However, some limitations of our study should be considered. First, this was a retrospective study, so potential bias cannot be fully excluded. Second, our endpoint was OS, and further research on disease-free survival (DFS) should also be conducted. Third, other predictive biomarkers, such as radiomics features [48], carcinoembryonic antigen (CEA) [49], cytokeratin 19 fragment (CYFRA21-1) [49], epidermal growth factor receptor (EGFR) [50], circulating tumor cells [51], and circulating cell-free DNA [52], were not analyzed in the current study. Finally, the data were obtained from a single cancer center, and the sample size was small. In the future, a large-scale, multicenter validation of the results will be required. Despite the above-mentioned shortcomings, the prognostic model was effective and may be useful in predicting the outcomes of NSCLC HBV (+) patients.

Conclusions
In conclusion, this study provided a multi-parametric prognostic model derived from clinical features and serological markers that showed favorable performance when compared to traditional TNM staging and clinical treatment for individualized OS estimation. The nomogram based on the prognostic model, TNM staging, and clinical treatment can reinforce the prognostic ability of TNM staging. Therefore, this simple, precise and understandable prognostic model may serve as a potential tool for clinicians in counselling patients, personalizing treatment, and scheduling the follow-up for NSCLC HBV (+) patients.

Figure caption: The optimal cut-off value of prognostic model risk score using R package "survival"