Methods and Applications: An Improved Training Algorithm Based on Ensemble Penalized Cox Regression for Predicting Absolute Cancer Risk

  • Abstract

    Introduction

    Biases in cancer incidence characteristics have led to significant imbalances in databases constructed by prospective cohort studies. Since they use imbalanced databases, many traditional algorithms for training cancer risk prediction models perform poorly.

    Methods

    To improve prediction performance, we introduced a Bagging ensemble framework into an absolute risk model based on penalized Cox regression, yielding an ensemble penalized Cox regression (EPCR) model. We then tested whether the EPCR model outperformed traditional regression models while varying the censoring rate of the simulated data.

    Results

    Six simulation studies were performed, each with 100 replicates. To assess model performance, we calculated the mean false discovery rate (FDR), false omission rate, true positive rate (TPR), true negative rate, and area under the receiver operating characteristic curve (AUC). We found that the EPCR procedure could reduce the FDR for important variables at the same TPR, thereby achieving more accurate variable screening. In addition, we used the EPCR procedure to build a breast cancer risk prediction model based on the Breast Cancer Cohort Study in Chinese Women database. AUCs for 3- and 5-year predictions were 0.691 and 0.642, representing improvements of 0.189 and 0.117 over the classical Gail model, respectively.

    Discussion

    We conclude that the EPCR procedure can overcome challenges posed by imbalanced data and improve the performance of cancer risk assessment tools.

  • Funding: Supported by the China Postdoctoral Science Foundation (grants 2021M691911 and 2021M701997); National Key Research and Development Program of China (2016YF0901301); and the General programs of Natural Science Foundation of Shandong Province (ZR2021MH243)
  • [1] Maloof MA. Learning when data sets are imbalanced and when costs are unequal and unknown. In: ICML-2003 workshop on learning from imbalanced data sets II. Washington: ICML. 2003. https://www.site.uottawa.ca/~nat/Workshop2003/maloof-icml03-wids.pdf.
    [2] Breiman L. Bagging predictors. Mach Learn 1996;24(2):123-40. http://dx.doi.org/10.1023/A:1018054314350.
    [3] Liang G, Zhang C. Empirical study of bagging predictors on medical data. In: Conferences in research and practice in information technology series. Ballarat, Australia: OPUS. 2010; p. 31-40. https://opus.lib.uts.edu.au/handle/10453/19124.
    [4] Dudoit S, Fridlyand J. Bagging to improve the accuracy of a clustering procedure. Bioinformatics 2003;19(9):1090-9. http://dx.doi.org/10.1093/bioinformatics/btg038.
    [5] Cox DR. Regression models and life-tables. J Roy Stat Soc B Methodol 1972;34(2):187-220. http://dx.doi.org/10.1111/j.2517-6161.1972.tb00899.x.
    [6] Cox DR. Partial likelihood. Biometrika 1975;62(2):269-76. http://dx.doi.org/10.1093/biomet/62.2.269.
    [7] Zou H, Hastie T. Regularization and variable selection via the elastic net. J Roy Stat Soc B Stat Methodol 2005;67(2):301-20. http://dx.doi.org/10.1111/j.1467-9868.2005.00503.x.
    [8] Gui J, Li HZ. Penalized Cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 2005;21(13):3001-8. http://dx.doi.org/10.1093/bioinformatics/bti422.
    [9] Bao HL, Liu LY, Fang LW, Cong S, Fu ZT, Tang JL, et al. The Breast Cancer Cohort Study in Chinese Women: the methodology of population-based cohort and baseline characteristics. Chin J Epidemiol 2020;41(12):2040-5. http://dx.doi.org/10.3760/cma.j.cn112338-20200507-00695. (In Chinese).
    [10] Gail MH, Brinton LA, Byar DP, Corle DK, Green SB, Schairer C, et al. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. J Natl Cancer Inst 1989;81(24):1879-86. http://dx.doi.org/10.1093/jnci/81.24.1879.
    [11] Chen HL, Huang CC, Yu XG, Xu X, Sun X, Wang G, et al. An efficient diagnosis system for detection of Parkinson’s disease using fuzzy k-nearest neighbor approach. Expert Syst Appl 2013;40(1):263-71. http://dx.doi.org/10.1016/j.eswa.2012.07.014.
    [12] Mohan S, Thirumalai C, Srivastava G. Effective heart disease prediction using hybrid machine learning techniques. IEEE Access 2019;7:81542-54. http://dx.doi.org/10.1109/ACCESS.2019.2923707.
    [13] Yu W, Liu TB, Valdez R, Gwinn M, Khoury MJ. Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Med Inf Decis Making 2010;10(1):16. http://dx.doi.org/10.1186/1472-6947-10-16.
    [14] Alelyani S. Stable bagging feature selection on medical data. J Big Data 2021;8(1):11. http://dx.doi.org/10.1186/S40537-020-00385-8.
    [15] Han YT, Lv J, Yu CQ, Guo Y, Bian Z, Hu YZ, et al. Development and external validation of a breast cancer absolute risk prediction model in Chinese population. Breast Cancer Res 2021;23(1):62. http://dx.doi.org/10.1186/s13058-021-01439-2.
  • FIGURE 1.  Box plots of AUC values for each modeling method. Data show boxplots of 100 replicates of settings 1–6. (A) n=1,000, p=100, 30% censoring; (B) n=1,000, p=100, 50% censoring; (C) n=1,000, p=100, 70% censoring; (D) n=1,000, p=50, 30% censoring; (E) n=1,000, p=50, 50% censoring; (F) n=1,000, p=50, 70% censoring.

    Abbreviation: EPCR=Ensemble penalized Cox regression; PCR=Penalized Cox regression; AUC=Areas under the receiver operating characteristic curve; EN=Elastic net; AIC=Akaike Information Criterion; BIC=Bayesian Information Criterion; LASSO=Least absolute shrinkage and selection operator.

    FIGURE 2.  The ROC curve for 3- and 5-year model predictions of disease onset. (A) 3-year ROC; (B) 5-year ROC.

    Note: Red indicates the ROC curve of the EPCR model, orange indicates the ROC curve of the PCR model, and lime green indicates the ROC curve of the Gail model.

    Abbreviation: ROC=receiver operating characteristic; EPCR=ensemble penalized Cox regression; PCR=penalized Cox regression; AUC=the areas under the receiver operating characteristic curve.

    TABLE 1.  The mean values of 5 metrics for the 6 models over 100 replicate experiments for each simulation setting.

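    | Setting | Approach | Method | FDR | FOR | TPR | TNR | AUC |
    |---|---|---|---|---|---|---|---|
    | 1: 30% censoring | Traditional | Stepwise-AIC* | 0.766 | 0.127 | 0.344 | 0.795 | 0.721 |
    | 1: 30% censoring | Traditional | Stepwise-BIC | 0.173 | 0.125 | 0.201 | 0.99 | 0.733 |
    | 1: 30% censoring | Traditional | PCR-LASSO | 0.275 | 0.009 | 0.952 | 0.922 | 0.863 |
    | 1: 30% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.375 | 0.005 | 0.973 | 0.878 | 0.873 |
    | 1: 30% censoring | Ensemble | EPCR-LASSO§ | 0.111 | 0.011 | 0.936 | 0.977 | 0.878 |
    | 1: 30% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.202 | 0.007 | 0.963 | 0.952 | 0.878 |
    | 2: 50% censoring | Traditional | Stepwise-AIC | 0.795 | 0.134 | 0.317 | 0.779 | 0.704 |
    | 2: 50% censoring | Traditional | Stepwise-BIC | 0.239 | 0.130 | 0.168 | 0.986 | 0.704 |
    | 2: 50% censoring | Traditional | PCR-LASSO | 0.321 | 0.011 | 0.939 | 0.907 | 0.858 |
    | 2: 50% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.407 | 0.007 | 0.965 | 0.864 | 0.869 |
    | 2: 50% censoring | Ensemble | EPCR-LASSO | 0.169 | 0.017 | 0.903 | 0.964 | 0.865 |
    | 2: 50% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.255 | 0.012 | 0.937 | 0.936 | 0.874 |
    | 3: 70% censoring | Traditional | Stepwise-AIC | 0.809 | 0.140 | 0.299 | 0.76 | 0.690 |
    | 3: 70% censoring | Traditional | Stepwise-BIC | 0.301 | 0.136 | 0.124 | 0.984 | 0.678 |
    | 3: 70% censoring | Traditional | PCR-LASSO | 0.368 | 0.018 | 0.905 | 0.892 | 0.842 |
    | 3: 70% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.46 | 0.011 | 0.945 | 0.842 | 0.855 |
    | 3: 70% censoring | Ensemble | EPCR-LASSO | 0.242 | 0.028 | 0.843 | 0.945 | 0.864 |
    | 3: 70% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.348 | 0.018 | 0.903 | 0.904 | 0.872 |
    | 4: 30% censoring | Traditional | Stepwise-AIC | 0.555 | 0.260 | 0.337 | 0.809 | 0.733 |
    | 4: 30% censoring | Traditional | Stepwise-BIC | 0.098 | 0.258 | 0.199 | 0.987 | 0.732 |
    | 4: 30% censoring | Traditional | PCR-LASSO | 0.191 | 0.020 | 0.955 | 0.888 | 0.858 |
    | 4: 30% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.257 | 0.010 | 0.979 | 0.834 | 0.882 |
    | 4: 30% censoring | Ensemble | EPCR-LASSO | 0.093 | 0.028 | 0.935 | 0.954 | 0.883 |
    | 4: 30% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.163 | 0.017 | 0.963 | 0.909 | 0.894 |
    | 5: 50% censoring | Traditional | Stepwise-AIC | 0.609 | 0.272 | 0.315 | 0.784 | 0.713 |
    | 5: 50% censoring | Traditional | Stepwise-BIC | 0.121 | 0.267 | 0.161 | 0.985 | 0.705 |
    | 5: 50% censoring | Traditional | PCR-LASSO | 0.207 | 0.027 | 0.941 | 0.878 | 0.853 |
    | 5: 50% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.277 | 0.016 | 0.969 | 0.818 | 0.867 |
    | 5: 50% censoring | Ensemble | EPCR-LASSO | 0.115 | 0.040 | 0.905 | 0.943 | 0.877 |
    | 5: 50% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.179 | 0.026 | 0.941 | 0.901 | 0.877 |
    | 6: 70% censoring | Traditional | Stepwise-AIC | 0.617 | 0.281 | 0.281 | 0.793 | 0.716 |
    | 6: 70% censoring | Traditional | Stepwise-BIC | 0.127 | 0.275 | 0.128 | 0.987 | 0.696 |
    | 6: 70% censoring | Traditional | PCR-LASSO | 0.271 | 0.047 | 0.903 | 0.835 | 0.836 |
    | 6: 70% censoring | Traditional | PCR-EN ($\alpha=0.5$) | 0.322 | 0.032 | 0.940 | 0.783 | 0.851 |
    | 6: 70% censoring | Ensemble | EPCR-LASSO | 0.149 | 0.066 | 0.845 | 0.927 | 0.862 |
    | 6: 70% censoring | Ensemble | EPCR-EN ($\alpha=0.5$) | 0.217 | 0.047 | 0.898 | 0.875 | 0.870 |

    Abbreviation: EPCR=Ensemble penalized Cox regression; PCR=Penalized Cox regression; AUC=Areas under the receiver operating characteristic curve; EN=Elastic net; FDR=False discovery rate; FOR=False omission rate; TPR=True positive rate; TNR=True negative rate; AIC=Akaike Information Criterion; BIC=Bayesian Information Criterion; LASSO=Least absolute shrinkage and selection operator.
    * The method “Stepwise-AIC (BIC)” refers to fitting a Cox model using stepwise procedures based on the AIC (BIC) criterion.
    The method “PCR-LASSO [EN ($\alpha=0.5$)]” refers to a Cox model with a LASSO-type [EN-type ($\alpha=0.5$)] penalty.
    § The method “EPCR-LASSO [EN ($\alpha=0.5$)]” refers to an ensemble penalized Cox regression model whose base models were trained by the Cox regression algorithm with a LASSO-type [EN-type ($\alpha=0.5$)] penalty.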
  • 1. Department of Breast Surgery, The Second Hospital, Cheeloo College of Medicine, Shandong University, Jinan City, Shandong Province, China
  • 2. School of Mathematics, Shandong University, Jinan City, Shandong Province, China
  • 3. Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan City, Shandong Province, China
  • 4. Institute of Translational Medicine of Breast Disease Prevention and Treatment, Shandong University, Jinan City, Shandong Province, China
  • Corresponding authors:

    Jiadong Ji, jiadong@sdu.edu.cn

    Zhigang Yu, yuzhigang@sdu.edu.cn

  • Online Date: March 03 2023
    Issue Date: March 03 2023
    doi: 10.46234/ccdcw2023.037
  • Most cancer prediction tasks involve imbalanced binary classification datasets, in which cases are far fewer than controls. Predicting cases correctly matters most, because misclassifying a case can be more costly (1). However, traditional supervised learning algorithms tend to have low predictive accuracy for the minority class. Ensemble learning is a powerful approach for building highly accurate predictive models, and Bagging (2-3), a simple yet effective ensemble method, has been employed in many practical applications (4). This paper proposes an ensemble penalized Cox regression (EPCR) model for disease risk prediction and validates its accuracy through numerical simulations and an empirical study on the Breast Cancer Cohort Study in Chinese Women database.

    • We propose an ensemble penalized Cox regression (EPCR) model built on penalized Cox regression (PCR) models (5-8) (Supplementary Figure S1). Given the original dataset $D=\{\widetilde{T}_i,\Delta_i,Z_i\}\ (i=1,\cdots,n)$, we first use repeated sampling to generate $B$ bootstrap datasets $D^{(k)}=\{\widetilde{T}_i^{(k)},\Delta_i^{(k)},Z_i^{(k)}\}_{i=1}^{n}\ (k=1,\cdots,B)$. Next, a set of base learners $\widehat{P}^{(k)}(a,\tau,Z)\ (k=1,\cdots,B)$ is trained independently by the PCR algorithm, one on each $D^{(k)}$. More details about the PCR algorithm are provided in the Supplementary Materials. For each sample in the test set, EPCR makes its prediction by averaging the predicted probabilities given by the $B$ base learners.
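The resample-train-average structure described above can be sketched in a few lines of Python. This is only an illustration of the Bagging wrapper, not the authors' implementation: the names `fit_bagged`, `predict_bagged`, and `fit_base` are hypothetical, and the toy base learner `fit_event_rate` is a stand-in that ignores covariates, whereas the paper's base learner is a penalized Cox model.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) rows with replacement (one bootstrap dataset D^(k))."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def fit_bagged(data, fit_base, n_bootstrap=200, seed=0):
    """Train one base learner per bootstrap sample.

    fit_base is a hypothetical callable: it takes a list of rows and
    returns a predictor function row -> probability. Any fitter with
    that shape (e.g. a penalized Cox fitter) could be plugged in.
    """
    rng = random.Random(seed)
    return [fit_base(bootstrap_sample(data, rng)) for _ in range(n_bootstrap)]

def predict_bagged(learners, row):
    """Ensemble prediction: average of the base learners' probabilities."""
    return mean(learner(row) for learner in learners)

# Toy stand-in base learner (NOT a Cox model): predicts the event
# fraction observed in its bootstrap sample, ignoring covariates.
def fit_event_rate(rows):
    rate = sum(delta for _, delta, _ in rows) / len(rows)
    return lambda row: rate

# Rows are (observed time, event indicator, covariates); values are made up.
toy = [(1.0, 1, [0.2]), (2.0, 0, [0.1]), (0.5, 1, [0.9]), (3.0, 0, [0.4])]
learners = fit_bagged(toy, fit_event_rate, n_bootstrap=50, seed=1)
p_hat = predict_bagged(learners, toy[0])
```

Because each base learner sees a different resample, the averaged prediction is less sensitive to whether any single case happens to be included, which is the property the Bagging framework is used for here.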

    • To assess the predictive accuracy of the proposed EPCR procedure and to compare its performance with alternative methods, namely Cox regression with stepwise selection [using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC)] and a single PCR model, we conducted simulation studies across a range of conditions by varying the censoring rate and the dimensionality of the predictors. We were particularly interested in the ability of the EPCR procedure to correctly identify important predictors associated with cancer, as well as its accuracy in predicting cancer risk.

      The $p$-dimensional predictor vector is assumed to be continuous and generated from a multivariate normal distribution with mean $\mu=0$ and covariance matrix $\Sigma=\left(0.8^{|j-i|}\right),\ i,j=1,\dots,p$. The first 15 of the $p$ predictors were assumed to be genuinely associated with the onset of cancer. For simplicity, we specified the regression coefficients of the Cox model as 1.5 for the first five predictors, 1 for predictors 6–10, 0.5 for predictors 11–15, and 0 for the rest.
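A minimal sketch of this predictor design, using only the Python standard library. It relies on the fact that a stationary AR(1) recursion with unit variance has covariance $\rho^{|j-i|}$ between positions $i$ and $j$, so no matrix factorization is needed. The function names are illustrative and not taken from the paper.

```python
import math
import random

def simulate_predictors(n, p, rho=0.8, seed=0):
    """Draw n rows of p predictors with Cov(Z_i, Z_j) = rho**|i-j|.

    Uses the AR(1) recursion Z_1 = e_1, Z_j = rho*Z_{j-1} + sqrt(1-rho^2)*e_j
    with e_j ~ N(0, 1), which has zero mean, unit variance, and exactly
    this covariance structure.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0)]
        for _ in range(p - 1):
            z.append(rho * z[-1] + scale * rng.gauss(0.0, 1.0))
        rows.append(z)
    return rows

def true_coefficients(p):
    """Coefficients from the text: 1.5 (x5), 1 (x5), 0.5 (x5), then zeros."""
    return [1.5] * 5 + [1.0] * 5 + [0.5] * 5 + [0.0] * (p - 15)

Z = simulate_predictors(n=1000, p=100, seed=42)
beta = true_coefficients(100)
```

The strong correlation between neighbouring predictors (0.8) is what makes variable screening hard in this design: noise variables adjacent to true signals are easily selected by mistake.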

      By specifying different baseline hazard functions $h_0(t)$, we can generate survival times $T$ that follow different distributions (6). To do so, a value of the survival function $S(t)$ is first drawn from a uniform distribution $U(0,1)$, and $T$ is then generated using the following equation:

      $$ T=H_0^{-1}\left[-\log\left(S(t)\right)\exp\left(-\beta' Z\right)\right] $$ (1)

      Here, $H_0^{-1}$ denotes the inverse of the cumulative hazard function $H_0(t)=\int_0^t h_0(u)\,\mathrm{d}u$. For simplicity, we specify $h_0(t)=1$, in which case the survival time $T$ follows an exponential distribution. Furthermore, we generated the censoring indicator $\Delta$ from a Bernoulli distribution with parameter $1-r$, where $r$ is the censoring rate.
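Equation (1) follows from setting the Cox survival function $S(t\mid Z)=\exp\{-H_0(t)\exp(\beta' Z)\}$ equal to a uniform draw and solving for $t$. With $h_0(t)=1$ we have $H_0(t)=t$, so the inverse is the identity. The sketch below applies that special case. It returns only the generated time and the indicator as described in the text; how censored observation times are handled is not stated in the source and is deliberately left out. The function name and demo values are assumptions.

```python
import math
import random

def simulate_survival(Z, beta, censoring_rate, seed=0):
    """Return (T, Delta) pairs for each covariate row in Z.

    T = -log(U) * exp(-beta'Z) with U ~ Uniform(0, 1], i.e. equation (1)
    with H_0(t) = t. Delta ~ Bernoulli(1 - censoring_rate), where
    Delta = 1 marks an observed event.
    """
    rng = random.Random(seed)
    out = []
    for z in Z:
        lin_pred = sum(b * x for b, x in zip(beta, z))
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is finite
        t = -math.log(u) * math.exp(-lin_pred)
        delta = 1 if rng.random() < 1.0 - censoring_rate else 0
        out.append((t, delta))
    return out

# Demo with all-zero covariates, so T ~ Exponential(rate 1).
Z_demo = [[0.0, 0.0]] * 2000
data = simulate_survival(Z_demo, [1.5, 1.0], censoring_rate=0.3, seed=7)
event_fraction = sum(d for _, d in data) / len(data)
mean_time = sum(t for t, _ in data) / len(data)
```

With a 30% censoring rate the observed event fraction should be close to 0.7, and with zero covariates the mean survival time should be close to 1.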

      Varying the dimensionality of predictors $ \mathit{p} $ and censoring rate $ \mathit{r} $, our simulation study considered the following six main settings:

      Setting 1: $n=1{,}000,\ p=100,\ r=30\%$;

      Setting 2: $n=1{,}000,\ p=100,\ r=50\%$;

      Setting 3: $n=1{,}000,\ p=100,\ r=70\%$;

      Setting 4: $n=1{,}000,\ p=50,\ r=30\%$;

      Setting 5: $n=1{,}000,\ p=50,\ r=50\%$;

      Setting 6: $n=1{,}000,\ p=50,\ r=70\%$.

      For each setting, the number of bootstrap replicates $B$ was set to 200, and the simulated data were split into two parts: 70% to train the models and 30% as a test dataset for comparing model performance. The simulation was repeated 100 times for each setting. Mean values of four variable-screening metrics, namely false discovery rate (FDR), false omission rate (FOR), true positive rate (TPR), and true negative rate (TNR), were calculated to test whether the models correctly identified the important predictors. Finally, the area under the receiver operating characteristic curve (AUC) was calculated for each model to assess how well it predicted the onset of cancer.
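For reference, the four screening metrics can be computed from the set of selected predictors and the set of truly important ones. The sketch below uses the standard confusion-matrix definitions; the function name, the example selection, and the convention of returning 0 for an empty denominator are assumptions, not details from the paper.

```python
def screening_metrics(selected, truly_important, p):
    """FDR, FOR, TPR, TNR for a variable-selection result.

    selected / truly_important are collections of predictor indices
    drawn from range(p).
    """
    selected, truth = set(selected), set(truly_important)
    tp = len(selected & truth)   # important and selected
    fp = len(selected - truth)   # unimportant but selected
    fn = len(truth - selected)   # important but missed
    tn = p - tp - fp - fn        # unimportant and not selected

    def ratio(num, den):
        # Assumed convention: report 0.0 when the denominator is empty.
        return num / den if den else 0.0

    return {
        "FDR": ratio(fp, tp + fp),
        "FOR": ratio(fn, fn + tn),
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, tn + fp),
    }

# 15 true signals among p = 100; a model picks 14 of them plus 2 noise variables.
truth = range(15)
picked = list(range(14)) + [40, 77]
m = screening_metrics(picked, truth, p=100)
```

In this example 2 of the 16 selected variables are noise, so the FDR is 0.125, while the TPR is 14/15.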

    • To assess the validity of the proposed EPCR model for disease risk prediction, we applied it to the Shandong sub-database of the Breast Cancer Cohort Study in Chinese Women (BCCS-CW) (9) to develop a candidate breast cancer incidence risk predictor. The workflow of this part of the study is presented in Supplementary Figure S2.

      The onset of breast cancer was treated as the outcome event, and individuals who had not yet developed breast cancer were treated as censored. We took the age of individuals with breast cancer to be the age at first cancer diagnosis, and the age of individuals who had not yet developed breast cancer to be the age registered at baseline. We randomly selected 70% of individuals from the case group and from the control group to form a training set for model development; the remaining 30% served as the test dataset. The EPCR procedure was performed on the training set to generate an absolute risk prediction model for breast cancer, which was then used to estimate the probability of onset in the test group over the next three or five years. As in the simulations, the number of bootstrap replicates $B$ was set to 200. Based on actual three- and five-year follow-up results, receiver operating characteristic (ROC) curves were plotted to assess model performance, with a single PCR model and the classical Gail model (10) used for comparison.
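The AUC that summarizes an ROC curve equals the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen control. The sketch below computes it directly from that definition, treating onset within the prediction horizon as a binary label; this is a simplification that ignores censoring, and it is not the pROC routine used in the paper. The scores and labels are made-up illustration values.

```python
def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("AUC needs at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))

# Hypothetical predicted risks and observed onset labels (1 = case).
risk = [0.9, 0.4, 0.35, 0.8, 0.1]
onset = [1, 0, 1, 0, 0]
value = auc(risk, onset)
```

Here the case with risk 0.9 outranks all three controls and the case with risk 0.35 outranks one, so 4 of the 6 case-control pairs are ordered correctly. An AUC near 0.5, like the values reported for the Gail model in this study, means the ranking is barely better than chance.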

      All analyses were performed in R (version 4.1.2; R Foundation for Statistical Computing, Vienna, Austria). The packages “glmnet” and “gbm” were used to construct the EPCR model, “pROC” was used to plot the ROC curves, and “table1” was used to create the demographic characteristics table. P<0.05 was considered statistically significant $(\alpha=0.05)$.

    • Table 1 summarizes the mean values of the 5 evaluation metrics over 100 replications of each simulation setting. The EPCR-least absolute shrinkage and selection operator (LASSO) model has the lowest FDR in Settings 1–5 (in Setting 6 only Stepwise-BIC is lower, 0.127 versus 0.149, but with a TPR of just 0.128), which indicates that it is the least likely to select unimportant variables, while its TPR remains high relative to the other models. The EPCR-LASSO model is therefore better able to correctly identify important variables than the other models. Furthermore, a comparison of the EPCR-elastic net (EN) and PCR-EN models showed that introducing the ensemble framework reduced the FDR of variable screening while maintaining a similar FOR. As shown in Figure 1, the AUCs based on the risk scores estimated by the EPCR procedure were higher than those of the corresponding single PCR models and of the stepwise models in all six settings.

      The censoring rate reflects the level of imbalance in the database: the higher the censoring rate, the lower the percentage of cases and the more imbalanced the database. As seen in Table 1, mean FDR increased and mean TPR and AUC decreased for all models as the censoring rate increased; however, the EPCR-LASSO and EPCR-EN models (i.e., those that used the ensemble framework) consistently performed better than their single-model counterparts. For example, at $n=1{,}000$ and $p=100$, the PCR-LASSO model’s FDR rose to 0.368 when the censoring rate was increased to 70%, meaning that more than a third of the variables the model flagged as important were in fact unimportant. In contrast, the EPCR-LASSO model reduced this error by 0.126, to 0.242. Finally, it is worth noting that among the ensemble methods, the EPCR model with the LASSO penalty performed better overall in variable screening than the model with the elastic net $(\alpha=0.5)$ penalty. The two achieved comparable prediction accuracy, with the elastic net penalty performing slightly better.

      For the empirical study, Supplementary Table S1 shows the baseline characteristics of risk factors in the Shandong sub-dataset for the overall population, cases, and controls. Cases made up only 0.3% of this dataset, a serious imbalance. For the EPCR model, the AUCs for 3- and 5-year predictions were 0.691 and 0.642, respectively, compared with 0.502 and 0.525 for the Gail model (see Figure 2). Supplementary Figure S3 shows factor importance scores based on the EPCR model, where the red line indicates the importance score threshold that separates important from unimportant variables; see the Supplementary Materials for details of the factor importance measures. This analysis identified life satisfaction, dysmenorrhea, number of miscarriages, and breastfeeding as influential variables, a finding that is consistent with empirical data.

    • Most existing cancer prediction models can be divided into absolute risk models and relative risk models. The latter, however, is effectively a single classifier that can only predict whether or not an individual is at high risk, not the individual’s risk of developing cancer over a given future period. The widely used Gail model (10), a breast cancer risk assessment tool, is an absolute risk model based on five breast cancer risk factors and their interactions.

      In recent years, machine learning (ML) has been used to improve the predictive performance of cancer prediction models. Most current studies have focused on ML classifiers such as k-nearest neighbor (KNN) (11) and random forest (12) (i.e., for the identification of high-risk individuals), or Support Vector Machine (SVM) (13) and logistic regression models (i.e., for the prediction of relative risk). Moreover, most of these models use only the binary cancer label of each sample and do not make full use of the follow-up information in the data.

      At the same time, given that the databases used to develop tumorigenesis risk prediction models are mostly imbalanced, we propose applying ensemble learning methods to improve prediction performance. Specifically, the Bagging ensemble framework can better handle imbalanced data. Here, a PCR model was used as the base predictor, since it makes full use of follow-up information while also adapting to high-dimensional data. Several simulation studies were carried out to verify the effectiveness of this method under different censoring rates. As shown in Table 1, the AUC based on the risk scores estimated by the EPCR model was consistently higher than that of the corresponding single PCR model or a traditional stepwise regression model under all settings. This suggests that introducing the Bagging ensemble framework can improve the predictive performance of PCR models, and this advantage becomes more apparent as the censoring rate increases. For example, when $n=1{,}000$ and $p=100$, the AUC of EPCR-LASSO exceeded that of PCR-LASSO by 0.015 at a 30% censoring rate (0.878 versus 0.863) and by 0.022 at a 70% censoring rate (0.864 versus 0.842).

      In addition, the EPCR model allows for a more robust data-driven identification of risk factors. Under all simulation settings, we calculated FDR, FOR, TPR, and TNR values for variable screening. These results showed that EPCR-LASSO had the lowest FDR among the penalized models while maintaining a very high TPR. For example, for Setting 1, EPCR-LASSO had the lowest FDR (0.164 lower than PCR-LASSO) as well as a TPR greater than 0.93. This means that the variables identified by the EPCR-LASSO approach contain the smallest share of unimportant variables while still capturing most of the important ones; that is, EPCR-LASSO is a more accurate approach for identifying important variables. Moreover, EPCR-LASSO retained this advantage over the other penalized models as the censoring rate increased. We also found that EPCR-EN can substantially reduce FDR while maintaining a similar level of TPR to PCR-EN. Taken together, these results suggest that the EPCR procedure is well suited to identifying important risk factors. For cancers whose etiology is unknown, the number of cases that can be used to train a prediction model is extremely small, so the exclusion or inclusion of a single case can have a significant impact on the selection of risk factors. The EPCR model benefits from the Bagging ensemble framework to identify risk factors more robustly (14), which in turn can provide a more meaningful reference for studies of disease etiology.

      Next, we developed and validated a breast cancer risk prediction model by analyzing the large BCCS-CW database using the EPCR procedure. Compared with the classical Gail model, our model achieved better discrimination (Figure 2). The AUCs for 3- and 5-year predictions of the EPCR model were 0.691 and 0.642, improvements of 0.189 and 0.117 over the classical Gail model, respectively. Another published absolute risk prediction model for the Chinese population showed a maximum AUC of only 0.634 (15). The difference between our results and this model further suggests that cancer prediction models developed by the EPCR procedure are more accurate in identifying high-risk populations and may be more useful for rationally allocating healthcare resources under medical constraints.

      However, the EPCR model developed here has limitations. The EPCR procedure is applicable to developing disease prediction models only where the corresponding risk factors satisfy the proportional hazards assumption, because the EPCR model is in effect an average of multiple Cox regression models. In most cases, especially those involving high-dimensional data, the proportional hazards assumption does not hold. The EPCR model is therefore more suitable for short-term disease risk prediction. As shown in Figure 2, the 5-year AUC based on the risk score estimated by the EPCR model is lower than the 3-year AUC, so the actual effectiveness of the EPCR model in predicting risk may be lower when applied to longer (e.g., 10-year) timeframes.

    • No conflicts of interest reported.
