RiGoR: reporting guidelines to address common sources of bias in risk model development
© Kerr et al.; licensee BioMed Central. 2015
Received: 27 November 2014
Accepted: 24 December 2014
Published: 24 January 2015
Reviewing the literature in many fields on proposed risk models reveals problems with the way many risk models are developed. Furthermore, papers reporting new risk models do not always provide sufficient information to allow readers to assess the merits of the model. In this review, we discuss sources of bias that can arise in risk model development. We focus on two biases that can be introduced during data analysis. These two sources of bias are sometimes conflated in the literature and we recommend the terms resubstitution bias and model-selection bias to delineate them. We also propose the RiGoR reporting standard to improve transparency and clarity of published papers proposing new risk models.
There is currently broad interest in developing risk prediction models in medicine. However, recent reviews in a variety of fields have described a substantial number of flaws in the way risk models are developed and/or deficiencies in the way the work is reported [1-6]. An extensive review that spanned many fields of application found that the vast majority of papers reporting risk models omitted important details such as the extent and handling of missing data, key information on the study population, and the precise definition of the outcome or event of interest. An evaluation of model calibration was typically absent. Additional issues include a tendency for models to be favorably evaluated when the model’s developers are involved in validating the model.
Research reports related to risk prediction sometimes refer to “optimistic bias” or “optimism bias” [4,7,8]. Unfortunately, these terms are used to refer to a variety of problems in risk model development or assessment. It would be useful to have clear, distinctive, and descriptive names for different sources of bias that can affect scientific results. The first goal of this review is to propose terminology for referring to two sources of bias that are common in developing risk models. Both biases can arise during data analysis, which makes them avoidable, at least in principle. The second goal of this paper is a proposal for a set of guidelines for reporting proposed new risk models. The guidelines should help readers evaluate the merits of new risk models and understand whether developers were attentive to avoiding common sources of bias.
Common sources of bias in risk model development
Currently, two sources of bias that arise in developing risk prediction models from combinations of biomarkers and/or clinical variables are both called “optimistic bias.” We propose the terms “resubstitution bias” and “model-selection bias” as more precise and descriptive terms than “optimistic bias.” A predictive model will tend to perform better on the data that were used to fit or “train” the model than on new data. Resubstitution bias arises when the data that are used to fit a predictive model are used a second time to assess the performance of the model. Re-using the data in this way has been called resubstitution [9-15], so it is a modest extension to refer to the resulting bias as resubstitution bias. Since the ultimate goal of a risk prediction model is to estimate risks for new individuals, assessing model performance via resubstitution does not provide an unbiased or “honest” estimate of the model’s predictive capacity.
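To make resubstitution bias concrete, the following is a minimal simulation sketch rather than a method from the literature cited here. It assumes pure-noise predictors, a toy linear score (the difference in group means, chosen only because it needs no external library), and hypothetical helper names such as `simulate`, `fit`, and `compare`. Because the predictors carry no information, any AUC above 0.5 obtained by resubstitution is bias.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC (assumes no case-control ties)."""
    ranked = sorted(zip(scores, labels))
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)

def simulate(n_per_group, n_predictors, rng):
    """Pure-noise data: the predictors are unrelated to the outcome."""
    labels = [1] * n_per_group + [0] * n_per_group
    rows = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in labels]
    return rows, labels

def fit(rows, labels):
    """Toy linear score: weight = case mean minus control mean, per predictor."""
    weights = []
    for j in range(len(rows[0])):
        cases = [r[j] for r, y in zip(rows, labels) if y == 1]
        controls = [r[j] for r, y in zip(rows, labels) if y == 0]
        weights.append(sum(cases) / len(cases) - sum(controls) / len(controls))
    return weights

def predict(weights, rows):
    return [sum(w * x for w, x in zip(weights, r)) for r in rows]

def compare(n_reps=100, seed=1):
    """Average AUC on the training data versus on new data, over many datasets."""
    rng = random.Random(seed)
    resub, fresh = [], []
    for _ in range(n_reps):
        train_x, train_y = simulate(20, 10, rng)   # 20 cases, 20 controls
        new_x, new_y = simulate(100, 10, rng)      # independent new subjects
        w = fit(train_x, train_y)
        resub.append(auc(predict(w, train_x), train_y))  # data used twice
        fresh.append(auc(predict(w, new_x), new_y))      # honest assessment
    return sum(resub) / n_reps, sum(fresh) / n_reps

mean_resub_auc, mean_fresh_auc = compare()
print(f"resubstitution AUC: {mean_resub_auc:.2f}; new-data AUC: {mean_fresh_auc:.2f}")
```

Under these assumptions the resubstitution AUC should sit well above 0.5 while the new-data AUC stays close to 0.5, which is the gap that the term resubstitution bias is meant to name.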
Model-selection bias arises when many models are assessed, and the best performing model is reported. This optimistic bias persists even if analysts have corrected for resubstitution bias in assessing the model. Occasionally, investigators have a single, pre-specified model that they fit with data. In this case, the resulting model is susceptible to resubstitution bias but not to model-selection bias. More typically, however, data analysts have a set of candidate predictors to choose from, which translates to a set of possible models. For example, if there are k candidate predictors and an analyst limits the set of possible models to linear logistic models, then the number of possible models is $2^k - 1$, one for every non-empty subset of the predictors. For 20 candidate predictors this is 1,048,575 models, over 1 million, and we expect some of these to perform well by chance. Naïve assessment of the best-performing model is likely optimistic because this model was chosen precisely because it performed best on the available data [16-19]. “Model-selection bias” refers to this particular source of optimism.
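The following sketch illustrates model-selection bias in isolation. It is a hypothetical simulation, not an analysis from the cited papers: each candidate “model” is a single noise marker used as-is, so nothing is fitted and there is no resubstitution bias. The only source of optimism is reporting the best of many candidates. The function names and sample sizes are assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC (assumes no case-control ties)."""
    ranked = sorted(zip(scores, labels))
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)

def count_models(k):
    """Number of non-empty subsets of k candidate predictors."""
    return 2 ** k - 1

def noise_markers(n_per_group, n_markers, rng):
    """Markers with no relationship to the outcome."""
    labels = [1] * n_per_group + [0] * n_per_group
    markers = [[rng.gauss(0, 1) for _ in labels] for _ in range(n_markers)]
    return markers, labels

def best_marker_optimism(n_reps=100, n_markers=20, n_per_group=50, seed=2):
    """Average AUC of the best-looking marker, and of that marker in new data."""
    rng = random.Random(seed)
    selected, fresh = [], []
    for _ in range(n_reps):
        markers, labels = noise_markers(n_per_group, n_markers, rng)
        aucs = [auc(m, labels) for m in markers]
        best = max(range(n_markers), key=lambda j: aucs[j])
        selected.append(aucs[best])                  # reported "winner"
        new_markers, new_labels = noise_markers(n_per_group, n_markers, rng)
        fresh.append(auc(new_markers[best], new_labels))  # same marker, new subjects
    return sum(selected) / n_reps, sum(fresh) / n_reps

n_models = count_models(20)
mean_selected_auc, mean_fresh_auc = best_marker_optimism()
print(f"possible models with 20 predictors: {n_models}")
print(f"best-of-20 AUC: {mean_selected_auc:.2f}; same marker, new data: {mean_fresh_auc:.2f}")
```

Even with only 20 candidates, the selected marker should look better than chance on the data used to select it and return to roughly 0.5 in new subjects. Searching over a million candidate models would be expected to amplify this effect.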
Although resubstitution bias and model-selection bias are well-known phenomena among methodologists and many data analysts [7,10,20], there is no standard terminology for referring to these sources of bias. We find the term “optimistic bias” inadequate for several reasons. “Optimistic bias” describes the direction of the bias and not the source of the bias, so it is insufficiently descriptive. By referring to multiple sources of bias with the same phrase, researchers might claim to have addressed “optimistic bias” in developing a predictive model, when in fact they have only addressed one source of bias. Finally, in addition to resubstitution and model-selection there are additional phenomena referred to as “optimistic bias”, including the observation in psychology that people often underestimate personal risks.
We emphasize that resubstitution bias and model-selection bias are well-known among methodologists, and our modest contribution is our proposal for standard terms to refer to these issues. These terms have appeared occasionally in the literature [17,23,24] but are not in widespread use. Previously proposed terminology is “parameter uncertainty” and “model uncertainty”, where “model uncertainty” is said to lead to “selection bias.” However, this terminology is not standard and we find it less descriptive than the terms we propose. Furthermore, “selection bias” has an established meaning in epidemiology, where it refers to non-representative selection of study subjects.
Methods for estimating the performance of a risk model without optimistic bias from resubstitution include bootstrapping techniques, cross-validation, and using independent datasets for model development and validation [7,25]. Bootstrapping and cross-validation are computationally intensive, and employing them can surpass the abilities of some data analysts or software packages. Moreover, there are different varieties of bootstrapping and cross-validation and a lack of consensus on the best procedure. A recent investigation provides some much-needed guidance on the relative merits of different procedures for estimating the area under the ROC curve (AUC or “C statistic”) without resubstitution bias. Using independent datasets for model development and validation is computationally simpler, and provides stronger evidence in favor of a reported risk model if the validation dataset is from a separate study (“external validation”). More commonly, however, a validation dataset is created by data splitting, that is, by randomly partitioning the available data into a “training” dataset and a “test” or validation dataset. This strategy offers simplicity and flexibility in data analysis, but is criticized for its statistical inefficiency because only part of the data inform development of the risk model. With data splitting there is an inherent tension between the amount of data allotted to the training dataset for developing the risk model, and the amount of data allotted to the test dataset for evaluating the risk model. If the training dataset is too small, a good risk model might not be found. On the other hand, if the test dataset is too small then estimates of model performance, while unbiased, are highly variable, making promising results less compelling.
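As one illustration of the methods just described, the sketch below compares a resubstitution AUC with a stratified 5-fold cross-validated AUC and with the AUC in a large independent sample. It is a simplified example under stated assumptions, not a recommended procedure: the data are simulated with one informative predictor and nine noise predictors, the model is a toy mean-difference score, and the fold-averaged AUC is only one of several cross-validation variants. All function names are hypothetical.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC (assumes no case-control ties)."""
    ranked = sorted(zip(scores, labels))
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)

def simulate(n_per_group, n_predictors, effect, rng):
    """First predictor is shifted by `effect` in cases; the rest are noise."""
    labels = [1] * n_per_group + [0] * n_per_group
    rows = []
    for y in labels:
        row = [rng.gauss(0, 1) for _ in range(n_predictors)]
        row[0] += effect * y
        rows.append(row)
    return rows, labels

def fit(rows, labels):
    """Toy linear score: weight = case mean minus control mean, per predictor."""
    weights = []
    for j in range(len(rows[0])):
        cases = [r[j] for r, y in zip(rows, labels) if y == 1]
        controls = [r[j] for r, y in zip(rows, labels) if y == 0]
        weights.append(sum(cases) / len(cases) - sum(controls) / len(controls))
    return weights

def predict(weights, rows):
    return [sum(w * x for w, x in zip(weights, r)) for r in rows]

def cross_validated_auc(rows, labels, n_folds, rng):
    """Stratified k-fold CV: refit on each training part, score the held-out part."""
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(cases)
    rng.shuffle(controls)
    fold_aucs = []
    for f in range(n_folds):
        test_ids = cases[f::n_folds] + controls[f::n_folds]
        held_out = set(test_ids)
        train_ids = [i for i in range(len(rows)) if i not in held_out]
        w = fit([rows[i] for i in train_ids], [labels[i] for i in train_ids])
        fold_aucs.append(auc(predict(w, [rows[i] for i in test_ids]),
                             [labels[i] for i in test_ids]))
    return sum(fold_aucs) / n_folds

def compare(n_reps=100, seed=3):
    rng = random.Random(seed)
    resub, cv, fresh = [], [], []
    for _ in range(n_reps):
        rows, labels = simulate(20, 10, 1.0, rng)
        new_rows, new_labels = simulate(100, 10, 1.0, rng)
        w = fit(rows, labels)
        resub.append(auc(predict(w, rows), labels))
        cv.append(cross_validated_auc(rows, labels, 5, rng))
        fresh.append(auc(predict(w, new_rows), new_labels))
    return sum(resub) / n_reps, sum(cv) / n_reps, sum(fresh) / n_reps

mean_resub, mean_cv, mean_fresh = compare()
print(f"resubstitution: {mean_resub:.2f}; cross-validated: {mean_cv:.2f}; new data: {mean_fresh:.2f}")
```

In this setup the cross-validated estimate should track the new-data AUC much more closely than the resubstitution estimate does. Each fold here holds only four cases and four controls, so a single cross-validated estimate is itself quite variable. That variability is the same issue noted above for small test datasets.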
An advantage of having an independent validation dataset is that both resubstitution and model-selection bias are accounted for as long as the validation dataset is not used in any stage of model development, including variable selection.
Model-selection bias tends to be more difficult to address without an independent validation dataset. In principle, model-selection can be incorporated into a bootstrapping or cross-validation procedure, but this requires the use of an automated model-building process and further increases the computational complexity of using these methods.
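The sketch below shows what incorporating model selection into cross-validation can look like, under simplifying assumptions. The “automated model-building process” is reduced to a hypothetical rule, `select_best`, that picks the single marker with the highest AUC, and all markers are simulated noise. The comparison is between selecting once on all of the data and then cross-validating, versus repeating the selection inside every training fold.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC (assumes no case-control ties)."""
    ranked = sorted(zip(scores, labels))
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)

def select_best(markers, labels, ids):
    """Stand-in for automated model building: best single marker among `ids`."""
    ys = [labels[i] for i in ids]
    return max(range(len(markers)),
               key=lambda j: auc([markers[j][i] for i in ids], ys))

def stratified_folds(labels, n_folds, rng):
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(cases)
    rng.shuffle(controls)
    return [cases[f::n_folds] + controls[f::n_folds] for f in range(n_folds)]

def cv_auc(markers, labels, folds, select_inside):
    """Fold-averaged AUC, with selection done inside or outside the CV loop."""
    everyone = list(range(len(labels)))
    chosen_on_all = select_best(markers, labels, everyone)  # sees held-out subjects
    fold_aucs = []
    for test_ids in folds:
        held_out = set(test_ids)
        train_ids = [i for i in everyone if i not in held_out]
        if select_inside:
            j = select_best(markers, labels, train_ids)  # selection repeated per fold
        else:
            j = chosen_on_all
        fold_aucs.append(auc([markers[j][i] for i in test_ids],
                             [labels[i] for i in test_ids]))
    return sum(fold_aucs) / len(fold_aucs)

def compare(n_reps=100, n_markers=20, n_per_group=50, n_folds=5, seed=4):
    rng = random.Random(seed)
    labels = [1] * n_per_group + [0] * n_per_group
    outside, inside = [], []
    for _ in range(n_reps):
        markers = [[rng.gauss(0, 1) for _ in labels] for _ in range(n_markers)]
        folds = stratified_folds(labels, n_folds, rng)
        outside.append(cv_auc(markers, labels, folds, select_inside=False))
        inside.append(cv_auc(markers, labels, folds, select_inside=True))
    return sum(outside) / n_reps, sum(inside) / n_reps

mean_outside, mean_inside = compare()
print(f"selection outside CV: {mean_outside:.2f}; selection inside CV: {mean_inside:.2f}")
```

With noise markers, selecting outside the loop should still yield an AUC above 0.5, whereas selecting inside each training fold should bring the estimate back to about 0.5. The price is that the selection step runs once per fold, which is the added computational burden mentioned above.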
RiGoR: reporting guidelines for risk models
Section and topic
Identify the article as reporting the development of a risk model combining multiple predictors (MeSH “Risk”, possibly “risk factor” and/or “biomarker”)
Identify the overarching goal – why would an effective risk model be valuable to clinical care, public health, or research?
Describe the study subjects: The inclusion and exclusion criteria (and resulting sample sizes), setting and locations where the data were collected. Descriptive statistics should include variable ranges.
Describe participant recruitment.
Report when study was done, including beginning and ending dates of recruitment.
Describe the study design. Was this a cohort study? A case–control study? Note: matched case–control studies are generally not suitable for risk model development unless special methods and external data are used.
Describe data collection, including timing of specimen collection for biomarker measurement. Document where there was blinding to clinical outcomes.
Document technical specifications of biomarker materials and methods, including marker units. Describe possibility of batch effects, storage effects, number of freeze/thaw cycles, assay upper and lower limits. Document how biomarker values at the limits of detection were handled.
For multi-center studies, document whether biomarker measurements can be considered comparable between study sites, or whether lab effect, platform differences, or variations in clinical practice may affect biomarker levels.
Describe how the outcome is defined (e.g., precise definition for disease diagnosis, or death from any cause vs. specific cause)
Document measures of model performance, e.g. AUC for risk models; sensitivity and specificity for a pre-selected risk threshold; report methods to quantify uncertainty (e.g., 95% confidence intervals via bootstrapping)
Document how markers were used: transformations (e.g., log)? categorization of continuous variables? Other adjustments (e.g. kidney biomarkers adjusted for urine creatinine)?
List all variables initially considered as candidates
Describe variable selection: how were variables selected to include in the risk model or classifier? Pre-specified prior to any analysis of the data? Selected based on univariate analysis? An exhaustive search over a set of models? Stepwise procedure?
Describe how model-selection bias was addressed in assessing the performance of final reported model(s). If model-selection bias was not addressed, state this explicitly.
Document methodology used to develop risk model or classifier: logistic regression? logic regression? relative risk regression?
Document methodology to avoid or correct for resubstitution bias in measures of the performance of the final reported model(s).
If an independent validation “test” dataset was used, document that the test data were not used for any part of model development, including variable selection. Document that these data were accessed only when models were finalized. Report the number of models evaluated on the “test” data and how these were selected.
If cross-validation is used, state how final reported model is derived.
For multi-center studies with the possibility of confounding by center, describe methods for adjusting or accounting for center effects.
Describe how indeterminate results and missing data were handled, or report that there were no indeterminate results or missing data.
Describe methods for assessing model calibration.
Report clinical and demographic characteristics of the study population (e.g. age, sex, presenting symptoms, co-morbidity, current treatments, recruitment centers).
Report final risk model or classifier
Report estimates of model performance with measures of uncertainty when possible (e.g., 95% confidence interval)
Assess and report evidence of risk model calibration.
Discuss prospects of final risk model for satisfying the research goal
Discuss known and possible limitations to generalizability or applicability of risk model
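The items above that ask for performance estimates with measures of uncertainty mention 95% confidence intervals via bootstrapping as one option. A minimal sketch of a percentile bootstrap interval for the AUC follows. It assumes that risk scores and binary outcomes are already in hand (simulated here), resamples cases and controls separately so that every resample contains both groups, and uses a hypothetical function name, `bootstrap_auc_ci`. It quantifies sampling uncertainty only and does not by itself correct for resubstitution or model-selection bias.

```python
import random

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC.

    Ties within a group are harmless; ties between a case and a control
    are not handled and are assumed absent (continuous scores).
    """
    ranked = sorted(zip(scores, labels))
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_cases * (n_cases + 1) / 2) / (n_cases * n_controls)

def bootstrap_auc_ci(scores, labels, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling cases and controls separately."""
    rng = random.Random(seed)
    case_scores = [s for s, y in zip(scores, labels) if y == 1]
    control_scores = [s for s, y in zip(scores, labels) if y == 0]
    boot_labels = [1] * len(case_scores) + [0] * len(control_scores)
    stats = []
    for _ in range(n_boot):
        resampled = (rng.choices(case_scores, k=len(case_scores))
                     + rng.choices(control_scores, k=len(control_scores)))
        stats.append(auc(resampled, boot_labels))
    stats.sort()
    lower = stats[int((1 - level) / 2 * n_boot)]
    upper = stats[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

# Simulated validation data: scores are higher on average in cases.
rng = random.Random(5)
scores = [rng.gauss(1, 1) for _ in range(100)] + [rng.gauss(0, 1) for _ in range(100)]
labels = [1] * 100 + [0] * 100

point_auc = auc(scores, labels)
ci_lower, ci_upper = bootstrap_auc_ci(scores, labels)
print(f"AUC {point_auc:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```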
Previously published reporting standards that are related to risk model development are STARD, GRIPS, and REMARK. The STARD initiative assembled a comprehensive set of standards “to improve the accuracy and completeness of reporting of studies of diagnostic accuracy, to allow readers to assess the potential for bias in the study (internal validity) and to evaluate its generalisability (external validity)” (http://www.stard-statement.org/). A primary result of the initiative is a 25 item checklist for articles reporting studies of diagnostic accuracy. The RiGoR guidelines are meant to emulate the contribution of STARD with a set of criteria tailored to the development of risk prediction instruments. The REMARK recommendations were developed in the context of tumor markers with the potential to be used for prognosis. The focus of REMARK is markers for predicting time-to-event outcomes such as overall survival. In contrast, the focus of RiGoR is estimating patient risks of a binary outcome. The GRIPS statement offers reporting standards focused on studies of risk prediction models that include genetic variants. The RiGoR guidelines are more general and more detailed.
In proposing the RiGoR standards, we both acknowledge and build upon the important previous efforts described above. For each RiGoR item, Table 1 notes similar STARD, REMARK, or GRIPS items. As Table 1 shows, most items are similar to criteria given in at least one of these previous reports. However, there are some notable exceptions. First, RiGoR includes a guideline that the calibration of a risk prediction model should be assessed and reported, as calibration is a necessary requirement for the validity of a model. While the importance of calibration is noted in many publications [6,30,31], it is not included in GRIPS. Second, our guidelines explicitly address resubstitution bias and model-selection bias, two common types of bias that can arise during risk model development.
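Because calibration is singled out here, a small sketch of one way to examine it may be useful. The example assumes simulated subjects whose true risks are known, and compares a well-calibrated set of predicted risks with a hypothetical model that overestimates every risk by 50%. It reports the ratio of observed to expected events and a table of observed event rates within groups ordered by predicted risk. The helper names and the choice of ten groups are illustrative assumptions, not part of the RiGoR guidelines.

```python
import random

def calibration_table(predicted, outcomes, n_groups=10):
    """Per risk group: (mean predicted risk, observed event rate, group size)."""
    ranked = sorted(zip(predicted, outcomes))
    n = len(ranked)
    table = []
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        table.append((mean_pred, observed, len(group)))
    return table

def observed_to_expected(predicted, outcomes):
    """Calibration-in-the-large: observed events over the sum of predicted risks."""
    return sum(outcomes) / sum(predicted)

# Simulated population in which each subject's true risk is known.
rng = random.Random(6)
true_risk = [rng.uniform(0.05, 0.6) for _ in range(5000)]
outcomes = [1 if rng.random() < p else 0 for p in true_risk]

calibrated = true_risk                                   # predictions equal true risks
overestimating = [min(1.0, 1.5 * p) for p in true_risk]  # hypothetical miscalibrated model

oe_calibrated = observed_to_expected(calibrated, outcomes)
oe_overestimating = observed_to_expected(overestimating, outcomes)
table = calibration_table(overestimating, outcomes)

print(f"O/E, calibrated model: {oe_calibrated:.2f}")
print(f"O/E, overestimating model: {oe_overestimating:.2f}")
for mean_pred, observed, size in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={size}")
```

Both sets of predictions rank subjects identically and therefore have the same AUC, yet only one of them gives risks that match observed event rates. This is why discrimination measures alone cannot stand in for an assessment of calibration.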
There are items in the REMARK and GRIPS guidelines that are not included in RiGoR. In Appendix A we document our reasons for excluding these items.
In epidemiology, pitfalls in study design and data analysis commonly acquire standard names. Some examples include immortal time bias in survival analysis and lead time bias in the evaluation of diagnostic screening tools. Publication bias is a widely recognized issue in the scientific literature. The most helpful terminology is descriptive, helps codify important concepts, and aids scientific communication. We believe the terms “resubstitution bias” and “model-selection bias” accomplish these goals.
In this article we have reviewed and discussed resubstitution bias and model-selection bias. We do not mean to suggest that they are the only two sources of bias that can affect risk model development. However, we believe resubstitution bias and model-selection bias deserve special attention because they are common. Furthermore, they are biases that arise during data analysis, which means, at least in principle, that they should be avoidable with use of proper methods of data analysis.
Other types of bias can enter into a study at earlier stages. For example, selection bias can inflate the performance of a proposed risk model if the cases in the dataset tend to be more severe than the population of cases, or the controls tend to be healthier than the population of controls. Having an objective way to define the population of interest and to define the event of interest is an important aspect of a quality study. The RiGoR standards are designed to ensure that these and other important aspects of study design, conduct, and data analysis are documented.
The research was supported by the NIH grant RO1HL085757 (CRP) to fund the TRIBE-AKI Consortium to study novel biomarkers of acute kidney injury in cardiac surgery. CRP is also supported by NIH grant K24DK090203. SGC is supported by National Institutes of Health Grants K23DK080132 and R01DK096549. SGC and CRP are also members of the NIH-sponsored ASsess, Serial Evaluation, and Subsequent Sequelae in Acute Kidney Injury (ASSESS-AKI) Consortium (U01DK082185).
1. Kyzas PA, Loizou KT, Ioannidis JP. Selective reporting biases in cancer prognostic factor studies. J Natl Cancer Inst. 2005;97(14):1043–55.
2. Kyzas PA, Denaxa-Kyza D, Ioannidis JP. Quality of reporting of cancer prognostic marker studies: association with reported prognostic effect. J Natl Cancer Inst. 2007;99(3):236–43.
3. Concato J, Feinstein AR, Holford TR. The risk of determining risk with multivariable models. Ann Intern Med. 1993;118(3):201–10.
4. Siontis GC, Tzoulaki I, Siontis KC, Ioannidis JP. Comparisons of established risk prediction models for cardiovascular disease: systematic review. BMJ. 2012;344:e3318.
5. Anothaisintawee T, Teerawattananon Y, Wiratkapun C, Kasamesup V, Thakkinstian A. Risk prediction models of breast cancer: a systematic review of model performances. Breast Cancer Res Treat. 2012;133(1):1–10.
6. Collins GS, Omar O, Shanyinde M, Yu LM. A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods. J Clin Epidemiol. 2013;66(3):268–77.
7. Harrell FE Jr. Regression Modeling Strategies. New York: Springer; 2001.
8. Hammond T, Verbyla D. Optimistic bias in classification accuracy assessment. Int J Remote Sens. 1996;7(6):1261–6.
9. Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J Natl Cancer Inst. 2007;99(2):147–57.
10. Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21(15):3301–7.
11. Simon R. Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. Br J Cancer. 2003;89(9):1599–604.
12. Subramanian J, Simon R. Overfitting in prediction models - is it a problem only in high dimensions? Contemp Clin Trials. 2013;36(2):636–41.
13. Hanczar B, Hua J, Sima C, Weinstein J, Bittner M, Dougherty ER. Small-sample precision of ROC-related estimates. Bioinformatics. 2010;26(6):822–30.
14. Braga-Neto U, Hashimoto R, Dougherty ER, Nguyen DV, Carroll RJ. Is cross-validation better than resubstitution for ranking genes? Bioinformatics. 2004;20(2):253–8.
15. Way TW, Sahiner B, Hadjiiski LM, Chan HP. Effect of finite sample size on feature selection and classification: a simulation study. Med Phys. 2010;37(2):907–20.
16. Ding Y, Tang S, Liao SG, Jia J, Oesterreich S, Lin Y, et al. Bias correction for selecting the minimal-error classifier from many machine learning models. Bioinformatics. 2014;30(22):3152–8.
17. Berrar D, Bradbury I, Dubitzky W. Avoiding model selection bias in small-sample genomic datasets. Bioinformatics. 2006;22(10):1245–50.
18. Varma S, Simon R. Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics. 2006;7:91.
19. Tibshirani RJ, Tibshirani R. A bias-correction for the minimum error rate in cross-validation. Ann Appl Stat. 2009;3(2):822–9.
20. Steyerberg E. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer; 2009.
21. Smith GC, Seaman SR, Wood AM, Royston P, White IR. Correcting for optimistic prediction in small data sets. Am J Epidemiol. 2014;180(3):318–24.
22. Fontaine KR, Smith S. Optimistic bias in cancer risk perception: a cross-national study. Psychol Rep. 1995;77(1):143–6.
23. Yu YP, Landsittel D, Jing L, Nelson J, Ren B, Liu L, et al. Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J Clin Oncol. 2004;22(14):2790–9.
24. Hathaway B, Landsittel DP, Gooding W, Whiteside TL, Grandis JR, Siegfried JM, et al. Multiplexed analysis of serum cytokines as biomarkers in squamous cell carcinoma of the head and neck patients. Laryngoscope. 2005;115(3):522–7.
25. Kerr KF, Meisner A, Thiessen-Philbrook H, Coca SG, Parikh CR. Developing risk prediction models for kidney injury and assessing incremental value for novel biomarkers. Clin J Am Soc Nephrol. 2014;9(8):1488–96.
26. Dobbin KK, Simon RM. Optimally splitting cases for training and testing high dimensional classifiers. BMC Med Genomics. 2011;4:31.
27. McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM; Statistics Subcommittee of the NCI-EORTC Working Group on Cancer Diagnostics. REporting recommendations for tumor MARKer prognostic studies (REMARK). Nat Clin Pract Oncol. 2005;2(8):416–22.
28. Bossuyt PM, Reitsma JB, Bruns DE, Gatsonis CA, Glasziou PP, Irwig LM, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative. Ann Intern Med. 2003;138(1):40–4.
29. Janssens AC, Ioannidis JP, van Duijn CM, Little J, Khoury MJ; GRIPS Group. Strengthening the reporting of Genetic RIsk Prediction Studies: the GRIPS Statement. PLoS Med. 2011;8(3):e1000420.
30. Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, et al. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010;21(1):128–38.
31. Pepe M, Janes H. Methods for Evaluating Prediction Performance of Biomarkers and Tests. In: Lee M-LT, Gail M, Pfeiffer R, Satten G, Cai T, Gandy A, editors. Risk Assessment and Evaluation of Predictions. New York: Springer; 2013. p. 107–42.
32. Suissa S. Immortal time bias in observational studies of drug effects. Pharmacoepidemiol Drug Saf. 2007;16(3):241–9.
33. Hutchison GB, Shapiro S. Lead time gained by diagnostic screening for breast cancer. J Natl Cancer Inst. 1968;41(3):665–81.
34. Dickersin K. The existence of publication bias and risk factors for its occurrence. JAMA. 1990;263(10):1385–9.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.