Main

Exposure to different risk factors plays an important role in the likelihood of an individual developing or experiencing more severe outcomes from certain diseases, such as high blood pressure increasing the risk of heart disease or not having access to a safe water source increasing the risk of diarrheal diseases1. Understanding and quantifying the relationship between risk factor exposure and the risk of a subsequent outcome is therefore essential to set priorities for public policy, to guide public health practices, to help clinicians advise their patients and to inform personal health choices.

Consequently, information on risk–outcome relationships can be used in the formulation of many types of public policies, including national recommendations on diet, occupational health rules, regulations on behavior such as smoking in public places, and guidance on appropriate levels of taxes and subsidies. As new evidence is continuously being produced and published, the systematic and comparable assessment of risk functions is a dynamic challenge. Up-to-date assessments of risk–outcome relationships are essential to, and a core component of, the Global Burden of Diseases, Injuries, and Risk Factors Study (GBD) comparative risk assessment (CRA)1,2,3, which aims to help decision-makers understand the magnitude of different health problems.

Evidence on risk–outcome relationships comes from many types of studies, including randomized controlled trials (RCTs), cohort studies, case-control studies, cross-sectional analyses, ecological studies and animal studies. Each study type has characteristic strengths and weaknesses.

For example, RCTs are the most robust method for dealing with confounding but are often conducted with strict inclusion and exclusion criteria, meaning that trial participants are unlikely to be fully representative of the general population, as well as being done over relatively short durations3,4,5.

Case-control studies are well suited for understanding the risks linked to rare outcomes but may be subject to recall bias for past exposure6,7.

Animal studies are widely used in evaluating the risks of consumer products and environmental risks but may not be generalizable to humans8.

Study design and analysis impact causal interpretation and understanding of the results9. When synthesizing evidence from different studies, strong assumptions—usually that of a log-linear relationship between risk and exposure—are often made to increase the mathematical tractability of the analysis10,11,12. Between-study heterogeneity—that is, disagreement in study-specific inferred relationships between risk exposure and outcome—is quantified in meta-analytic summaries and has some effect on fixed-effects variance estimates, but it is not otherwise used in overall assessments of the uncertainty in risk–outcome relationships12,13.

Risk factors associated with comparatively modest increases in the hazard are often questioned because of the potential for residual confounding14. Given the very mixed evidence landscape, it is perhaps not surprising that there are so many controversies in the literature15,16,17,18.

While evidence is often heterogeneous, the need for clear guidance has led national advisory groups and international organizations to use expert committees to evaluate the evidence and formulate recommendations. The biggest advantage of expert groups is their ability to carefully consider nuances in the available evidence, but they are inherently subjective.

For instance, expert groups across subfields of health science weight types of evidence differently, and even groups of experts within the same subfield may arrive at divergent conclusions. These expert groups often use meta-analyses of the available evidence, such as those produced by the Cochrane Collaboration19, as an input to their deliberations. Even Cochrane Reviews, however, allow authors to use a range of methodologies and approaches to assessing risk of bias in studies, limiting comparability across risk–outcome pairs19.

Tools have been produced to help standardize consideration of evidence, such as Grading of Recommendations, Assessment, Development and Evaluations (GRADE20,21), but while very helpful, they cannot be implemented algorithmically. No quantitative assessment of the evidence can or should substitute completely for expert deliberation, but a quantitative meta-analytic approach that addresses some of the issues identified by GRADE and others could be a useful input to international and national expert committee considerations.

Here, we propose a complementary approach, in which we quantify the mean relationship (the risk function) between risk exposure and a disease or injury outcome, after adjusting for known biases in the existing studies. Unlike existing approaches, our approach does not force log-linearity in risk functions or make additional approximations, such as midpoint approximations for ranges or shared reference groups22,23,24.

To quantify the effect of bias, we considered risk of bias criteria that inform GRADE20,21, Cochrane Reviews19 and evidence-based practice, and consulted widely outside of the Institute for Health Metrics and Evaluation, including with clinicians, physicians, medical and public health researchers and national health policy-makers (for example, former Ministers of Health). We encoded these variables that are used to assess risk of bias as potential study-level bias covariates within the proposed meta-analytic framework.

This approach complements GRADE and Cochrane Reviews, which require analysts to assess and flag risks of bias. We then developed the burden of proof risk function (BPRF), which complements the mean risk function and is defined as the smallest level of excess risk (that closest to no relationship) that is consistent with the data. To aid interpretation of the results, we classify risk–outcome pairs into five categories (star ratings of one to five) based on the average magnitude of the BPRF. To illustrate this approach to assessing risk–outcome relationships, we provide four selected examples, showing both weak and strong risk–outcome relationships.

Results

Overview

To support estimation of the BPRF, we developed a meta-analytic approach that addresses a number of issues that have previously limited interpretations of the available evidence. This approach relaxes the assumption that the relative risk of an outcome increases exponentially as a function of exposure, standardizes the assessment of outliers, explicitly handles the range of exposure in a study in both the ‘alternative’ groups (numerator) and ‘reference’ groups (denominator) of a relative risk, tests for systematic bias as a function of study design using automatic covariate selection, and quantifies between-study heterogeneity while adjusting for the number of studies.

Using unexplained between-study heterogeneity and accounting for small numbers of studies, we estimate the BPRF as the 5th (if harmful) or 95th (if protective) quantile risk curve closest to the null (relative risk equal to 1). We flag evidence of the small-study effect (significant association between mean effect and standard error) as an indicator of potential publication or reporting bias.
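The quantile construction above can be sketched in a few lines. This is a deliberately minimal illustration under stated assumptions—the function name, the toy values and the plain normal-quantile shift are ours; the published implementation also adjusts the heterogeneity estimate for small numbers of studies:

```python
# Minimal sketch of the BPRF quantile idea (an illustrative assumption,
# not the published implementation, which additionally adjusts for the
# number of studies).
import math

def bprf_log_rr(mean_log_rr, se_fixed, tau, harmful=True, z=1.645):
    """Return the 5th (harmful) or 95th (protective) quantile log relative
    risk, folding between-study heterogeneity (tau) into the total variance."""
    total_sd = math.sqrt(se_fixed ** 2 + tau ** 2)
    # Shift the mean curve toward the null (log RR = 0) by z * total_sd.
    return mean_log_rr - z * total_sd if harmful else mean_log_rr + z * total_sd

# Toy example: a mean log RR of log(2.0) with fixed-effects SE 0.10 and tau 0.20.
conservative = bprf_log_rr(math.log(2.0), 0.10, 0.20, harmful=True)
```

The key design point is that tau enters the quantile directly, so large unexplained between-study heterogeneity pulls the conservative curve toward the null even when the fixed-effects standard error is small.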

We evaluated the BPRF for 180 risk–outcome pairs in the GBD CRA framework. To simplify communication, we then computed the associated risk–outcome score (ROS) for each pair by averaging the BPRF across a relevant exposure interval and converted each ROS into a star rating from one to five.

One star refers to risk–outcome pairs where a conservative interpretation of the evidence—accounting for all uncertainty, including between-study heterogeneity—may suggest there is no association. Two to five stars refer to risk–outcome pairs where a conservative interpretation of the evidence may suggest that, for harmful effects, average exposure increases excess risk relative to the level of exposure that minimizes risk by 0–15% (two stars; weak evidence of association), >15–50% (three stars; moderate evidence of association), >50–85% (four stars; strong evidence of association) or >85% (five stars; very strong evidence of association); for protective effects, the corresponding decreases in excess risk relative to no exposure are 0–13% (two stars), >13–34% (three stars), >34–46% (four stars) and >46% (five stars).

The corresponding ROS thresholds for both harmful and protective risks are 0–0.14 for two stars, >0.14–0.41 for three stars, >0.41–0.62 for four stars and >0.62 for five stars. Of the 180 risk–outcome pairs investigated, 40 risk–outcome pairs were given a one-star rating, 72 pairs were given a two-star rating, 46 were given a three-star rating, 14 were given a four-star rating and 8 were given a five-star rating (Table 1). Here, we present results from each step of the evaluation process for four risk–outcome pairs to demonstrate how our methodology can be applied to pairs across the ROS spectrum and across a range of available study types and risk curve shapes, varying levels of between-study heterogeneity, and varying numbers of data points and studies. These four pairs also allow us to demonstrate how policy-makers should interpret our findings for both strong and weak risk–outcome relationships.
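The ROS-to-star mapping just described can be sketched as follows. The function name and the one-star condition (ROS below zero) are assumptions for illustration; the thresholds are those stated in the text, and the comments note how they line up, on the log scale, with the percent-change bands given for harmful and protective effects:

```python
# Sketch of the ROS-to-star-rating mapping using the thresholds stated
# in the text; function name and one-star condition are assumptions.
import math

def star_rating(ros):
    """Map a risk-outcome score (average log BPRF) to a 1-5 star rating."""
    if ros < 0:
        return 1  # conservative interpretation consistent with no association
    if ros <= 0.14:
        return 2  # weak evidence of association
    if ros <= 0.41:
        return 3  # moderate evidence of association
    if ros <= 0.62:
        return 4  # strong evidence of association
    return 5      # very strong evidence of association

# The ROS thresholds match the percent-change thresholds on the log scale:
# exp(0.14) - 1 is roughly 15% excess risk, exp(0.41) - 1 roughly 50% and
# exp(0.62) - 1 roughly 85%; for protective factors, 1 - exp(-0.14),
# 1 - exp(-0.41) and 1 - exp(-0.62) give roughly 13%, 34% and 46%.
```

For example, the smoking–lung cancer ROS of 0.73 reported later falls above 0.62 and so maps to five stars.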

Table 1 BPRF and ROS ranges associated with each star rating, and number of risk–outcome pairs assigned to each star rating

Smoking and lung cancer (five stars)

We used a standardized approach to search for and extract data from published studies on the relationship between pack-years smoked and the log relative risk of lung cancer, resulting in 371 observations from 25 prospective cohort studies and 53 case-control studies (three of them nested) reported from 1980 onwards (Fig. 1; step 1 in Methods)25. The studies spanned a wide range of exposure, from nearly 1 to over 112 pack-years. We found the 15th percentile of exposure in the reference group to be zero pack-years and the 85th percentile of exposure among exposed groups in the cohort studies to be 50.88 pack-years (Fig. 1a,b).

On average, we found a very strong relationship between pack-years of smoking and log relative risk of lung cancer (step 2 in Methods). At 20 pack-years, the mean relative risk (an effect size measure) was 5.11 (95% uncertainty interval (UI) 1.84–14.99), and at 50.88 pack-years (85th percentile of exposure) it was 13.42 (2.63–74.59) (Fig. 1b and Supplementary Table 1).

The relationship is not log-linear, with declining effects of additional pack-years of smoking, particularly after 40 pack-years. In the analysis of bias covariates (step 3 in Methods), we corrected data from studies that did not adjust for more than five confounders, including age and sex. There is enormous heterogeneity in the reported relative risk for lung cancer across studies (Fig. 1b; step 4 in Methods). In trimming 10% of observations, we identified and excluded observations both above and below the cloud of points (step 5 in Methods). The mixed-effects model fits the data well: the reported uncertainty, together with the estimated between-study heterogeneity, covers the estimated residuals, as Fig. 1c demonstrates.

Even taking the most conservative interpretation of the evidence—the 5th quantile risk function including between-study heterogeneity, or the BPRF—smoking dramatically increases the risk of lung cancer (Fig. 1a,b). There is evidence of potential reporting or publication bias (Fig. 1c). The BPRF suggests that smoking in the range of the 15th–85th percentiles of exposure raises the risk of lung cancer by an average of 106.7%, for an ROS of 0.73 (step 6 in Methods). These findings led us to classify smoking and lung cancer as a five-star risk–outcome pair.

Systolic blood pressure and ischemic heart disease (five stars)

We extracted 189 observations from 41 studies (39 RCTs, 1 cohort and 1 pooled cohort) quantifying the relationship between systolic blood pressure (SBP) and ischemic heart disease (Fig. 2)26. We included RCTs designed to compare the health effects of different levels of blood pressure. Head-to-head trials of drug classes or combinations not designed to achieve different levels of SBP were excluded.

We calculated the 15th percentile of exposure in the cohorts and trials to be an SBP of 107.5 mm Hg and the 85th percentile to be 165 mm Hg (Fig. 2a,b). The relationship is close to log-linear, although it appears to flatten out and deviate from the log-linear assumption above an SBP of 165 mm Hg (though the data are sparse above this level). An SBP of 140 mm Hg had a mean relative risk of ischemic heart disease of 2.38 (2.17–2.62) compared to 100 mm Hg, while an SBP of 165 mm Hg had a mean relative risk of 4.48 (3.81–5.26) compared to 100 mm Hg (Fig. 2b and Supplementary Table 2). Trimming removed 10% of outlying observations with high relative risk at SBP levels between 125 and 180 mm Hg and low relative risk at SBP levels between 130 and 175 mm Hg (Fig. 2b).

In the analysis of bias covariates, we found that none had a significant effect. Because the RCTs and cohorts are very consistent, and because there are many consistent studies within each type, between-study heterogeneity is small (Fig. 2a,b). While there is little asymmetry in the funnel plot (Fig. 2c), we found statistically significant evidence of small-study bias using Egger's regression (P value < 0.05). Given the small between-study heterogeneity, the BPRF suggests that SBP in the range from the 15th to the 85th percentile of exposure raises the risk of ischemic heart disease by an average of 101.36%, for an ROS of 0.70. These findings led us to classify SBP and ischemic heart disease as a five-star risk–outcome pair.
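The small-study effect tested here—an association between reported effect size and standard error—can be illustrated with a simple least-squares sketch. The function name and toy data are hypothetical, and the published analysis uses the full Egger's regression formulation rather than this bare slope:

```python
# Illustrative least-squares sketch of a small-study test: regress
# reported effect sizes (log RRs) on their standard errors. Hypothetical
# function name and toy data; not the published Egger's implementation.
def egger_slope(effects, std_errs):
    """Ordinary least-squares slope of effect size on standard error."""
    n = len(effects)
    mean_se = sum(std_errs) / n
    mean_eff = sum(effects) / n
    sxx = sum((se - mean_se) ** 2 for se in std_errs)
    sxy = sum((se - mean_se) * (eff - mean_eff)
              for se, eff in zip(std_errs, effects))
    return sxy / sxx

# Toy data in which smaller studies (larger SE) report larger log RRs,
# the asymmetric pattern that flags potential publication or reporting bias.
effects = [0.10, 0.15, 0.25, 0.40, 0.55]
std_errs = [0.05, 0.10, 0.20, 0.35, 0.50]
slope = egger_slope(effects, std_errs)  # positive: small-study effect
```

A slope significantly different from zero is the signal; a full test would also compute its standard error and P value.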

Vegetable consumption and ischemic heart disease (two stars)

Figure 3 summarizes the cohort data on vegetable consumption and ischemic heart disease using 78 observations from 17 cohort studies27. The relationship is not log-linear.

We found that on average, vegetable consumption was protective, with the relative risk of ischemic heart disease being 0.81 (0.74–0.89) at 100 grams per day vegetable consumption compared to 0 grams per day (Supplementary Table 3). Incrementally higher levels of exposure are associated with less steep declines in relative risk compared to those at lower levels of exposure (Fig. 3b). For this pair, trimming removed one observation that suggested a weaker protective effect size than the mean estimate, and seven observations that suggested a stronger protective effect than the mean estimate. Including between-study heterogeneity expanded the UI only slightly (Fig. 3a,b), suggesting strong agreement between studies.

In the analysis of bias covariates, three were found to have a significant effect: incomplete confounder adjustment, incidence outcomes only and mortality outcomes only. The funnel plot (Fig. 3c) shows that, after trimming, the residual standard error (reflecting both study data variance and between-study heterogeneity) is within the expected range of the model. While there is little asymmetry in the funnel plot (Fig. 3c), we found statistically significant evidence of small-study bias using Egger's regression (P value = 0.044). The BPRF suggests that vegetable consumption in the range of the 15th to the 85th percentile lowers the risk of ischemic heart disease by 12.10% on average (ROS of 0.13). This leads to vegetable consumption and ischemic heart disease being classified as a two-star pair.

Unprocessed red meat and ischemic heart disease (two stars)

We identified 43 observations from 11 prospective cohort studies on unprocessed red meat and ischemic heart disease (Fig. 4)28. At an exposure of 50 grams per day, the mean relative risk is 1.09 (0.99–1.18) compared to 0 grams per day, and at 100 grams per day, it is 1.12 (0.99–1.25) (Fig. 4b and Supplementary Table 4).

In the analysis of bias covariates, we found that none had a significant effect. Trimming removed five observations that reported extreme values across the range of red meat consumption. There is no visual evidence or finding of potential publication or reporting bias (Fig. 4c).

For unprocessed red meat and ischemic heart disease, the exposure-averaged BPRF is 0.01, essentially on the null threshold (Fig. 4a), equating to an ROS of 0.01 and a corresponding increase in risk of 1.04%. These findings led this risk–outcome pair to be classified as a nominal two-star pair, on the threshold between weak evidence and no evidence of association.

Model validation

To validate key aspects of the meta-regression tool, we ran detailed simulation experiments (step 7 in Methods). We found that the approach proposed in this study outperformed existing approaches, particularly for non-log-linear relationships (Fig. 5 and Extended Data Figs. 1–6).

Discussion

Using a meta-analytic approach built with open-source tools, we estimated both the mean risk function and the BPRF for 180 risk–outcome pairs and assigned each pair a star rating based on the strength of the evidence (indicated by the ROS, which averages the BPRF across a standard exposure range) and the severity of the risk. We achieved this by capturing the shape of the relationship between exposure and the risk of an outcome, detecting outliers using robust statistical methodology (trimming), testing and correcting for bias related to study design, and estimating between-study heterogeneity, adjusted for the number of studies.

The BPRF is the level of elevated risk for a harmful factor (or the level of reduced risk for a protective factor) based on the most conservative (closest to null) interpretation compatible with the available evidence. It is a reflection of both the magnitude of the risk and the extent of the uncertainty surrounding the mean risk function. The four examples in the results section demonstrate the range of evidence, between-study heterogeneity and mean relative risks across risk–outcome pairs, and how these factors impact the BPRF and star rating. Importantly, only 22 of 180 pairs received a four- or five-star rating (12.22%), whereas 112 received a one- or two-star rating (62.22%).

The BPRF and associated star ratings, as well as the background rates of burden for the outcomes of concern, are intended to be useful for informing individual choices on risk exposure.

For example, harmful risk–outcome pairs with four- and five-star ratings are associated with an increase in risk of more than 50% for the exposed (and protective pairs with a decrease in risk of more than 34%), even under the most conservative interpretation of the evidence. For these risks, the mean effect size is often much higher. Harmful risk–outcome pairs with three stars have average increases in risk of more than 15% and up to 50% (and protective pairs decreases of more than 13% and up to 34%), even in the BPRF, and the increase may be much higher depending on the individual level of risk exposure.

Further, some risks have high star ratings for multiple outcomes, such as high systolic blood pressure increasing risk of ischemic heart disease and stroke, and smoking increasing risk of lung cancer, aortic aneurysm, peripheral artery disease, laryngeal cancer and other pharynx cancer (all five-star pairs), which should be considered when making individual decisions around risk exposure.

Conversely, individuals can reasonably pay less attention to risks with a one-star rating. These may be real risks with small but meaningful benefits for individuals if their exposure is reduced, but the existing evidence is too limited to make stronger conclusions. Of course, individual choice should also be informed by the background risk of an outcome for an individual and the totality of risk–outcome pairs associated with a risk; a five-star relationship for a rare outcome may not be something that an individual would choose to act on, whereas three-star ratings for one risk and a set of common outcomes may warrant more action.

While the general public and committees formulating guidelines on individual behaviors—such as recommended diets—should pay attention to the star ratings, policy-makers should consider the impact of all risk–outcome pairs, not only those with high star ratings. These higher-star relationships should reassure decision-makers that the evidence supporting a risk factor is strong, but it would be unwise for decision-makers to ignore all one- and two-star risk–outcome pairs.

The precautionary principle implies that public policy should pay attention to all potential risks. Lower star rating risk–outcome pairs may turn out to be null as evidence accumulates, but it is unlikely that a set of one-star risks will all turn out to be null. Public policy to address risks, even those where the BPRF indicates that risk is small or even nonexistent, will, on average, improve health. At the same time, investing in more widespread data collection for pairs with lower star ratings will reduce uncertainty and allow policy-makers to be more strategic in addressing potential risks (as star ratings may go up or down with more evidence).

For example, due to very high heterogeneity between studies, a conservative interpretation of the available evidence suggests that there is weak to no evidence of an association between red meat consumption and ischemic heart disease. There is, therefore, a critical need for more large-scale, high-quality studies on red meat consumption so policy-makers can make better-informed decisions about how to prioritize policies that address this potential risk.

Moreover, public policy should pay attention not only to the risk functions that are supported by evidence but the prevalence of exposure to those risks. For example, a two-star risk with high prevalence of exposure could pose a greater risk at the population level than a five-star risk with low prevalence of exposure. The GBD CRA1,2,3 provides a framework for incorporating the BPRF, the prevalence of exposure and the background rates of specific outcomes to help policy-makers evaluate the importance of risk–outcome pairs across the full range of star ratings. In the future, risk–outcome pairs with one- and two-star ratings should be investigated further through more robust, well-powered research, especially for those risks where exposure and outcome are common, so policy-makers and individuals alike can better understand whether there is a real association between risk and outcome.

The BPRF and associated star ratings have immediate applications for GBD and its users. For GBD 2020, 180 risk–outcome pairs have so far been analyzed using this approach. The remaining risk–outcome pairs will be evaluated using this meta-analytic approach in subsequent GBD rounds. Since different users will be interested in GBD results for particular star-rating categories, we have developed online visualization tools (https://vizhub.healthdata.org/burden-of-proof/) that allow users to filter results by star rating. Providing dynamic tools with this capability will empower users with different thresholds for considering risk–outcome pairs and will allow broader audiences to access this information. These tools are intended to fill a gap in the landscape of risk assessment accessibility and transparency.

The standard approach to estimate the relationship between a risk and outcome has been to compute the mean across the universe of studies. We believe, however, that it is useful to report both the mean risk function and the BPRF, and that the more conservative interpretation may be more appropriate, particularly for exposures associated with small increases in risk, because of the risk of residual confounding. By including between-study heterogeneity in the uncertainty estimation and using this estimated uncertainty to compute a 5th or 95th quantile risk function (our BPRF), our risk assessment accounts for results that vary drastically across studies even after correcting for biases due to study design. This highlights the importance of accounting for unexplained between-study heterogeneity when estimating uncertainty and significance testing.

In particular, when the BPRF spans zero (that is, the risk is one star), a conservative interpretation of the evidence is consistent with no association between the risk and the outcome. We argue that the field should eventually move to incorporating between-study heterogeneity into significance testing of the mean function. Our meta-analytic approach uses splines to estimate the shape of the risk function without imposing a functional form such as log-linearity, and can be widely applied to other risk–outcome pairs not included in this analysis. This flexibility is an important strength of our approach because many risk–outcome pairs do not have a log-linear relationship.
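The benefit of a flexible fit over a forced log-linear one can be seen even in a deliberately minimal "broken-stick" sketch: one interior knot, ordinary least squares. Function names, knot placement and toy data are illustrative assumptions; the spline-based meta-regression described here is far more flexible than this:

```python
# Minimal "broken-stick" spline sketch: one interior knot, OLS fit.
# Illustrative only; the meta-regression approach described in the text
# is far more general (flexible splines, random effects, trimming).
import numpy as np

def fit_linear_spline(exposure, log_rr, knot):
    """OLS fit of log RR = b0 + b1 * x + b2 * max(x - knot, 0)."""
    x = np.asarray(exposure, dtype=float)
    y = np.asarray(log_rr, dtype=float)
    design = np.column_stack([np.ones_like(x), x, np.maximum(x - knot, 0.0)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, x, knot):
    """Evaluate the fitted broken-stick curve at exposure x."""
    return coef[0] + coef[1] * x + coef[2] * max(x - knot, 0.0)

# Toy data mimicking a risk curve that flattens at high exposure,
# a shape a forced log-linear fit would distort at both ends.
exposure = [0, 10, 20, 30, 40, 60, 80]
log_rr = [0.0, 0.8, 1.6, 2.4, 3.2, 3.6, 4.0]
coef = fit_linear_spline(exposure, log_rr, knot=40.0)
```

Here the fitted slope above the knot is much shallower than below it, which is exactly the detail a single log-linear coefficient would average away.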

When there are strong threshold effects, log-linear risk functions can exaggerate risk at higher exposure levels and obfuscate important detail at lower exposure levels. This more flexible approach helps identify the true shape of the risk function. Previously, the main challenge had been that if
