Causal Attribution

Did the Program Cause the Outcome?

Module 5: Program Evaluation and Outcomes | Depth: Application | Target: ~1,500 words

Thesis: The hardest question in program evaluation is “did the program cause the outcome?” — and most evaluations answer it poorly because they confuse correlation with causation or fail to construct a credible counterfactual.


The Attribution Problem

A federally qualified health center implements a HRSA-funded behavioral health integration initiative. Twelve months later, PHQ-9 screening rates have increased from 18% to 53%. The semi-annual progress report states: “The program resulted in a 35-percentage-point increase in behavioral health screening.” The funder reads this and should ask the question that most progress reports never answer: how do you know the program caused the increase?

This is the attribution problem. It is the central challenge in program evaluation, and it is the challenge that most grant-funded programs handle poorly or not at all. The program implemented an intervention. An outcome changed. The report asserts a causal link between the two. But temporal sequence is not causation. The fact that the outcome changed after the program was implemented does not establish that the outcome changed because the program was implemented. At least four alternative explanations exist for any observed improvement:

Secular trend. The outcome was already improving before the program began, and would have continued improving without it. If statewide PHQ-9 screening rates increased 20 percentage points over the same period due to CMS quality measure incentives and EHR vendor updates that embedded screening prompts, then the program’s incremental contribution is not 35 points. It is 35 minus whatever portion of the increase would have occurred anyway, which on these figures leaves roughly 15 points.

Confounding interventions. Other changes occurred during the same period that could explain the outcome. A new behavioral health provider was hired outside the grant at one of three sites. The health system implemented a new EHR with embedded clinical decision support. The state Medicaid program added a quality bonus for depression screening. Each of these could independently increase screening rates. The grant program may have contributed, but it was not the only thing that changed.

Regression to the mean. If the baseline period was unusually low — a COVID-era trough, a staffing crisis, a system migration that disrupted workflows — the observed improvement may simply be a return to normal. An organization that measures its baseline during its worst quarter will observe improvement that is partly recovery, not program effect.

Selection bias. If the program targeted sites or populations that were already motivated to change, the observed improvement may reflect pre-existing momentum rather than program impact. Sites that apply for behavioral health integration grants are, by definition, sites that have identified behavioral health as a priority. Their trajectory may differ from non-applicant sites regardless of the grant.
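Of the four explanations above, regression to the mean is the easiest to underestimate, so a small simulation may help. The sketch below is illustrative only: the true rate, the noise level, and the number of quarters are assumed values, not figures from the example. No program exists anywhere in the simulation, yet a baseline taken in the worst quarter still produces an apparent gain.

```python
import random

def simulated_recovery(n_sites=2000, quarters=8, true_rate=30.0, noise_sd=5.0, seed=0):
    """Average apparent gain when the baseline is the worst of several noisy quarters.

    Every quarter is drawn around the same true_rate (an assumed value), so any
    'improvement' here is regression to the mean, not a program effect.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(n_sites):
        history = [rng.gauss(true_rate, noise_sd) for _ in range(quarters)]
        baseline = min(history)                     # baseline measured in the worst quarter
        follow_up = rng.gauss(true_rate, noise_sd)  # next quarter, still no program
        gains.append(follow_up - baseline)
    return sum(gains) / len(gains)

average_gain = simulated_recovery()
print(f"Average apparent gain with no program: {average_gain:.1f} points")
```

Under these assumptions the average gain is positive even though nothing was done, which is the pattern an anomalous baseline period can create in a real progress report.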

The attribution problem is not a statistical nuisance. It is the difference between knowing whether a $2.4 million investment produced results and hoping it did. Shadish, Cook, and Campbell (2002) define internal validity as the degree to which a study supports a causal inference — and they demonstrate, comprehensively, that observational designs without counterfactual reasoning have weak internal validity regardless of how carefully the outcome is measured.


The Counterfactual

To attribute an outcome to a program, you must answer a question that cannot be directly observed: what would have happened without the program? This unobserved scenario is the counterfactual. It is the foundation of all causal inference, and it is never directly available. It must be constructed through evaluation design.

The fundamental problem of causal inference, as Angrist and Pischke state in Mostly Harmless Econometrics (2009), is that we can never observe the same unit in both the treated and untreated state at the same time. The FQHC either received the grant or it did not. We observe the outcome under the condition that actually occurred. The outcome under the alternative condition — the counterfactual — must be estimated.
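The same idea can be written in standard potential-outcomes notation. The symbols are introduced here for illustration and are not taken from the text above: Y_i(1) and Y_i(0) are unit i's outcomes with and without the program, and D_i is 1 if the unit actually received it.

```latex
\begin{aligned}
\tau_i &= Y_i(1) - Y_i(0) && \text{(unit-level program effect; never observed)}\\
Y_i &= D_i\,Y_i(1) + (1 - D_i)\,Y_i(0) && \text{(what is actually observed)}\\
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{naive comparison}}
  &= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{effect on the treated}}
   + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
\end{aligned}
```

The last line shows why a raw comparison of funded and unfunded sites is not an effect estimate: it equals the effect on the treated only when the selection-bias term is zero.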

How credibly the counterfactual is constructed determines how credibly the program effect is estimated. This is not a matter of statistical sophistication. It is a matter of evaluation design — decisions made before data is collected about how the counterfactual will be approximated.


Evaluation Designs Ranked by Attribution Strength

Not all evaluation designs are equally credible for causal attribution. The hierarchy matters because grant programs must choose a design, and the choice determines whether the resulting evidence will survive scrutiny.

Randomized controlled trials (RCTs). The gold standard for causal inference. Eligible units (clinics, patients, communities) are randomly assigned to receive the program or not. Randomization ensures that, in expectation, the treatment and control groups are identical on all characteristics, observed and unobserved, except the program. Any difference in outcomes beyond chance variation can therefore be attributed to the program. RCTs are rarely feasible for grant-funded programs: funders do not typically fund programs and then withhold the intervention from half the recipients. Ethical and political constraints further limit randomization in healthcare service delivery. But where randomization is feasible, for example through cluster-randomized trials across clinic sites or wait-list control designs that stagger implementation, RCTs provide the strongest causal evidence. The Community Preventive Services Task Force and Biglan et al. (2000) have demonstrated that community-level RCTs are possible when designed with stakeholder engagement and phased rollout.
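A wait-list design of the kind described above can be as simple as a seeded shuffle of eligible sites. The following is a minimal sketch, assuming six hypothetical clinics and a two-arm split; the site names, the seed, and the function name are all illustrative.

```python
import random

def assign_clusters(sites, seed=2024):
    """Randomly split sites into an immediate arm and a wait-list arm."""
    rng = random.Random(seed)   # fixed seed so the assignment can be reproduced and audited
    shuffled = list(sites)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"immediate": sorted(shuffled[:half]), "wait_list": sorted(shuffled[half:])}

# Hypothetical site names.
sites = ["Clinic A", "Clinic B", "Clinic C", "Clinic D", "Clinic E", "Clinic F"]
arms = assign_clusters(sites)
print(arms)
```

Recording the seed and the site list before assignment is one way to show a reviewer that the split was not chosen after the fact.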

Quasi-experimental designs. The practical workhorse for grant program evaluation. These designs construct a counterfactual without randomization, using statistical methods to approximate what randomization would have achieved. Three methods are most relevant:

Difference-in-differences (DiD). Compare the change in the outcome for the program group to the change in the outcome for a comparison group over the same period. The program effect is the difference between the two changes. If the program sites’ screening rate increased 35 points while comparable non-program sites’ screening rate increased 20 points, the estimated program effect is 15 points. DiD requires a credible comparison group: sites that are similar enough to the program sites that their trend would have been the same absent the program. The critical assumption is parallel trends: the two groups would have changed at the same rate without the intervention. This assumption can be probed with pre-intervention data (do the groups show similar trends before the program?), but it cannot be proven, because it concerns what would have happened after the program began.
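The DiD arithmetic above can be sketched in a few lines. The program-site figures (18% to 53%) come from the screening example; the comparison-site levels and both pre-period histories are hypothetical values chosen only to reproduce the 20-point change and to illustrate an informal parallel-trends check.

```python
def did_estimate(program_pre, program_post, comparison_pre, comparison_post):
    """Change in the program group minus change in the comparison group."""
    return (program_post - program_pre) - (comparison_post - comparison_pre)

def average_step(series):
    """Mean period-to-period change across a pre-intervention series."""
    return (series[-1] - series[0]) / (len(series) - 1)

# Program sites: 18% -> 53%. Comparison sites: hypothetical 22% -> 42%.
effect = did_estimate(18.0, 53.0, 22.0, 42.0)
print(f"DiD estimate: {effect:.0f} percentage points")

# Hypothetical quarterly rates before the program, used to compare pre-period trends.
program_history = [15.0, 16.0, 17.0, 18.0]
comparison_history = [19.0, 20.0, 21.0, 22.0]
gap = average_step(program_history) - average_step(comparison_history)
print(f"Pre-period trend gap: {gap:.1f} points per quarter")
```

A gap near zero is consistent with parallel trends but does not establish it. A full analysis would normally use a regression with standard errors rather than four summary numbers.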

Interrupted time series (ITS). Measure the outcome at many time points before and after the intervention. The program effect is estimated as the change in level or trend at the intervention point, after accounting for the pre-existing trajectory. If PHQ-9 screening rates were increasing at 1 point per quarter before the program and jumped 10 points at implementation before continuing to increase at 2 points per quarter, the ITS analysis can decompose the total change into pre-existing trend, level shift at intervention, and trend change at intervention. ITS requires many pre- and post-intervention data points — a minimum of 8-12 per period is recommended by the Cochrane Effective Practice and Organisation of Care group. It does not require a comparison group but is strengthened by one (controlled ITS). The method is well-suited to grant programs because many healthcare outcomes are routinely measured over time.
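The decomposition described above can be illustrated with a deliberately simple sketch: fit one straight line to the pre-period and one to the post-period, then compare them at the intervention point. The quarterly series is synthetic and noise-free, built to match the numbers in the paragraph (1 point per quarter before, a 10-point jump, 2 points per quarter after). The function names are illustrative, and a real ITS analysis would also need to address autocorrelation and seasonality, which this sketch ignores.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def its_decompose(times, values, start):
    """Split a series at `start` and report pre-trend, level shift, and trend change."""
    pre = [(t, y) for t, y in zip(times, values) if t < start]
    post = [(t, y) for t, y in zip(times, values) if t >= start]
    pre_slope, pre_intercept = fit_line(*zip(*pre))
    post_slope, post_intercept = fit_line(*zip(*post))
    projected = pre_intercept + pre_slope * start    # where the old trend would have been
    fitted = post_intercept + post_slope * start     # where the new line actually starts
    return {
        "pre_trend": pre_slope,
        "level_shift": fitted - projected,
        "trend_change": post_slope - pre_slope,
    }

# Synthetic data: 8 quarters before and 8 after the intervention at quarter 8.
times = list(range(16))
values = [11.0 + t for t in range(8)] + [29.0 + 2.0 * (t - 8) for t in range(8, 16)]
result = its_decompose(times, values, start=8)
print(result)
```

Fitting two separate lines gives the same fitted values as a single segmented regression with level-change and slope-change terms. It is used here only because it needs no external libraries.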

Propensity score matching (PSM). Construct a comparison group by matching program participants to non-participants who had similar probability of participating, based on observable characteristics. PSM addresses selection bias but only on observed variables — it cannot account for unmeasured differences between groups. Rosenbaum (2002) provides the foundational framework and sensitivity analyses for assessing how robust PSM results are to hidden bias.
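The matching step can be sketched as follows, assuming propensity scores have already been estimated elsewhere (for example with a logistic regression, which is not shown). All records, the `ps` and `outcome` field names, and the 0.05 caliper are hypothetical choices for illustration. Matching is nearest-neighbor with replacement.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Pair each treated unit with the control whose propensity score is closest.

    Treated units with no control inside the caliper are dropped, not force-matched.
    """
    pairs = []
    for unit in treated:
        nearest = min(controls, key=lambda c: abs(c["ps"] - unit["ps"]))
        if abs(nearest["ps"] - unit["ps"]) <= caliper:
            pairs.append((unit, nearest))
    return pairs

def att(pairs):
    """Average outcome difference across matched pairs (effect on the treated)."""
    return sum(t["outcome"] - c["outcome"] for t, c in pairs) / len(pairs)

# Hypothetical data: ps = estimated propensity score, outcome = screening rate (%).
treated = [{"ps": 0.80, "outcome": 50}, {"ps": 0.60, "outcome": 44}, {"ps": 0.95, "outcome": 58}]
controls = [{"ps": 0.78, "outcome": 41}, {"ps": 0.62, "outcome": 36}, {"ps": 0.30, "outcome": 25}]

pairs = match_nearest(treated, controls)
print(f"Matched {len(pairs)} of {len(treated)} treated units; ATT = {att(pairs):.1f} points")
```

The treated unit with a score of 0.95 has no comparable control and is dropped. The estimate then describes only the matched units, and it still rests on the assumption that no unmeasured variable drives both participation and outcome.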

Pre-post comparison. Measure the outcome before the program and after the program. Calculate the difference. This is what most grant progress reports do. It is the weakest design with any quantitative component because it conflates the program effect with every other change that occurred during the same period. The 35-point screening increase includes the program effect, the secular trend, the confounding interventions, and regression to the mean — all mixed together with no way to separate them. Pre-post comparison is not worthless — it establishes that the outcome changed — but it cannot attribute that change to the program.

Narrative evaluation. “We implemented the program. Things improved.” No quantitative measurement of outcomes, no baseline, no comparison, no counterfactual. This is not evaluation. It is a program description. It survives only when the funder does not scrutinize the evidence.


Healthcare Example: Telehealth Program Attribution

A HRSA-funded telehealth program deploys behavioral health video visits at 3 rural critical access hospital-affiliated clinics over 18 months. The grant final report states: “Behavioral health visit volume increased 35% following telehealth implementation.”

The 35% number is real. The question is what caused it.

The naive claim: The telehealth program produced a 35% increase in BH visit volume. This is a pre-post comparison with no counterfactual.

Challenge 1: Secular trend. Statewide behavioral health visit volume increased 20% over the same 18-month period, driven by expanded Medicaid BH coverage, parity enforcement, and growing public awareness. The telehealth program’s incremental effect above the secular trend is at most 15 percentage points.

Challenge 2: Confounding intervention. One of the three sites hired a new licensed clinical social worker during the grant period using non-grant funds. That site’s BH volume increased 55%; the other two averaged 25%. The new hire, not the telehealth platform, likely drove the outlier site’s results. Including the confounded site inflates the program-wide estimate.

Challenge 3: Regression to the mean. The 12-month pre-period included 6 months of COVID-era volume depression. BH visits during the baseline were 30% below the sites’ 3-year average. Some of the observed “increase” is recovery to normal volume, not program effect.
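The arithmetic behind Challenges 1 and 2 can be checked with a short sketch. It assumes the three sites had equal baseline visit volume (100 visits each, a hypothetical figure the example does not state), which is the case in which growth of 55%, 25%, and 25% pools to exactly 35%.

```python
def pooled_growth(pre_visits, post_visits):
    """Growth in total visits across a set of sites, as a fraction of baseline."""
    return (sum(post_visits) - sum(pre_visits)) / sum(pre_visits)

# Hypothetical equal baselines; post-period volumes reflect 55%, 25%, and 25% growth.
pre = [100, 100, 100]
post = [155, 125, 125]

all_sites = pooled_growth(pre, post)
unconfounded = pooled_growth(pre[1:], post[1:])   # drop the site with the new hire
statewide = 0.20                                  # statewide growth from Challenge 1
ceiling_points = (all_sites - statewide) * 100    # crude ceiling, before other adjustments

print(f"All three sites: {all_sites:.0%}")
print(f"Excluding the confounded site: {unconfounded:.0%}")
print(f"Growth above the statewide trend: {ceiling_points:.0f} percentage points")
```

If baseline volumes differed across sites, the pooled figure would be a volume-weighted average and could sit closer to or further from the outlier site's 55%.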

Rigorous analysis: An interrupted time series using 24 months of pre-implementation monthly visit data and 18 months of post-implementation data, controlling for statewide BH visit trends, estimates the program effect at 8-12%, a real but substantially smaller effect than the naive 35% claim. Excluding the confounded site lowers the estimate further, to 6-10%.

What the grant report should say: “BH visit volume increased 35% during the grant period. Adjusting for statewide trends and site-level confounders using interrupted time series analysis, we estimate the telehealth-attributable increase at 8-12%. The program produced a meaningful but modest effect on visit volume, with larger effects on access metrics (reduced travel burden, expanded appointment availability) that are less susceptible to the confounders affecting volume.”

Both numbers, the 35% and the 8-12%, belong in the report. The methodology should be transparent. Programs that report only the 35% will eventually encounter a reviewer who asks the counterfactual question. Programs that have already answered it demonstrate evaluation credibility that strengthens continuation applications.


Why This Matters for Grants

The stakes are shifting. Federal funders — particularly HRSA and SAMHSA — are increasing expectations for evaluation rigor. The CDC’s Framework for Program Evaluation in Public Health (1999) established the standard: evaluation should address attribution, not merely document implementation. HRSA’s Evidence-Based Practice guidelines and SAMHSA’s GPRA requirements both push toward outcome measurement with credible methodology.

Programs that report only pre-post improvements without addressing alternative explanations are increasingly vulnerable. Peer reviewers on continuation applications will ask about comparison groups. Federal project officers will ask what else changed during the grant period. External evaluators hired to assess program portfolios will classify pre-post evidence as weak. The program may have produced genuine impact, but if the evaluation design cannot demonstrate it, the impact is invisible to the people who make funding decisions.

The practical implication: evaluation design is not a Year 3 activity. The choice of method, the identification of comparison groups or the establishment of baseline time series, and the data collection plan must be built into the program from the start. A program that decides at month 30 to conduct a difference-in-differences analysis but never identified a comparison group or collected comparison data has foreclosed its best option. The CDC evaluation framework is explicit: evaluation planning should occur during program design, not after implementation.


The Product Owner Lens

What is the funding/compliance/execution problem? Programs claim outcomes they cannot credibly attribute to the intervention, producing evaluation evidence that does not survive scrutiny and weakens continuation applications.

What mechanism explains the operational bottleneck? Causal attribution requires a counterfactual that is never directly observed and must be constructed through evaluation design decisions made before the program launches. Most programs do not make these decisions, defaulting to pre-post comparison that conflates program effects with secular trends, confounders, and regression to the mean.

What controls or workflows improve it? Require evaluation design specification at the grant application stage. Identify comparison groups or establish baseline time series before intervention launch. Build data collection for both program and comparison conditions into operational workflows.

What should software surface? Outcome trend visualization with pre-intervention trajectory extrapolated forward — showing the gap between “what happened” and “what would have happened at the pre-existing rate.” Side-by-side comparison group tracking when DiD design is in use. Automated flagging of confounding events (staffing changes, policy changes, system transitions) that occur during the program period and must be addressed in the evaluation narrative. Time series data sufficiency indicator — does the program have enough pre-period data points for ITS analysis?

What metric reveals risk earliest? Baseline data adequacy at program launch. If the program has fewer than 8 pre-intervention time points for its primary outcome, the ITS option is foreclosed. If no comparison group has been identified by month 3, the DiD option is foreclosed. These are evaluation capacity indicators that predict, at program start, whether the final evaluation will be able to answer the attribution question.
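The indicators in this section could be surfaced with very little logic. The sketch below is one possible shape, not a specification: the class, field, and function names are hypothetical, the thresholds are the ones stated above (8 pre-period points, month 3), and the event log is invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProgramStatus:
    pre_period_points: int        # pre-intervention observations of the primary outcome
    months_since_launch: int
    comparison_group_named: bool

def foreclosed_designs(status, min_its_points=8, did_deadline_month=3):
    """Return the designs that the thresholds in the text would rule out."""
    foreclosed = []
    if status.pre_period_points < min_its_points:
        foreclosed.append("ITS")
    # One reading of 'by month 3': no comparison group named once month 3 is reached.
    if not status.comparison_group_named and status.months_since_launch >= did_deadline_month:
        foreclosed.append("DiD")
    return foreclosed

def events_in_program_period(events, start, end):
    """Flag logged (name, date) events that fall inside the program period."""
    return [name for name, when in events if start <= when <= end]

status = ProgramStatus(pre_period_points=5, months_since_launch=4, comparison_group_named=False)
print(foreclosed_designs(status))

# Hypothetical event log and program dates.
events = [
    ("LCSW hired at one site", date(2023, 5, 1)),
    ("EHR migration", date(2021, 11, 15)),
    ("Medicaid screening bonus added", date(2023, 9, 1)),
]
flagged = events_in_program_period(events, date(2023, 1, 1), date(2024, 6, 30))
print(flagged)
```

Running the first check at launch, not at closeout, is the point: it reports which designs remain available while there is still time to collect the missing data.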


Warning Signs

The progress report uses only pre-post numbers. If every outcome is reported as “X increased from baseline of Y to current value of Z,” with no discussion of what else changed during the period, the evaluation cannot support a causal claim.

No comparison group was ever identified. If the evaluation plan names no comparison sites, no comparison populations, and no external trend data, the program has no counterfactual. Every reported improvement is unattributed.

The baseline was measured during an anomalous period. If the pre-period includes a pandemic trough, a staffing crisis, or a system migration, the baseline is artificially depressed. Apparent improvement is partly recovery, and the evaluation should acknowledge this.

The evaluation was designed after the program ended. Retrospective evaluation design cannot recover data that was never collected. If comparison group data, pre-period time series, or confounding event logs were not captured prospectively, the strongest methods are unavailable.

Evaluators report only favorable findings. Human Factors Module 4 describes confirmation bias: the tendency to seek, interpret, and recall information that confirms prior beliefs. Evaluators who are hired by the program, embedded in the program, or invested in the program’s success are susceptible to this bias. They may unconsciously select comparison periods, comparison groups, or analytical specifications that produce the most favorable estimates. Independent evaluation — or at minimum, pre-registered analysis plans that specify methods before results are seen — is the control.


Integration Hooks

Human Factors Module 4 (Confirmation Bias and Decision Science). Confirmation bias is the evaluator’s occupational hazard. Program staff who designed the intervention and spent three years implementing it are psychologically invested in finding that it worked. This investment shapes every evaluation decision: which baseline period to use, which comparison group to select, which outlier to exclude, which analytical specification to report. Shadish, Cook, and Campbell (2002) identify experimenter expectancies as a threat to construct validity. The mechanism is the same one HF M4 describes for clinical decision-making: prior beliefs weight the interpretation of ambiguous evidence. The control is structural: pre-specification of the analysis plan, independent evaluation, or at minimum, sensitivity analysis that shows how results change under different reasonable assumptions.

Operations Research Module 6 (Simulation and Monte Carlo Methods). When natural comparison groups do not exist and pre-period data is insufficient for time series analysis, simulation offers an alternative path to the counterfactual. Monte Carlo methods can model the expected outcome trajectory under specified assumptions about secular trends, seasonal patterns, and known confounders — generating a synthetic counterfactual distribution rather than a point estimate. This does not replace quasi-experimental evidence, but it provides a principled way to bound the program effect when stronger designs are infeasible. The simulation approach is particularly valuable for small rural programs where comparison sites with similar characteristics may not exist and where limited patient volumes make statistical methods underpowered.
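A minimal sketch of the simulation idea follows. Every parameter is an assumption made for illustration: the secular trend is drawn uniformly between 15 and 25 points (centered on the 20-point statewide figure used earlier), noise is normal with a standard deviation of 3 points, and the observed change is the 35-point screening increase. The output is a range for the program effect, not a single estimate.

```python
import random

def simulate_effect_interval(observed_change=35.0, trend_low=15.0, trend_high=25.0,
                             noise_sd=3.0, n_draws=10000, seed=0):
    """Return (5th percentile, median, 95th percentile) of simulated program effects."""
    rng = random.Random(seed)
    effects = []
    for _ in range(n_draws):
        secular = rng.uniform(trend_low, trend_high)  # assumed range for the secular trend
        noise = rng.gauss(0.0, noise_sd)              # assumed seasonal and measurement noise
        counterfactual_change = secular + noise       # change expected without the program
        effects.append(observed_change - counterfactual_change)
    effects.sort()
    return effects[int(0.05 * n_draws)], effects[n_draws // 2], effects[int(0.95 * n_draws)]

low, median, high = simulate_effect_interval()
print(f"Simulated program effect: median {median:.1f}, 90% range {low:.1f} to {high:.1f} points")
```

The width of the range comes entirely from the assumptions fed in, so a report using this approach should state them and show how the range moves when they change.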


Key Frameworks and References

  • Shadish, Cook, and Campbell (2002), Experimental and Quasi-Experimental Designs for Generalized Causal Inference — the definitive reference on threats to internal validity and evaluation design for causal attribution
  • Angrist and Pischke (2009), Mostly Harmless Econometrics — accessible treatment of the counterfactual framework, difference-in-differences, instrumental variables, and regression discontinuity
  • Rosenbaum (2002), Observational Studies — foundational framework for propensity score methods and sensitivity analysis for hidden bias
  • CDC Framework for Program Evaluation in Public Health (1999) — establishes evaluation planning as concurrent with program design, not a post-hoc activity
  • Biglan et al. (2000), “The Integration of Research and Practice in the Prevention of Youth Problem Behaviors” — demonstrates feasibility of community-level experimental and quasi-experimental designs
  • Cochrane EPOC (Effective Practice and Organisation of Care) — guidelines for interrupted time series design, including minimum data point requirements
  • W.K. Kellogg Foundation Logic Model Development Guide (2004) — connects program theory to evaluation design; the logic model specifies the causal chain that attribution analysis tests
  • 2 CFR 200.301 — requires relating financial data to performance accomplishments; credible attribution strengthens this linkage
  • SAMHSA GPRA (Government Performance and Results Act) Measures — federal performance measurement framework that increasingly expects outcome-level evidence