Logic Models
Module 5: Program Evaluation and Outcomes | Depth: Foundation | Target: ~2,500 words
Thesis: A logic model is the program’s theory of causation — inputs lead to activities, activities to outputs, outputs to short-term outcomes, short-term to long-term outcomes — and a program without an explicit logic model has an implicit one that no one has examined or tested.
The Operational Problem
A federally qualified health center in rural eastern Washington receives a three-year, $1.8M SAMHSA Primary and Behavioral Health Care Integration grant. The program’s goal: integrate behavioral health services into primary care to reduce emergency department utilization for behavioral health crises. The proposal narrative describes the plan clearly — hire behavioral health providers, screen patients for depression and substance use, provide brief interventions, and refer complex cases to specialty care. The funder awards the grant. The FQHC hires staff, configures the EHR, and begins screening.
Eighteen months later, screening rates are at 72%. The program has screened over 4,000 patients. It has identified 680 patients with positive PHQ-9 scores. It has documented 210 brief interventions. And ED utilization for behavioral health crises has not changed.
The program director reviews the numbers and cannot explain why. The screening is working. The identification is working. But the downstream effect — the reason the program was funded — is not materializing. The funder’s project officer asks a question the program director has not considered: “What is your theory for how screening leads to reduced ED utilization?”
The answer, when the team works through it, reveals three unstated assumptions. First, that patients with positive screens would accept a warm handoff to a behavioral health provider. (They did — at a 31% rate.) Second, that brief interventions would be sufficient for the severity levels being identified. (The average PHQ-9 score among positive screens was 17, indicating moderately severe depression — above the threshold where brief intervention alone is effective.) Third, that patients who needed specialty behavioral health care would successfully connect to it. (The referral completion rate to external providers was 22%, with an average wait time of 47 days.)
These three assumptions were the program’s theory of change. They were never stated, never diagrammed, never tested. They were embedded in the proposal narrative as implicit links between activities and outcomes. The logic model — had one been built and interrogated — would have made each assumption visible as a testable hypothesis. The program would have known, before spending $1.2M on 18 months of operations, that the causal chain had three weak links. It could have designed the program to strengthen those links or adjusted the theory entirely.
This is the problem a logic model solves. Not retroactively. Prospectively.
What a Logic Model Is
A logic model is a visual representation of a program’s theory of change — a diagram that maps the causal chain from what the program invests to what the program ultimately achieves. It is not a description of the program. It is a hypothesis about how the program works. Every arrow in the model is a claim: “this leads to that, and here is why.”
The W.K. Kellogg Foundation Logic Model Development Guide (2004) — the canonical reference in program evaluation — defines the logic model as “a systematic and visual way to present and share your understanding of the relationships among the resources you have to operate your program, the activities you plan, and the changes or results you hope to achieve.” The CDC Framework for Program Evaluation in Public Health (1999) incorporates logic models as a core element of evaluation design, embedding them in Step 2 (“describe the program”) as a prerequisite to all subsequent evaluation steps.
The critical distinction is between a logic model as description and a logic model as hypothesis. A descriptive logic model lists what the program does and what it hopes to achieve. A hypothesis-driven logic model specifies the causal mechanism connecting each stage to the next — and treats each connection as a claim that can be examined, challenged, and tested against evidence. Carol Weiss’s formulation of theory-based evaluation (1998) makes this explicit: the evaluator’s job is not merely to determine whether the program produced outcomes, but to understand the causal mechanism — the “theory of change” — that connects program activities to those outcomes. Huey-Tsyh Chen’s theory-driven evaluation (1990) extends this further, arguing that programs should be designed and evaluated based on an explicit program theory that specifies both the intervention mechanisms and the conditions under which those mechanisms operate.
A program without an explicit logic model has an implicit one. The FQHC described above had a theory of change: screen patients, identify problems, intervene, and ED utilization will decrease. That theory was never drawn, never examined, never stress-tested. It lived in the proposal narrative as a series of optimistic transitions between activities and outcomes. The logic model’s function is to drag that implicit theory into the open where it can be scrutinized.
The Five Components
The standard logic model contains five components, arranged left to right in a causal sequence. Each component must be specific enough to measure and concrete enough to test.
Inputs are the resources the program invests: funding, staff, facilities, equipment, partnerships, existing infrastructure, and organizational capacity. Inputs are not just what the grant provides. They include what the organization brings to the program — the pre-existing EHR system, the primary care workforce, the community partnerships, the institutional knowledge. A logic model that lists only grant-funded inputs ignores the organizational substrate the program depends on. When that substrate is weak — inadequate EHR configuration, insufficient primary care panel capacity, no existing behavioral health workflow — the program fails not because the grant inputs were insufficient but because the pre-existing inputs were not accounted for.
Activities are what the program does with those inputs: screen patients, train providers, configure EHR workflows, establish referral pathways, provide clinical services. Activities are the operational core of the program — the verbs. They must be specific enough that an observer could confirm whether they are occurring. “Improve care coordination” is not an activity. “Implement weekly multidisciplinary case conferences with primary care, behavioral health, and care management staff” is an activity. The test: could someone walk into the clinic and see this happening?
Outputs are the direct, countable products of activities: number of patients screened, number of providers trained, number of brief interventions delivered, number of referrals made. Outputs measure volume and reach. They answer the question “how much did we do?” Outputs are necessary but not sufficient. A program that produces impressive outputs but no outcomes has been busy without being effective. The confusion between outputs and outcomes is the single most common logic model error — and it is not a semantic quibble. “Trained 200 providers” is an output. “Screening rate increased from 15% to 72%” is a short-term outcome. The difference is the difference between counting activity and measuring change. SAMHSA’s Government Performance and Results Act (GPRA) measures explicitly require grantees to report at the outcome level, not merely the output level — a regulatory acknowledgment that outputs alone do not demonstrate program value.
Short-term outcomes are the immediate changes the program produces: changes in knowledge, behavior, clinical practice, or system performance that result from program activities. These are the first evidence that the program’s theory of change is working. Increased screening rates, reduced time-to-behavioral-health-access, improved PHQ-9 scores at follow-up, higher referral completion rates. Short-term outcomes should be observable within the first 12-18 months of a multi-year program. They are the leading indicators that predict whether long-term outcomes will materialize.
Long-term outcomes (sometimes labeled “impact”) are the ultimate goals: reduced ED utilization for behavioral health crises, improved chronic disease management for patients with comorbid behavioral health conditions, reduced total cost of care, improved population-level health metrics. Long-term outcomes typically require two to five years to materialize, often extending beyond the grant period. They are influenced by many factors beyond the program, making causal attribution difficult (a problem addressed in the companion page on causal attribution, 05-causal-attribution.md). The logic model must connect short-term outcomes to long-term outcomes through a plausible causal pathway — not a direct leap from activity to impact.
The Testable-Linkage Requirement
Each arrow in a logic model is a causal hypothesis. The arrow between “train primary care providers in PHQ-9 screening” and “increased screening rates” claims that provider knowledge is the barrier to screening. If the actual barrier is workflow design — if providers know how to administer the PHQ-9 but the visit workflow does not include a screening step, or the EHR does not have a discrete field for the score, or the medical assistant has no prompt to administer the instrument — then training will not increase screening rates regardless of its quality.
This is the testable-linkage requirement: every arrow must be accompanied by a stated assumption about why the upstream component produces the downstream component, and that assumption must be testable. The logic model is not a flowchart of hope. It is a chain of hypotheses, each subject to disconfirmation.
Weiss (1998) calls this the “implementation theory” — the theory about what activities will be implemented and how — combined with the “program theory” — the theory about how those activities will generate change. The two theories are distinct. A program can have a correct program theory (screening plus intervention does reduce crisis utilization) but a failed implementation theory (the screening workflow was never integrated into routine practice). The logic model must diagram both.
Testing the linkages means asking, for each arrow: What must be true for this connection to hold? What could prevent this connection from working? What evidence would tell us whether this connection is functioning?
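One way to make each arrow a first-class object rather than a drawing is to record it as structured data: the claim, the assumption behind it, the evidence that would test it, and the conversion rate that signals whether it is holding. The sketch below is a minimal illustration in Python; the field names, example arrows, and thresholds are assumptions for demonstration, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class Linkage:
    """One arrow in the logic model, stated as a testable hypothesis."""
    upstream: str          # component the arrow starts from
    downstream: str        # component the arrow points to
    assumption: str        # why upstream should produce downstream
    test: str              # evidence that would confirm or disconfirm it
    min_conversion: float  # threshold below which the linkage is treated as broken

# Example arrows from the FQHC integration program (thresholds are illustrative)
linkages = [
    Linkage(
        upstream="positive PHQ-9 screen",
        downstream="behavioral health assessment",
        assumption="same-visit warm handoff is offered and accepted",
        test="warm-handoff acceptance rate from EHR referral data",
        min_conversion=0.50,
    ),
    Linkage(
        upstream="behavioral health assessment",
        downstream="completed brief intervention",
        assumption="BH providers have open panel capacity",
        test="scheduled vs. completed visit counts per month",
        min_conversion=0.70,
    ),
]

for link in linkages:
    print(f"{link.upstream} -> {link.downstream}: test via {link.test}")
```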
For the FQHC behavioral health integration example:
- Inputs to activities. Does the program have enough BH providers to handle the volume of positive screens? If the screening rate is 72% of a panel of 8,000 patients, and the positive rate is 17%, the program is identifying roughly 980 patients who need follow-up. Two behavioral health providers with 30 clinical hours per week and 45-minute appointments can see approximately 80 patients per week, or 4,160 patient-visits per year. If each identified patient requires an average of 4 visits, the panel demand is 3,920 visits — 94% of available capacity. This leaves no buffer for no-shows, urgent cases, or administrative time. The arrow from “hire 2 BH providers” to “provide brief interventions to all positive screens” assumes sufficient capacity. The math says it is barely sufficient under ideal conditions and insufficient under realistic ones.
- Activities to outputs. Does screening actually lead to assessment? Only if the workflow supports a warm handoff. If the patient screens positive during a primary care visit and receives a printed referral to call a behavioral health number, the conversion rate from positive screen to assessment will be low — typically 20-35% in the literature. If the workflow includes a same-visit warm handoff where the primary care provider introduces the patient to a behavioral health provider who is physically present in the clinic, the conversion rate rises to 60-80% (Woltmann et al., 2012). The arrow from screening to assessment assumes a specific handoff mechanism. The logic model should name that mechanism.
- Outputs to short-term outcomes. Do brief interventions produce symptom improvement for the severity levels being identified? Brief interventions (1-3 sessions of structured problem-solving or motivational interviewing) are effective for mild-to-moderate depression (PHQ-9 scores 5-14) but show limited efficacy for moderately severe to severe depression (PHQ-9 scores 15+) without concurrent pharmacotherapy or ongoing therapy (Bower et al., 2006). If the program’s identified population has an average PHQ-9 of 17, the logic model’s arrow from “brief intervention” to “improved PHQ-9 scores” assumes a severity distribution that does not match the data. The program either needs to adjust its intervention model (add psychiatric consultation, extend treatment duration) or adjust its expected outcomes.
- Short-term to long-term outcomes. Does improved behavioral health treatment reduce ED utilization? The evidence supports this connection, but with important moderators. The IMPACT trial (Unutzer et al., 2002) demonstrated that collaborative care reduces depression severity and improves functioning. Studies of integrated behavioral health in FQHCs show reductions in total healthcare utilization (Reiss-Brennan et al., 2016). But the connection requires sustained treatment engagement over months, not a single brief intervention. The arrow from “improved PHQ-9 scores” to “reduced ED utilization” assumes treatment persistence and dose adequacy that a brief-intervention-only model may not provide. (A numeric sketch of the full screen-to-outcome funnel follows this list.)
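The reason weak links matter so much is arithmetic: the conversion rates along the chain multiply. A minimal sketch, using the rates reported in the opening scenario (72% screening, roughly 17% positive, 31% warm-handoff acceptance, 22% referral completion used as a proxy for treatment connection) plus one assumed improvement rate:

```python
# Screen-to-outcome funnel: each arrow keeps only a fraction of the prior stage.
# Rates marked "scenario" come from the opening case; the last is an assumption.
panel = 8000
stages = [
    ("screened",              0.72),  # scenario: 72% screening rate
    ("positive screen",       0.17),  # scenario: ~17% positive rate
    ("accepts warm handoff",  0.31),  # scenario: 31% acceptance
    ("connects to treatment", 0.22),  # scenario: 22% referral completion, used as a proxy
    ("symptoms improve",      0.50),  # assumption: half of treated patients improve
]

count = float(panel)
for stage, rate in stages:
    count *= rate
    print(f"{stage:>22}: {count:7.0f} patients ({rate:.0%} of prior stage)")
```

Under these rates, only a few dozen patients out of a panel of 8,000 reach the stage that could plausibly change ED utilization, which is why impressive upstream volume produced no downstream effect.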
Each of these linkage tests reveals a potential failure point. The logic model, by making the causal chain explicit, makes these failure points visible before the program discovers them operationally.
Common Failure Patterns
Decorative logic models. The logic model is created for the grant application, included in the proposal appendix, and never referenced again. It was drawn to satisfy a funder requirement, not to guide program design. The boxes contain the right categories. The arrows point in the right direction. But no one on the implementation team has examined the assumptions behind the arrows. The decorative logic model is a compliance artifact. It exists on paper but not in practice.
Skipped steps. The logic model jumps from inputs directly to long-term outcomes with no intermediate mechanism. “Funding + Staff” leads to “Improved Community Health” with nothing in between. This model cannot be tested because it makes no specific claim about how the transformation occurs. It is a wish, not a hypothesis. The Kellogg guide specifically warns against this pattern: “If you find it difficult to articulate the connections between your program components, it may be a sign that the connections are assumed rather than demonstrated.”
Breadth without specificity. The long-term outcome is “improve community health” or “reduce health disparities.” These are real goals, but they are too broad to be falsifiable within a program evaluation. A logic model must specify which dimension of health, for which population, measured how, changing by how much, over what period. “Reduce emergency department utilization for behavioral health crises by 25% among enrolled patients within 36 months, measured by claims data comparison to a pre-enrollment baseline” is a testable outcome. “Improve community health” is not.
Output-outcome confusion. The logic model lists “trained 200 providers” as an outcome. It is an output — a product of the training activity. The outcome is what changed because 200 providers were trained: screening behavior changed, diagnostic accuracy improved, referral patterns shifted. If training 200 providers produces no change in their clinical behavior, the output was achieved and the outcome was not. Programs that conflate the two report impressive output numbers while the outcomes they were funded to produce remain unmeasured.
Single-theory models. The logic model identifies one causal pathway and treats it as the only mechanism. A behavioral health integration program focuses exclusively on the clinical pathway (screen, assess, treat) while ignoring the organizational pathway (workflow redesign, scheduling changes, space allocation) and the patient pathway (engagement, retention, follow-up). Real programs operate through multiple parallel mechanisms, and failure in any one can undermine the others. A logic model with a single causal chain will miss the failure modes that originate in the chains it did not draw.
Healthcare Example: Full Logic Model Construction
Consider the SAMHSA-funded behavioral health integration program at the FQHC described above. The complete logic model:
Inputs: $1.8M SAMHSA grant funding over 3 years. 2 licensed behavioral health providers (LCSW). 0.5 FTE psychiatric consultant. EHR with configurable screening templates. Clinical protocols for PHQ-9/AUDIT-C screening, warm handoff, brief intervention, and specialty referral. Existing primary care workforce of 8 providers and 12 nursing/MA staff across 2 clinic sites. Community behavioral health referral network (3 agencies). Data analyst (0.5 FTE).
Activities: Screen all primary care patients ages 12+ for depression (PHQ-9) and substance use (AUDIT-C) at annual and new-patient visits. Assess all patients with positive screens using validated instruments and clinical interview. Provide brief interventions (1-4 sessions) for mild-to-moderate presentations. Provide warm handoff for same-day behavioral health contact. Refer patients with severe presentations to specialty behavioral health. Consult with psychiatric provider for medication management. Track referrals to completion. Conduct quarterly case conferences across primary care and behavioral health teams.
Outputs: Number of patients screened. Number of positive screens identified. Number of patients assessed following positive screen. Number of brief interventions delivered. Number of warm handoffs completed. Number of specialty referrals made. Number of psychiatric consultations. Number of case conferences held.
Short-term outcomes (6-18 months): PHQ-9 screening rate increases from 12% to 70%. Time from positive screen to behavioral health contact decreases from 28 days to same-day. Warm-handoff acceptance rate reaches 65%. Brief intervention completion rate reaches 80% for enrolled patients. PHQ-9 scores improve by 5+ points for 50% of patients with 6-month follow-up. AUDIT-C positive patients connected to treatment within 14 days.
Long-term outcomes (18-36 months): ED utilization for behavioral health crises decreases by 25% among enrolled patients. Inpatient behavioral health admissions decrease by 15%. Chronic disease management metrics (HbA1c, blood pressure control) improve for patients with comorbid behavioral health conditions. Total cost of care decreases for enrolled patients relative to matched comparison group.
Now test the arrows:
Does screening lead to assessment? Only if the workflow supports warm handoff. The model assumes same-day BH contact, which requires that a behavioral health provider is on-site and has open appointment slots at the time of the positive screen. If BH providers are scheduled with back-to-back appointments, no warm-handoff capacity exists. The arrow requires operational slack in BH scheduling — at least 4-6 open slots per day per provider reserved for warm handoffs.
Does assessment lead to treatment? Only if BH providers have panel capacity. With 980 expected positive screens per year and an average treatment course of 4 sessions, the program needs 3,920 treatment visits per year. Two providers at 30 clinical hours per week can provide approximately 4,160 visits — if no-show rates are zero, administrative burden is zero, and every patient needs exactly the average number of sessions. Real-world no-show rates in FQHC behavioral health are 20-30%. Effective capacity is closer to 2,900-3,300 visits. The arrow from assessment to treatment assumes capacity that the staffing model may not provide.
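The capacity arithmetic in this paragraph is worth making explicit, because it is the kind of check a team can run before hiring. A minimal sketch, assuming the staffing figures stated above and a 0-30% no-show range:

```python
# Demand vs. effective supply for BH treatment visits (figures from the text above).
positive_screens_per_year = 980
visits_per_patient = 4
demand = positive_screens_per_year * visits_per_patient   # 3,920 visits

providers = 2
clinical_hours_per_week = 30
visit_minutes = 45
weeks_per_year = 52
nominal = providers * clinical_hours_per_week * (60 / visit_minutes) * weeks_per_year

for no_show_rate in (0.00, 0.20, 0.30):
    effective = nominal * (1 - no_show_rate)
    verdict = "sufficient" if effective >= demand else "SHORTFALL"
    print(f"no-show {no_show_rate:.0%}: {effective:,.0f} visits vs. demand {demand:,} -> {verdict}")
```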
Does treatment reduce ED utilization? Only if patients with the highest ED utilization for behavioral health crises are actually the patients being identified and treated by the program. If the frequent ED utilizers are patients who do not attend primary care (and therefore are never screened), the entire upstream chain — no matter how well it functions — will not affect the downstream outcome. The logic model must specify the mechanism by which the clinical pathway reaches the population whose ED utilization the program intends to reduce.
Each of these tests generates a design decision. The logic model is not a reporting artifact. It is an engineering tool.
The Product Owner Lens
What is the funding/compliance/execution problem? Programs operate on unstated theories of change that are never tested until outcomes fail to materialize — at which point most of the grant period and budget have been consumed. Funders increasingly require logic models, but most are decorative rather than functional.
What mechanism explains the operational bottleneck? The causal assumptions linking activities to outcomes are implicit, distributed across proposal narratives and staff mental models, and never consolidated into a testable structure. Without explicit linkages, the program cannot identify which assumptions are weakest or where intervention is needed until outcomes data — which arrives late — reveals the failure.
What controls or workflows improve it? Build the logic model at program design, not as a proposal appendix. Assign each arrow an assumption statement and a test. Review the logic model quarterly with outcome data mapped to each linkage. When a linkage is not functioning (screening is high but warm-handoff rates are low), treat it as a design problem requiring intervention, not a performance problem requiring exhortation.
What should software surface? A logic model visualization with real-time data mapped to each component: input status (staff hired, equipment procured), activity volume (screenings this month), output counts, and outcome trend lines. Arrow-level health indicators showing conversion rates at each linkage (screen-to-assessment rate, assessment-to-treatment rate, treatment-to-improvement rate). Alerts when a conversion rate drops below threshold — indicating a broken linkage that will prevent downstream outcomes regardless of upstream volume.
What metric reveals risk earliest? The conversion rate at each arrow in the logic model. A program screening 500 patients per month with a 31% warm-handoff rate has a broken linkage between screening and assessment. That 31% is visible in month 3. The ED utilization outcome will not be visible for 18-24 months. The conversion rate at the weakest linkage is the earliest signal that the program’s theory of change is not functioning — and it is computable from operational data the program is already collecting.
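A minimal monitoring sketch, assuming monthly counts the program already collects in its EHR or tracking spreadsheet (the counts and thresholds are illustrative, not funder requirements):

```python
# Arrow-level conversion monitoring: flags a broken linkage months before
# outcome data could. Counts and thresholds are illustrative assumptions.
monthly_counts = {
    "screened": 500,
    "positive_screen": 85,
    "warm_handoff_accepted": 26,
    "treatment_initiated": 18,
}

# (upstream stage, downstream stage, minimum acceptable conversion rate)
linkage_thresholds = [
    ("screened", "positive_screen", 0.10),
    ("positive_screen", "warm_handoff_accepted", 0.50),
    ("warm_handoff_accepted", "treatment_initiated", 0.60),
]

for upstream, downstream, threshold in linkage_thresholds:
    rate = monthly_counts[downstream] / monthly_counts[upstream]
    status = "BROKEN LINKAGE" if rate < threshold else "ok"
    print(f"{upstream} -> {downstream}: {rate:.0%} (threshold {threshold:.0%}) {status}")
```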
Warning Signs
The logic model was drawn once and never referenced. If the program team cannot locate their logic model or has not discussed it since the proposal was submitted, the model is decorative. The implicit theory of change has diverged from whatever was drawn, and no one is tracking whether the causal assumptions hold.
Outputs are reported as outcomes. If the program’s progress reports highlight “screened 4,000 patients” and “delivered 210 brief interventions” without reporting whether screening rates, symptom scores, or utilization patterns changed, the program is measuring activity, not impact. The logic model, if it exists, has the wrong components in the outcome boxes.
No one can articulate the mechanism. Ask the program director: “Why do you believe that screening patients will reduce ED utilization?” If the answer is a restatement of the goal (“because that’s what we’re trying to do”) rather than a causal pathway (“because screening identifies patients in crisis before they reach the ED, and connecting them to treatment reduces the crisis episodes that drive ED visits”), the theory of change has not been examined.
The same logic model is used for a different population or setting. A logic model developed for an urban FQHC with co-located behavioral health providers and robust public transit is not valid for a rural clinic where patients drive 45 minutes, behavioral health providers are available by telehealth only, and broadband reliability is variable. The causal assumptions are population- and setting-specific. A transplanted logic model carries untested assumptions from its origin context.
The arrows skip steps. If the logic model connects “hire staff” directly to “reduce ED utilization” without specifying the intermediate mechanism, three to five causal assumptions are hidden in that single arrow. Each hidden assumption is a potential failure point that no one is monitoring.
Integration Hooks
Human Factors Module 4 (Heuristics and Biases). Confirmation bias is the primary threat to logic model integrity. The team that designed the program believes in its theory of change — that is why they proposed it. When early data arrives, confirmation bias operates on two fronts. First, the team attends to data that supports the theory (high screening rates) and discounts data that challenges it (low warm-handoff rates). Second, the team interprets ambiguous evidence in favor of the existing theory: a 31% warm-handoff rate is framed as “we’re getting started” rather than “our core mechanism is not functioning.” The debiasing strategies from HF M4 apply directly: pre-commit to specific falsification criteria (“if warm-handoff rate is below 50% at month 9, we will redesign the workflow rather than wait for improvement”), assign a team member the explicit role of testing the arrows rather than defending them, and structure quarterly reviews around linkage-level data rather than aggregate output counts. Teams rarely test the arrows they believe in most — and those are the arrows most likely to contain the failure.
Operations Research Module 6 (Simulation). A logic model specifies the causal chain; simulation can test whether the chain will function under realistic operating conditions before the program deploys. The capacity question in the FQHC example — whether 2 BH providers can handle the patient volume generated by a 70% screening rate — is a queueing and simulation problem. A discrete-event simulation that models patient flow from screening through warm handoff through treatment, with realistic no-show rates, appointment durations, and provider scheduling constraints, can identify capacity bottlenecks and broken linkages before they manifest in operations. The logic model provides the structure; the simulation provides the quantitative stress test. A logic model that has survived simulation testing is a fundamentally different planning tool than one that has only survived narrative review.
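A full discrete-event simulation is beyond this page, but even a minimal weekly stochastic sketch makes the point. The version below is not a true discrete-event model; it assumes the staffing and volume figures used earlier, a 25% no-show rate, and compares the implicit assumption (every positive screen enters treatment) with the logic model's 65% warm-handoff target. All parameters are illustrative.

```python
import random

# Weekly stress test of the screen -> handoff -> treatment chain.
# All parameters are illustrative assumptions, not program data.
WEEKS = 52
WEEKLY_POSITIVE_SCREENS = 980 / 52           # ~19 positive screens per week
VISITS_PER_PATIENT = 4
WEEKLY_SLOTS = 2 * 30 * (60 / 45)            # 2 providers, 30 clinical hrs, 45-min visits
NO_SHOW_RATE = 0.25

def run_year(handoff_acceptance: float, seed: int = 1) -> float:
    """Return the end-of-year backlog of undelivered treatment visits."""
    rng = random.Random(seed)
    backlog = 0.0
    for _ in range(WEEKS):
        arrivals = max(0, round(rng.gauss(WEEKLY_POSITIVE_SCREENS, 4)))
        new_patients = sum(rng.random() < handoff_acceptance for _ in range(arrivals))
        backlog += new_patients * VISITS_PER_PATIENT
        backlog -= min(backlog, WEEKLY_SLOTS * (1 - NO_SHOW_RATE))
    return backlog

for acceptance in (1.00, 0.65):
    print(f"handoff acceptance {acceptance:.0%}: "
          f"year-end backlog ~ {run_year(acceptance):.0f} treatment visits")
```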
Key Frameworks and References
- W.K. Kellogg Foundation Logic Model Development Guide (2004) — the standard reference for logic model construction; defines the five-component model (inputs, activities, outputs, outcomes, impact) and provides templates for program design and evaluation
- CDC Framework for Program Evaluation in Public Health (1999) — embeds logic models in Step 2 of the six-step evaluation framework; establishes utility, feasibility, propriety, and accuracy as evaluation standards
- Weiss, C.H. (1998), Evaluation: Methods for Studying Programs and Policies — foundational text on theory-based evaluation; distinguishes implementation theory from program theory and argues that evaluations must test the causal mechanisms, not merely the outcomes
- Chen, H.T. (1990), Theory-Driven Evaluations — argues that program evaluation should be guided by explicit program theory; introduces the distinction between normative theory (what should happen) and causative theory (what mechanism produces the outcome)
- Wholey, J.S., Hatry, H.P., & Newcomer, K.E. (Eds.), Handbook of Practical Program Evaluation — comprehensive reference on evaluation methods including logic model development, indicator selection, and quasi-experimental design
- SAMHSA GPRA (Government Performance and Results Act) Measures — federal performance measurement requirements distinguishing output-level from outcome-level reporting
- Unutzer, J. et al. (2002), Collaborative Care Management of Late-Life Depression (IMPACT Trial) — landmark RCT demonstrating that collaborative care improves depression outcomes; the evidentiary basis for many BH integration logic models
- Woltmann, E. et al. (2012) — meta-analysis of collaborative care models; provides evidence on warm-handoff conversion rates and collaborative care effectiveness
- Bower, P. et al. (2006) — Cochrane review of collaborative care for depression; establishes the severity thresholds at which brief intervention is and is not sufficient
- Reiss-Brennan, B. et al. (2016) — Intermountain Healthcare study demonstrating that integrated behavioral health reduces total cost of care