Simulation Foundations
Module 6: Simulation and Scenario Analysis | Depth: Foundation | Target: ~2,500 words
Thesis: Simulation is the analytical method of last resort and first necessity — used when the system is too complex for closed-form solutions, which is most real healthcare systems.
The Operational Problem
A regional health system in central Oregon is considering consolidating two emergency departments — one at a 25-bed critical access hospital, one at a 120-bed community hospital 35 miles away — into a single, larger ED at the community hospital. The case looks straightforward on paper: combined volume supports better specialty coverage, eliminates duplicated on-call costs, and improves the case mix for trauma designation. Leadership asks the operations team to model the impact.
The operations team reaches for queueing theory. But the problem resists closed-form treatment. Ambulance routing is distance-dependent and time-varying: daytime BLS crews cover different zones than nighttime ALS units. Triage acuity at the two sites follows different distributions because the catchment populations differ. Boarding delays in the combined ED depend on inpatient census, which depends on surgical scheduling, which varies by day of week. Transfers from the CAH currently account for 15% of the community hospital’s low-acuity volume, and eliminating the CAH ED redirects that volume into the consolidated site at unpredictable times. Physician productivity degrades during overnight hours per the fatigue curves described in Human Factors Module 2. And a subset of the CAH’s catchment population — roughly 3,200 people in two small towns — will now face a 40-minute ambulance transport instead of a 12-minute one.
No queueing formula handles this. No linear program captures it. The interactions between ambulance zones, triage distributions, boarding dynamics, staffing patterns, and transport times are too numerous and too nonlinear. This is the problem simulation was built to solve.
When the team builds a discrete-event simulation and runs 500 replications across a simulated year, the results are not what the spreadsheet predicted. Consolidation reduces average wait-to-provider by 11 minutes across the combined population — a real gain. But it creates a 90-minute transport gap for the two small towns during overnight hours when ALS coverage shifts to the opposite zone. For STEMI and stroke patients in those towns, the additional transport time pushes them past treatment windows that directly affect mortality. The simulation quantifies this: under consolidation, the expected annual count of time-sensitive cases exceeding guideline thresholds increases from 2 to 9. That finding — invisible to any aggregate analysis, visible only through a model that tracks individual patient trajectories through a system with interacting stochastic components — changes the decision.
Why Analytical Methods Have Limits
The preceding modules in this discipline built a toolkit of closed-form methods. Queueing theory (Module 2) provides exact formulas for systems with Poisson arrivals, exponential service times, and steady-state behavior. Optimization (Module 3) finds provably best solutions when objectives and constraints are expressible as mathematical programs. Network flow (Module 4) computes max-flow and shortest paths on directed graphs. These methods are powerful precisely because they are exact: they produce answers that are provably correct given their assumptions.
The problem is the assumptions. Closed-form solutions require the system to be simple enough that mathematics can close on it. Each method has specific requirements:
Queueing formulas require stationary arrival rates, known service distributions (usually exponential or phase-type), independent servers, and a single queue or a tractable network of queues. Real healthcare systems violate every one of these. Arrivals are time-varying (ED volumes peak between 10am and 10pm, per Welch et al., 2011). Service times are rarely exponential — they are often lognormal or empirically irregular due to case-mix variation. Servers are not independent: a nurse assisting in a trauma resuscitation is unavailable for other patients, and that unavailability is correlated with the very demand spikes that stress the system. And the queues interact: ED boarding depends on inpatient discharges, which depend on hospitalist rounding patterns, which depend on census.
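The time-varying arrival problem has a standard computational answer that simulation uses directly: nonstationary Poisson arrivals can be generated by Lewis-Shedler thinning. A minimal stdlib Python sketch, with an invented daytime-peaking ED rate curve (the rates are illustrative, not fitted to any real ED):

```python
import math
import random

def nhpp_arrivals(rate_fn, t_end, rate_max, seed=7):
    """Lewis-Shedler thinning: propose candidate arrivals at the peak
    rate, then accept each with probability rate(t) / rate_max, yielding
    a nonstationary Poisson arrival stream."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_max)          # candidate at the peak rate
        if t >= t_end:
            return arrivals
        if rng.random() < rate_fn(t) / rate_max:
            arrivals.append(t)                  # accepted: a real arrival

def ed_rate(t):
    """Illustrative ED arrival rate (patients/hour): baseline 4/hr,
    rising toward 7/hr between 10am and 10pm."""
    return 4.0 + 3.0 * math.sin(math.pi * (t - 10.0) / 12.0) if 10.0 <= t <= 22.0 else 4.0

day = nhpp_arrivals(ed_rate, t_end=24.0, rate_max=7.0)
print(len(day), "simulated arrivals in one day")
```

Feeding a stationary-rate formula the daily average of this curve would miss exactly the afternoon congestion the thinned stream reproduces.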
Optimization models require that the objective function and constraints be expressible in a tractable form — linear, integer, or convex. When the system involves stochastic elements (random patient arrivals, uncertain length of stay), optimization either requires expected-value approximations (which can badly misrepresent skewed distributions) or stochastic programming formulations that grow computationally expensive. More fundamentally, optimization finds the best static decision. It does not simulate the dynamic unfolding of a system over time.
Network models capture topology and steady-state flows but not the temporal dynamics of congestion, queue buildup, and time-dependent routing.
Simulation fills the gap. It does not require the system to be mathematically tractable. It requires only that the system’s components and their interactions be specifiable — that you can describe the rules governing arrivals, routing, service, and departure, even when those rules are complex, conditional, and time-varying. The simulation then runs the system forward in time, letting emergent behavior arise from the interaction of components rather than being derived from closed-form expressions.
This is why simulation is “the method of last resort and first necessity” — a characterization that echoes Jerry Banks, John Carson, Barry Nelson, and David Nicol in their foundational textbook Discrete-Event System Simulation (5th edition, 2010). Last resort because you should use an analytical method when one exists: it is faster, cheaper, and produces provably exact results. First necessity because most real systems — certainly most real healthcare delivery systems — exceed the assumptions of analytical methods within the first ten minutes of honest modeling.
The Simulation Taxonomy
Four simulation paradigms serve different analytical purposes. They are not interchangeable, and choosing the wrong one wastes months.
Discrete-Event Simulation (DES)
DES models a system as a sequence of events — arrivals, service starts, transfers, departures — that change the system’s state at discrete points in time. Between events, nothing changes. Entities (patients, orders, referrals) move through a network of resources (beds, providers, rooms), competing for service according to defined rules. Time advances from event to event, skipping idle intervals entirely.
DES is the workhorse of healthcare operations modeling. Its entity-centric structure maps naturally to patient flow: a patient arrives at triage, waits for a bed, is assessed by a nurse, waits for a physician, receives treatment, and is either discharged or admitted. Each step has a duration drawn from a probability distribution fitted to real data. Each resource has a finite capacity. The emergent behavior — wait times, throughput, resource utilization, bottleneck locations — arises from the interaction of thousands of individual entity trajectories.
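The event-driven mechanics fit in a few dozen lines. This sketch models a single provider with illustrative rates and exponential distributions for brevity (real inputs would be fitted): a time-ordered event heap, state that changes only at events, and a clock that jumps from event to event:

```python
import heapq
import random

def simulate_ed(n_patients, arrival_rate, service_rate, seed=42):
    """Toy single-provider DES: a time-ordered event heap, state that
    changes only at events, and a clock that jumps between events."""
    rng = random.Random(seed)
    events, t = [], 0.0
    for pid in range(n_patients):               # pre-schedule all arrivals
        t += rng.expovariate(arrival_rate)
        heapq.heappush(events, (t, "arrival", pid))
    waiting, arrived_at, waits = [], {}, []
    provider_free = True
    while events:
        now, kind, pid = heapq.heappop(events)  # jump to the next event
        if kind == "arrival":
            arrived_at[pid] = now
            waiting.append(pid)
        else:                                   # "departure" frees the provider
            provider_free = True
        if provider_free and waiting:           # start the next service
            nxt = waiting.pop(0)
            waits.append(now - arrived_at[nxt])
            provider_free = False
            heapq.heappush(events, (now + rng.expovariate(service_rate),
                                    "departure", nxt))
    return sum(waits) / len(waits)

print(f"mean wait: {simulate_ed(5000, arrival_rate=0.8, service_rate=1.0):.2f}")
```

Production models use a platform (SimPy, Arena, AnyLogic) rather than a hand-rolled event loop, but the loop above is the engine those platforms hide.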
When to use DES: Operational flow problems where individual entity trajectories matter. ED patient flow, surgical suite scheduling, infusion center throughput, outpatient clinic capacity, care transitions between units. Any system where you need to track how specific patients (or classes of patients) experience the system over time.
Canonical reference: Banks, Carson, Nelson, and Nicol, Discrete-Event System Simulation. Averill Law’s Simulation Modeling and Analysis (6th edition, 2024) provides the statistical foundations for output analysis.
Monte Carlo Simulation
Monte Carlo methods generate thousands or millions of random samples from specified probability distributions to estimate outcomes that are too complex for analytical computation. Unlike DES, Monte Carlo simulation does not model a process unfolding over time. It models uncertainty in a calculation.
In healthcare, Monte Carlo is the right tool for financial risk analysis, demand forecasting under uncertainty, and any problem that asks “given what we don’t know, what is the range of likely outcomes?” A grant program with uncertain enrollment, variable per-member cost, and a fixed budget is a natural Monte Carlo problem: draw enrollment from a distribution, draw per-member cost from a distribution, multiply, and repeat 10,000 times. The result is not a point estimate (“we expect to spend $2.1M”) but a probability distribution (“there is a 15% chance we exceed $2.4M and trigger a budget shortfall”).
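The grant-program example can be sketched directly. Every distribution parameter below is an assumption chosen for illustration, not drawn from real program data:

```python
import random

def budget_risk(n_trials=10_000, budget=2_400_000, seed=0):
    """Monte Carlo over a calculation, not a process: sample uncertain
    enrollment and per-member cost, multiply, and tally outcomes."""
    rng = random.Random(seed)
    totals, overruns = [], 0
    for _ in range(n_trials):
        enrollment = max(rng.normalvariate(900, 120), 0)  # assumed member count
        per_member = rng.lognormvariate(7.7, 0.25)        # assumed $/member, right-skewed
        total = enrollment * per_member
        totals.append(total)
        overruns += total > budget
    return sum(totals) / n_trials, overruns / n_trials

mean_spend, p_overrun = budget_risk()
print(f"expected spend ${mean_spend:,.0f}; P(exceed budget) = {p_overrun:.1%}")
```

Note that the overrun probability is the deliverable; the mean alone would hide the right tail that the lognormal cost term creates.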
When to use Monte Carlo: Risk quantification, sensitivity analysis, financial modeling under uncertainty, probabilistic cost estimation. Any problem where the question is “what is the distribution of possible outcomes?” rather than “how does the process unfold?”
Canonical reference: The term originates with Stanislaw Ulam and John von Neumann’s work at Los Alamos in the 1940s. Nicholas Metropolis named the method. Rubinstein and Kroese, Simulation and the Monte Carlo Method (3rd edition, 2017), is the modern standard.
System Dynamics (SD)
System dynamics models aggregate stocks and flows rather than individual entities. Developed by Jay Forrester at MIT in the 1950s (published in Industrial Dynamics, 1961), SD represents systems as interconnected stocks (accumulations: patients waiting, beds occupied, staff employed) and flows (rates: admission rate, discharge rate, hiring rate) governed by feedback loops.
SD excels at policy-level analysis over long time horizons. A workforce planning model that tracks nurse supply (stock), hiring and attrition (flows), training pipeline delays, and the feedback between workload, burnout, and turnover is naturally a system dynamics model. The individual nurse does not matter; the aggregate dynamics of the workforce pool do.
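The workforce example reduces to one stock, two flows, and a feedback loop. A minimal Euler-integration sketch with invented parameters (no individual nurses anywhere, only aggregate rates):

```python
def nurse_workforce(months=60, dt=0.25):
    """Stock-and-flow sketch: nurses (stock), hiring and attrition (flows),
    with a feedback loop in which understaffing raises workload and
    workload raises attrition. All parameters are illustrative."""
    nurses, target = 400.0, 450.0
    history = []
    for _ in range(int(months / dt)):
        workload = target / max(nurses, 1.0)       # >1 means understaffed
        attrition = nurses * 0.015 * workload      # burnout-scaled monthly attrition
        hiring = 0.2 * (target - nurses) + 6.0     # gap-closing hiring plus pipeline
        nurses += dt * (hiring - attrition)        # Euler step on the stock
        history.append(nurses)
    return history

trajectory = nurse_workforce()
print(f"nurses after 5 years: {trajectory[-1]:.0f}")
```

Because attrition scales with workload, the stock settles below target even with gap-closing hiring, which is the kind of feedback-driven equilibrium SD exists to expose.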
When to use SD: High-level policy analysis, workforce planning, epidemic modeling, long-horizon strategic questions. Problems where individual entity trajectories are irrelevant but aggregate accumulations and feedback dynamics drive outcomes.
Canonical reference: Forrester, Industrial Dynamics; Sterman, Business Dynamics: Systems Thinking and Modeling for a Complex World (2000).
Agent-Based Modeling (ABM)
Agent-based models populate the simulation with autonomous agents — individual entities with rules governing their behavior, perception, and interaction. Unlike DES entities, which follow predefined routes through a process, agents make decisions based on local information and learned or programmed rules. Emergent system-level patterns arise from agent-to-agent interaction.
In healthcare, ABM is appropriate when individual behavior and social dynamics drive outcomes: disease transmission modeling (agents infect other agents based on proximity and behavior), patient choice modeling (agents select providers based on distance, wait time, insurance, and word-of-mouth), or care-seeking behavior (agents decide whether to present to the ED, urgent care, or primary care based on symptom severity, time of day, and prior experience).
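A minimal transmission sketch shows the paradigm: every agent carries its own state, interacts only locally, and the epidemic curve is emergent rather than specified by any equation. All parameters are invented for illustration, not calibrated to any disease:

```python
import random

def abm_outbreak(n_agents=500, steps=40, contacts=8, p_transmit=0.06,
                 p_recover=0.12, seed=3):
    """Minimal agent-based sketch: each agent holds an S/I/R state, meets
    a few random others each step, and infection spreads only through
    those local encounters."""
    rng = random.Random(seed)
    state = ["S"] * n_agents
    for i in rng.sample(range(n_agents), 5):        # seed five infections
        state[i] = "I"
    for _ in range(steps):
        infected = [i for i, s in enumerate(state) if s == "I"]
        for i in infected:
            for j in rng.sample(range(n_agents), contacts):
                if state[j] == "S" and rng.random() < p_transmit:
                    state[j] = "I"                  # agent-to-agent transmission
            if rng.random() < p_recover:
                state[i] = "R"
    return {s: state.count(s) for s in "SIR"}

print(abm_outbreak())
```

Swapping the uniform-random contact step for a social network or geographic mixing rule is what separates a toy like this from a research-grade ABM.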
When to use ABM: Problems where heterogeneous individual behavior and agent-to-agent interaction drive system outcomes. Disease spread, patient choice and navigation, provider practice pattern variation, organizational behavior.
Canonical reference: Epstein and Axtell, Growing Artificial Societies (1996); Macal and North, “Tutorial on Agent-Based Modelling and Simulation” (2005).
The Simulation Workflow
Simulation that produces trustworthy results follows a disciplined workflow. Stewart Robinson’s Simulation: The Practice of Model Development and Use (2014) provides the most practitioner-accessible treatment of this process. The stages are:
1. Conceptual modeling. Define the problem, the system boundary, the level of detail, and the assumptions before writing any code. Robinson’s conceptual modeling framework emphasizes that this is the highest-leverage stage: a simulation of the wrong system, at the wrong level of detail, produces sophisticated garbage. The conceptual model should answer: What decision does this simulation support? What entities flow through the system? What resources constrain them? What outputs will distinguish between alternatives? What can we safely leave out?
2. Data collection and input modeling. Identify the probability distributions that drive the simulation: inter-arrival times, service durations, routing probabilities, resource availability schedules. This requires fitting distributions to empirical data — not assuming exponential because it is convenient. Law and Kelton’s input modeling methodology prescribes goodness-of-fit testing (chi-square, Kolmogorov-Smirnov, Anderson-Darling) to validate distributional assumptions. In healthcare, the most common input modeling failures are using mean values instead of distributions (which eliminates the variability that drives system behavior) and using system-wide averages that mask patient-class differences (a blended ED service time average that combines ESI-2 traumas and ESI-5 lacerations is useless).
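As a sketch of why goodness-of-fit testing matters, here is a bare-bones one-sample Kolmogorov-Smirnov statistic in stdlib Python (real input modeling would use a statistics package with proper critical values): lognormally distributed service times are fit far better by a fitted lognormal than by the convenient exponential:

```python
import math
import random

def ks_statistic(sample, cdf):
    """One-sample KS statistic: the largest gap between the empirical
    CDF of the sample and a candidate theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, abs((i + 1) / n - f), abs(f - i / n))
    return d

rng = random.Random(11)
# Hypothetical service times: lognormal, as is typical of healthcare durations
data = [rng.lognormvariate(3.0, 0.5) for _ in range(2000)]
mean = sum(data) / len(data)

def exp_cdf(x):                      # exponential with the matched mean
    return 1.0 - math.exp(-x / mean)

mu = sum(math.log(x) for x in data) / len(data)
sigma = (sum((math.log(x) - mu) ** 2 for x in data) / len(data)) ** 0.5

def ln_cdf(x):                       # lognormal fitted to the log-moments
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * 2 ** 0.5)))

print(f"KS vs exponential: {ks_statistic(data, exp_cdf):.3f}")
print(f"KS vs lognormal:   {ks_statistic(data, ln_cdf):.3f}")
```

The exponential matches the mean exactly and still fits badly, which is precisely the "convenient exponential" failure the workflow step warns against.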
3. Model building. Implement the conceptual model in simulation software. Common platforms include AnyLogic, Simul8, Arena, and FlexSim for DES; Vensim and Stella for system dynamics; NetLogo and AnyLogic for ABM. The choice of tool matters less than the fidelity to the conceptual model.
4. Verification. Confirm that the computer model implements the conceptual model correctly. Does the code do what you intended? This is debugging in the software engineering sense — tracing entity paths, checking resource logic, validating that distributions generate the intended random variates. Verification asks: “Did we build the model right?”
5. Validation. Confirm that the model adequately represents the real system for the purpose at hand. Run the simulation with historical inputs and compare outputs to observed system performance. If the model predicts mean ED wait times of 35 minutes and the real system averages 34, the model has face validity for wait-time analysis. If it predicts 60 minutes, something is structurally wrong. Validation asks: “Did we build the right model?” Sally Brailsford, in her review of healthcare simulation (Brailsford et al., “An Analysis of the Academic Literature on Simulation and Modelling in Health Care,” Journal of Simulation, 2009), noted that validation is the step most frequently skipped in healthcare simulation studies — and the step whose absence most thoroughly undermines the results. An unvalidated simulation is a random number generator with a narrative.
6. Experimentation. Run the simulation under alternative scenarios — different staffing levels, different routing rules, different demand assumptions — and compare outputs. Each scenario requires multiple replications (typically 30-500, depending on output variability) to generate statistically meaningful comparisons. Output analysis follows the methods in Law’s textbook: confidence intervals for mean performance, warm-up period analysis to exclude transient behavior, and common random numbers to reduce variance when comparing alternatives.
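The replication arithmetic is mechanical. This sketch uses the Lindley waiting-time recursion for a single queue as a stand-in for a full simulation (rates are illustrative), running 30 independent replications and reporting a t-based 95% confidence interval:

```python
import math
import random
import statistics

def one_replication(seed, arrival_rate=0.8, service_rate=1.0, n=2000):
    """One replication of a single-queue mean-wait estimate via the
    Lindley recursion: W_{k+1} = max(0, W_k + S_k - A_{k+1})."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(n):
        total += w
        s = rng.expovariate(service_rate)       # this customer's service time
        a = rng.expovariate(arrival_rate)       # gap to the next arrival
        w = max(0.0, w + s - a)
    return total / n

# 30 replications -> mean and 95% CI (t = 2.045 for 29 degrees of freedom)
reps = [one_replication(seed) for seed in range(30)]
m = statistics.mean(reps)
half = 2.045 * statistics.stdev(reps) / math.sqrt(len(reps))
print(f"mean wait {m:.2f} ± {half:.2f} over {len(reps)} replications")
```

Each replication starts from an empty system, so a careful study would also truncate a warm-up period before averaging, per the output-analysis methods cited above.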
7. Analysis and communication. Translate simulation outputs into decision-relevant findings. This is where simulation returns to the operational problem that motivated it. The output is not “the simulation says wait times are 27 minutes.” The output is “consolidating the two EDs reduces average wait-to-provider by 11 minutes but creates 7 additional time-critical transport exceedances per year in the CAH catchment — here is the tradeoff the leadership team needs to evaluate.”
Warning Signs of Bad Simulation
Simulation is seductive because it produces numbers that look precise. The following patterns indicate a simulation that is generating confident fiction:
No validation against historical data. If the simulation has never been compared to real system performance, its outputs are uncalibrated. It may be structurally correct, structurally wrong, or anything in between. There is no way to know.
Mean-value inputs. A simulation fed average arrival rates, average service times, and average resource availability is a deterministic model wearing a stochastic disguise. It will systematically underestimate wait times, queue lengths, and resource contention because it eliminates the variability that creates those phenomena. The utilization-delay curve (Module 2) demonstrates why: at 85% utilization, a deterministic model predicts almost no queueing, while realistic arrival and service variability produces long waits that grow explosively as utilization approaches 100%.
Single replication. One run of a stochastic simulation is one sample from a distribution. Drawing conclusions from a single replication is equivalent to estimating a population mean from a sample of n=1. The result is literally a random number. Multiple replications with statistical analysis of the output distribution are not optional.
Excessive detail without purpose. A simulation that models the color of the waiting room chairs is not more accurate than one that omits this detail — it is slower to build, harder to validate, and more likely to contain errors. Every detail in a simulation should be traceable to the decision it supports. Robinson calls this the principle of parsimony: include only what matters for the stated purpose.
No sensitivity analysis. If you have not tested how outputs change when inputs vary, you do not know whether your results are robust or an artifact of one specific set of assumptions. Key inputs — arrival rates, service times, resource counts — should be perturbed to identify which ones the conclusions depend on.
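The mean-value-input failure is easy to demonstrate with the Lindley waiting-time recursion: the same arrival and service rates, fed as constants versus as distributions, give qualitatively different answers. The rates here are illustrative:

```python
import random

def mean_wait(inter_arrival, service, n=20000):
    """Lindley recursion for waiting time in a single FIFO queue:
    W_{k+1} = max(0, W_k + S_k - A_{k+1})."""
    w, total = 0.0, 0.0
    for _ in range(n):
        total += w
        w = max(0.0, w + service() - inter_arrival())
    return total / n

rng = random.Random(5)
# Identical rates both ways: one arrival per unit time, 0.85 units of
# service per arrival -> 85% utilization
det = mean_wait(lambda: 1.0, lambda: 0.85)              # mean-value inputs
sto = mean_wait(lambda: rng.expovariate(1.0),           # realistic variability
                lambda: rng.expovariate(1.0 / 0.85))
print(f"deterministic inputs: {det:.2f}   stochastic inputs: {sto:.2f}")
```

The deterministic version reports zero waiting at 85% utilization; the stochastic version, with the same averages, reports waits several times the service duration.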
Simulation as the Integration Method
Simulation occupies a unique position in the CapabilityGraph because it is the only analytical method that can combine insights from multiple disciplines into a single working model. A DES model of an ED can simultaneously incorporate:
- Queueing dynamics (Module 2): patients arriving stochastically, waiting for resources, being served with variable durations
- Optimization constraints (Module 3): staffing levels determined by an optimization model, shift schedules subject to labor constraints
- Network topology (Module 4): patient routing through a referral network, transfer protocols between facilities
- Scheduling rules (Module 5): appointment templates, block schedules, provider shift patterns
- Human factors (Human Factors discipline): fatigue-dependent service time degradation, error rates that increase with cognitive load, handoff failure probabilities at shift changes
- Workforce dynamics (Workforce discipline): provider availability affected by turnover, recruitment pipeline delays, cross-training coverage rules
- Financial constraints (Public Finance discipline): cost per patient-encounter, reimbursement timing, budget thresholds that trigger operational changes
No other method in the OR toolkit can hold all of these simultaneously. Queueing formulas assume away most of them. Optimization models handle constraints but not stochastic dynamics over time. Network models capture topology but not temporal behavior at nodes. Simulation integrates them — not by solving a single equation, but by letting the components interact in simulated time and observing what emerges.
This integration capability is why the manifest identifies simulation’s integration point as “all disciplines.” It is the computational substrate on which multi-disciplinary analysis actually executes.
Product Owner Lens
What is the operational problem? Healthcare systems make high-stakes structural decisions — consolidation, expansion, service-line changes, staffing redesigns — using spreadsheet analysis that cannot capture the interaction effects that determine real outcomes.
What mechanism explains the behavior? Complex systems produce emergent behavior from the interaction of stochastic components. Closed-form methods require simplifying assumptions that eliminate these interactions. Simulation preserves them by running the system forward in time, event by event.
What intervention levers exist? Simulation itself is a meta-lever: it tests other interventions (staffing changes, routing rules, scheduling policies, facility decisions) before deployment. The lever is not what the simulation models — it is the ability to test before committing.
What should software surface? A scenario comparison interface showing key performance indicators (wait times, throughput, utilization, cost, access equity) across alternative configurations. Probability distributions of outcomes, not point estimates. Sensitivity analysis showing which input assumptions drive the difference between alternatives. Animated or timeline visualizations of entity flow for stakeholder communication.
What metric reveals degradation earliest? The model-reality divergence rate — the growing gap between simulation predictions and observed system behavior. When a validated simulation that once matched reality begins to diverge, one of its assumptions has been violated: demand patterns have shifted, staffing has changed, routing rules have drifted. The divergence is a signal that the system has moved into a regime the model no longer represents, and the model (or the system) needs attention. Track this as a rolling comparison of predicted vs. actual values for 3-5 key outputs, with an alert when any output diverges by more than two standard deviations of the simulation’s output distribution for three consecutive periods.
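The alert rule sketched above (k standard deviations for several consecutive periods) is only a few lines of logic. The series below are invented to show a drift beginning mid-stream:

```python
def divergence_alerts(predicted, observed, sim_sd, window=3, k=2.0):
    """Model-reality divergence monitor: flag any period where
    |observed - predicted| has exceeded k simulation standard
    deviations for `window` consecutive periods."""
    streak, alerts = 0, []
    for period, (p, o) in enumerate(zip(predicted, observed)):
        streak = streak + 1 if abs(o - p) > k * sim_sd else 0
        if streak >= window:
            alerts.append(period)
    return alerts

# Hypothetical rolling series for one KPI (mean door-to-provider minutes)
predicted = [30, 30, 31, 31, 30, 30, 31, 31]
observed  = [31, 29, 33, 38, 39, 40, 41, 42]   # drift begins mid-series
print(divergence_alerts(predicted, observed, sim_sd=3.0))  # → [5, 6, 7]
```

In practice this runs per-KPI for the 3-5 tracked outputs, with `sim_sd` taken from the validated simulation's own output distribution.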
Summary
Simulation is the analytical method healthcare operations reaches for when the system is too complex for closed-form solutions — and most real healthcare systems are too complex for closed-form solutions. The four paradigms serve different purposes: DES for operational flow (entity-level, event-driven), Monte Carlo for risk and uncertainty quantification (distribution-level, sampling-based), system dynamics for policy and workforce planning (aggregate-level, feedback-driven), and agent-based modeling for behavioral dynamics (individual-level, rule-driven). The simulation workflow — conceptual model, data collection, model building, verification, validation, experimentation, analysis — is not bureaucratic overhead; it is the difference between a decision support tool and an elaborate random number generator. Validation against historical data is the single most important quality gate, and it is the one most frequently skipped.
Simulation’s unique role in the CapabilityGraph is as the integration method: the analytical framework that can hold queueing dynamics, optimization constraints, network topology, scheduling rules, human factors, workforce economics, and financial constraints simultaneously. When the ED consolidation model reveals that a decision which improves average wait times also creates a transport gap that affects time-sensitive mortality — a finding invisible to any single-discipline analysis — it demonstrates what simulation is for: making emergent consequences visible before they become real.