Queueing Foundations

Module 2: Queueing Theory and Wait-Time Dynamics
Depth: Foundation | Target: ~3,000 words

Thesis: Queueing theory reveals that wait times are governed by utilization and variability, not by average capacity — and the relationship is violently nonlinear.


The Operational Problem

Every healthcare system is, at its core, a collection of queues. Patients wait for appointments. They wait in lobbies. They wait for beds, for lab results, for specialist referrals, for prior authorization decisions, for discharge orders. In behavioral health, they wait weeks or months just to begin treatment. These waits are not incidental friction — they are the dominant experience of care delivery from the patient’s perspective, and they are the primary signal of operational failure from the system’s perspective.

The instinctive response to long waits is to add capacity: more beds, more providers, more clinic hours. Sometimes that is correct. But queueing theory — the mathematical study of waiting lines — reveals something more important and far less intuitive: a system can have adequate average capacity and still produce catastrophic waits. The culprits are utilization and variability, and their interaction is nonlinear in a way that punishes operators who plan for averages.

This page establishes the formal foundations. What a queue is in the operations research sense. How queues are classified. What mathematical models predict about their behavior. And why these models, despite their simplifying assumptions, remain the most important analytical tool in healthcare operations.


What a Queue Actually Is

In operations research, a queue is not merely a line of people. It is a formal system with five components:

  1. An arrival process — entities (patients, orders, requests) arriving at some rate, with some pattern of variability
  2. A service mechanism — one or more servers (providers, beds, processors) that handle arrivals, each taking some amount of time
  3. A queue discipline — the rule governing who gets served next (first-come-first-served, priority, triage acuity)
  4. A system capacity — whether the queue has finite or infinite room (can patients balk or be turned away?)
  5. A population source — whether the potential arrivals are effectively infinite (a city’s ED-eligible population) or finite (a panel of enrolled patients)

This decomposition matters because it forces specificity. When a clinic director says “we have a wait time problem,” the queueing framework asks: Is the problem the arrival rate? The service rate? The variability in either? The queue discipline? The system’s capacity to absorb surges? Each diagnosis leads to a different intervention. Queueing theory does not just describe waits — it decomposes them into actionable components.


Kendall Notation: The Taxonomy of Queues

In 1953, David Kendall proposed a compact notation for classifying queues. The notation takes the form A/S/c, where:

  • A = the arrival process distribution (how entities show up)
  • S = the service time distribution (how long each service takes)
  • c = the number of parallel servers

The standard symbols for distributions are:

  • M — Markovian (Poisson arrivals / exponential service times); memoryless. Healthcare example: ED walk-in arrivals
  • D — Deterministic (constant, no variability). Healthcare example: scheduled 15-minute appointment slots (idealized)
  • G — General (any distribution). Healthcare example: surgical procedure durations, which follow neither exponential nor constant patterns

Extended Kendall notation adds three more positions — A/S/c/K/N/D — for system capacity (K), population size (N), and queue discipline (D). When unspecified, the defaults are infinite capacity, infinite population, and FIFO (first-in, first-out).
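The defaulting convention is mechanical enough to express in code. A minimal sketch in Python (the function name and field labels are illustrative, not from any standard library):

```python
def parse_kendall(spec: str) -> dict:
    """Parse basic/extended Kendall notation (A/S/c[/K/N/D]) into named fields.

    Unspecified trailing fields take the conventional defaults:
    infinite capacity, infinite population, FIFO discipline.
    """
    fields = ["arrivals", "service", "servers", "capacity", "population", "discipline"]
    defaults = [None, None, None, "inf", "inf", "FIFO"]
    parts = spec.split("/")
    if not 3 <= len(parts) <= 6:
        raise ValueError("expected 3 to 6 '/'-separated fields")
    values = parts + defaults[len(parts):]
    return dict(zip(fields, values))

print(parse_kendall("M/M/1"))
# {'arrivals': 'M', 'service': 'M', 'servers': '1',
#  'capacity': 'inf', 'population': 'inf', 'discipline': 'FIFO'}
```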

Why this matters operationally: Kendall notation is not pedantry. It is a diagnostic checklist. An M/M/1 model assumes Poisson arrivals, exponential service times, and a single server. If your system violates any of these assumptions — if service times have high variance (use M/G/1), if you have multiple providers (use M/M/c), if patients abandon the queue (you need an Erlang-A variant) — the wrong model will give you wrong answers about capacity, staffing, and wait times. Kendall notation forces you to name your assumptions before you calculate.


The Core Insight: Utilization, Variability, and the Nonlinear Explosion of Delay

The most important result in queueing theory for healthcare operators can be stated in one sentence: as utilization increases, expected wait time increases nonlinearly — and the nonlinearity becomes savage above roughly 80% utilization.

Define utilization as:

rho = lambda / (c * mu)

where lambda is the arrival rate, mu is the service rate per server, and c is the number of servers. Utilization (rho) is the fraction of available capacity being consumed. At rho = 0.5, half the capacity is used. At rho = 0.9, ninety percent.

For a single-server queue (M/M/1), the expected number of entities in the system is:

L = rho / (1 - rho)

At rho = 0.5, L = 1. At rho = 0.8, L = 4. At rho = 0.9, L = 9. At rho = 0.95, L = 19. The function has a vertical asymptote at rho = 1. This is not a gentle slope — it is a cliff.
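The cliff is easy to verify numerically; a few lines of Python reproduce the figures above:

```python
def mm1_number_in_system(rho: float) -> float:
    """Expected number in an M/M/1 system, L = rho / (1 - rho).

    Valid only for 0 <= rho < 1; the queue is unstable at rho >= 1.
    """
    if not 0 <= rho < 1:
        raise ValueError("M/M/1 is stable only for 0 <= rho < 1")
    return rho / (1 - rho)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"rho = {rho:.2f}  ->  L = {mm1_number_in_system(rho):5.1f}")
# rho = 0.50 -> 1.0; 0.80 -> 4.0; 0.90 -> 9.0; 0.95 -> 19.0; 0.99 -> 99.0
```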

This is the utilization-delay curve, and it is the single most underappreciated fact in healthcare capacity planning. A hospital ward running at 85% occupancy does not have 15% headroom — it has a queue that is building faster than most administrators realize. A clinic scheduling providers at 90% utilization is not “nearly full” — it is mathematically guaranteed to produce long, unpredictable waits.

Sir John Kingman formalized the combined effect of utilization and variability in what is now called the Kingman approximation (or VUT equation), published in 1961:

W_q ≈ (rho / (1 - rho)) * ((c_a^2 + c_s^2) / 2) * t_s

where c_a^2 is the squared coefficient of variation (CV) of inter-arrival times, c_s^2 is the squared CV of service times, and t_s is the mean service time. This formula makes the mechanism explicit:

  • The utilization term rho/(1-rho) is the hockey stick — it drives delay toward infinity as utilization approaches 1
  • The variability term (c_a^2 + c_s^2)/2 is a multiplier — higher variability in either arrivals or service times amplifies delay at every utilization level
  • The service time t_s scales everything — longer average service means longer waits, proportionally

The Kingman formula is an approximation for a G/G/1 queue, but its directional message holds across every queueing model: you cannot understand delay without understanding both utilization and variability, and their interaction is multiplicative.
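The approximation translates directly into code. A sketch, with illustrative parameter values (a 30-minute mean service time at 90% utilization):

```python
def kingman_wq(rho: float, ca2: float, cs2: float, mean_service: float) -> float:
    """Kingman (VUT) approximation for the mean queue wait of a G/G/1 queue:
    W_q ~= (rho / (1 - rho)) * ((ca2 + cs2) / 2) * t_s
    """
    if not 0 <= rho < 1:
        raise ValueError("approximation requires 0 <= rho < 1")
    return (rho / (1 - rho)) * ((ca2 + cs2) / 2) * mean_service

# Same utilization, same mean service time (30 min), different variability.
# ca2 = cs2 = 1 recovers the M/M/1-style wait; cutting service variability
# to cs2 = 0.25 shrinks the wait without touching capacity or demand.
print(kingman_wq(0.9, 1.0, 1.0, 30))   # 270 min
print(kingman_wq(0.9, 1.0, 0.25, 30))  # 168.75 min
```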

What This Means in a Hospital

Consider a 20-bed inpatient behavioral health unit. Suppose the average daily admission rate is 3.2 patients, and the average length of stay is 5.8 days. The unit’s effective utilization is:

rho = (3.2 * 5.8) / 20 = 18.56 / 20 = 0.928

At 93% utilization, this unit is deep into the steep region of the delay curve. The mathematical prediction: substantial queues will form. Patients needing admission will board in the ED or be diverted. Small increases in admission rate or length of stay — a flu season, a staffing-driven discharge slowdown — will produce disproportionate increases in wait time.

Now suppose the unit adds two beds, moving to 22 beds. Utilization drops to 0.844. The expected queue length drops by roughly half. Those two beds did not add 10% capacity — they moved the system off the steep part of the curve, producing a disproportionate reduction in delay. This is the nonlinearity working in your favor, and it is the strongest argument for maintaining operational buffer.
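The arithmetic can be checked in a few lines. Note the hedge in the code: a 20-bed unit is properly an M/M/c system, so the single-server rho/(1-rho) term below is only a proxy for relative queue pressure, not an exact queue-length prediction:

```python
def occupancy(admissions_per_day: float, avg_los_days: float, beds: int) -> float:
    """Offered-load utilization: rho = (arrival rate * mean LOS) / beds."""
    return admissions_per_day * avg_los_days / beds

def queue_pressure_proxy(rho: float) -> float:
    """Single-server proxy rho / (1 - rho); directional only for a multi-bed unit."""
    return rho / (1 - rho)

before = occupancy(3.2, 5.8, 20)   # 0.928
after = occupancy(3.2, 5.8, 22)    # ~0.844
print(f"20 beds: rho = {before:.3f}, pressure ~ {queue_pressure_proxy(before):.1f}")
print(f"22 beds: rho = {after:.3f}, pressure ~ {queue_pressure_proxy(after):.1f}")
```

Even with the proxy, the shape of the result matches the text: a 10% capacity addition yields a disproportionate drop in queue pressure because it moves the system off the steep part of the curve.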


M/M/1: The Baseline Model

The M/M/1 queue — Poisson arrivals, exponential service times, one server, infinite capacity, infinite population, FIFO discipline — is the simplest stochastic queue with closed-form results. It is the model every other model is compared against.

Key results for M/M/1 (rho = lambda/mu < 1):

  • Expected number in system: L = rho / (1 - rho)
  • Expected number in queue: L_q = rho^2 / (1 - rho)
  • Expected time in system: W = 1 / (mu - lambda)
  • Expected time in queue: W_q = rho / (mu - lambda)

These are connected by Little’s Law (John D.C. Little, 1961): L = lambda * W. Little’s Law is treated in full in the companion page (02-littles-law.md), but its importance here is foundational. It states that the long-run average number of entities in a stable system equals the long-run arrival rate multiplied by the average time each entity spends in the system. It requires no assumptions about arrival or service distributions — it holds for any stable queue.
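The four results and the Little's Law identities can be bundled into one sketch (the arrival and service rates below are illustrative):

```python
def mm1_metrics(lam: float, mu: float) -> dict:
    """Closed-form M/M/1 results; valid only when lam < mu (rho < 1)."""
    if lam >= mu:
        raise ValueError("unstable: require lambda < mu")
    rho = lam / mu
    return {
        "rho": rho,
        "L": rho / (1 - rho),        # expected number in system
        "Lq": rho**2 / (1 - rho),    # expected number in queue
        "W": 1 / (mu - lam),         # expected time in system
        "Wq": rho / (mu - lam),      # expected time in queue
    }

m = mm1_metrics(lam=4.0, mu=5.0)  # e.g. 4 arrivals/hr, 5 served/hr
# Little's Law consistency checks: L = lambda * W and Lq = lambda * Wq
assert abs(m["L"] - 4.0 * m["W"]) < 1e-9
assert abs(m["Lq"] - 4.0 * m["Wq"]) < 1e-9
print(m)  # rho = 0.8, L = 4, Lq = 3.2, W = 1 hr, Wq = 0.8 hr
```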

Where M/M/1 applies in healthcare: The single-server model maps to any resource that is one-of-a-kind and cannot be parallelized. A single MRI machine. A sole community behavioral health provider. A prior authorization desk staffed by one reviewer. A rural critical access hospital’s single OR suite. In each case, the M/M/1 results give the baseline prediction: at what utilization will waits become unacceptable?

Where M/M/1 breaks: Its assumptions are rarely met in full. Service times in healthcare are not exponential — surgical procedures, therapy sessions, and complex discharges have distributions with heavier tails and sometimes bimodal structure. Arrivals are not purely Poisson over a full day — ED arrivals follow time-of-day patterns with a pronounced afternoon peak. And most healthcare resources involve multiple parallel servers, not one. The M/M/1 model is the teaching tool that calibrates intuition; real systems require extensions.


M/M/c: Multi-Server Queues and the Power of Pooling

Most healthcare settings involve multiple servers: an ED with 4 physicians on shift, a hospital ward with 30 beds, a call center with 8 schedulers. The M/M/c model extends M/M/1 to c parallel servers sharing a single queue.

The M/M/c model produces the Erlang C formula (named for A.K. Erlang, whose foundational work in telephone traffic engineering in the early 1900s launched queueing theory). The Erlang C formula gives the probability that an arriving entity must wait — that all c servers are busy. From this probability, the expected wait can be derived.

The mathematics are more complex than M/M/1, but the operational insight is clear and powerful: pooling servers reduces wait time far more than adding the same capacity in isolated queues.

The Pooling Effect

Suppose a health system operates two clinics, each with 2 providers serving an arrival rate of 3 patients per hour, with an average service time of 30 minutes per patient. Each clinic has utilization rho = 3 / (2 * 2) = 0.75. Both clinics will have moderate queues.

Now suppose the system consolidates into one clinic with 4 providers and a combined arrival rate of 6 patients per hour. Utilization is unchanged: rho = 6 / (4 * 2) = 0.75. But the expected wait time drops substantially: for these parameters, an Erlang C calculation puts the reduction in expected queue wait at roughly 60%. This is variability pooling: the larger pool of servers is less likely to have all servers simultaneously busy, because the random peaks in one stream are offset by random troughs in the other.
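Under M/M/c assumptions the comparison is reproducible with the Erlang C formula. A self-contained sketch, using the rates from the example above:

```python
from math import factorial

def erlang_c_wait(lam: float, mu: float, c: int) -> float:
    """Mean queue wait W_q for an M/M/c queue via the Erlang C formula."""
    a = lam / mu              # offered load in server-equivalents
    rho = a / c
    if rho >= 1:
        raise ValueError("unstable: require lam < c * mu")
    # P(wait) = [a^c / (c!(1-rho))] / [sum_{k<c} a^k/k! + a^c/(c!(1-rho))]
    tail = a**c / (factorial(c) * (1 - rho))
    p_wait = tail / (sum(a**k / factorial(k) for k in range(c)) + tail)
    return p_wait / (c * mu - lam)

# Two separate clinics: each M/M/2 with lam = 3/hr, mu = 2/hr (30-min visits)
separate = erlang_c_wait(3.0, 2.0, 2)   # ~0.64 hr, about 39 min
# One pooled clinic: M/M/4 with lam = 6/hr, same rho = 0.75
pooled = erlang_c_wait(6.0, 2.0, 4)     # ~0.25 hr, about 15 min
print(f"separate: {separate * 60:.1f} min, pooled: {pooled * 60:.1f} min")
```

With identical utilization on both sides, pooling cuts the expected queue wait by roughly 60% in this configuration; exact figures vary with the parameters.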

This is the mathematical basis for centralized triage, consolidated referral processing, and shared specialist pools. It is also why fragmenting resources — giving each clinic “its own” scheduler, splitting a nursing pool by wing — often increases wait times even when total capacity is unchanged.

Erlang-Based Staffing

Linda Green at Columbia Business School demonstrated the practical power of Erlang models for hospital staffing. Her work showed that applying M/M/c queueing models to hospital bed management could predict when occupancy levels would produce unacceptable boarding delays — and that the critical threshold was not a fixed percentage but depended on the number of beds (smaller units hit the steep part of the curve at lower utilization) and the variability of length of stay.

The related Erlang B (loss) formula applies when there is no queue — arrivals that find all servers busy are lost. This models hospital beds when diversion occurs: if all beds are full, the ambulance goes elsewhere. Erlang B answers the question: “How many beds do we need to keep the probability of turning away a patient below X%?”

The Erlang-R model (Yom-Tov and Mandelbaum, 2014) extends these ideas specifically for healthcare, where patients are “reentrant” — a patient seen by a nurse may be sent for imaging, then return to the nurse for follow-up. Standard Erlang models underestimate staffing needs because they do not account for this cycling.


M/G/1 and the Pollaczek-Khinchine Formula: Why Service Time Variability Matters

The M/M/1 and M/M/c models assume exponential service times. In healthcare, this assumption is often wrong. Surgical procedure times are not exponential — they have a minimum duration, a mode, and an extended right tail for complications. Behavioral health sessions may be 50 minutes (therapy) or 15 minutes (medication management), creating a bimodal distribution. ED treatment times vary enormously by acuity.

The M/G/1 model relaxes the service time assumption, allowing any (General) distribution. The key result is the Pollaczek-Khinchine (P-K) mean value formula, derived independently by Felix Pollaczek (1930) and Aleksandr Khinchine (1932):

L_q = (rho^2 * (1 + c_s^2)) / (2 * (1 - rho))

where c_s^2 = Var(S) / [E(S)]^2 is the squared coefficient of variation of service time.

The P-K formula reveals something that the M/M/1 model hides: two systems with identical utilization and identical mean service time will have very different queue lengths if their service time variability differs. The queue length is directly proportional to (1 + c_s^2). For exponential service (c_s^2 = 1), you get the M/M/1 result. For deterministic service (c_s^2 = 0), queue length is halved. For high-variability service (c_s^2 = 3 or 4, common in surgical suites), queue length doubles or triples relative to exponential.
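The multiplier is easy to see numerically. A sketch of the P-K queue length across variability levels (the utilization value is illustrative):

```python
def pk_queue_length(rho: float, cs2: float) -> float:
    """Pollaczek-Khinchine mean number in queue for M/G/1:
    L_q = rho^2 * (1 + cs2) / (2 * (1 - rho))
    """
    if not 0 <= rho < 1:
        raise ValueError("require 0 <= rho < 1")
    return rho**2 * (1 + cs2) / (2 * (1 - rho))

rho = 0.85
for label, cs2 in [("deterministic", 0.0), ("exponential", 1.0), ("high-variance", 4.0)]:
    print(f"{label:14s} cs2 = {cs2:.1f}  Lq = {pk_queue_length(rho, cs2):6.2f}")
# Same rho, same mean service time: deterministic service halves the
# exponential queue; cs2 = 4 multiplies it by 2.5.
```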

The operational implication is direct: reducing variability in service times — through standardized protocols, better pre-procedure preparation, predictable discharge processes — reduces wait times even without adding capacity or reducing demand. This is the queueing-theoretic basis for Lean healthcare’s emphasis on standardization. It is not about rigidity; it is about reducing the variance multiplier in the P-K formula.

A Surgical Suite Example

Consider a single OR suite handling general surgery cases. Average procedure time is 90 minutes. If the CV of procedure time is 0.5 (relatively standardized cases), c_s^2 = 0.25, and the variability term (1 + c_s^2)/2 = 0.625. If instead the suite handles a mixed caseload with CV = 1.2 (mix of short scopes and long tumor resections), c_s^2 = 1.44, and the variability term = 1.22. At the same utilization, the mixed-caseload suite will have roughly twice the expected queue length. This is why surgical scheduling that groups similar-duration cases outperforms random sequencing: it is not just convenient, it directly shrinks the variance multiplier in the P-K formula.
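The two variability terms in the example check out directly:

```python
def variability_term(cv: float) -> float:
    """(1 + cs2) / 2 multiplier from the P-K formula, from the service-time CV."""
    return (1 + cv**2) / 2

standardized = variability_term(0.5)   # 0.625
mixed = variability_term(1.2)          # 1.22
print(f"mixed caseload queue multiplier: {mixed / standardized:.2f}x")  # ~1.95x
```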


Queue Discipline: Triage as a Priority Queue

In the default queueing model, service is FIFO — first-in, first-out. Healthcare almost never operates this way. Emergency departments use triage (ESI levels 1-5). Organ transplant lists use medical urgency. Prior authorization queues may prioritize by payer or clinical urgency. Behavioral health waitlists often lack any formal discipline at all, which is itself a (bad) choice.

Priority queues assign arriving entities to priority classes. Higher-priority entities are served before lower-priority entities regardless of arrival time. The mathematical result is stark: priority queueing reduces wait time for high-priority entities at the direct expense of lower-priority ones, and the total average wait across all classes is unchanged or slightly worse (due to the preemption overhead in some variants).

This means triage does not reduce waits — it redistributes them. An ESI-1 patient (resuscitation) is seen immediately because ESI-4 and ESI-5 patients absorb the delay. This is the correct operational choice, but it must be understood as a zero-sum reallocation, not a throughput improvement. When an ED is crowded, faster service for critical patients comes entirely from longer waits for lower-acuity patients.

The practical consequence: improvements to queue discipline (better triage accuracy, faster ESI assignment) improve outcomes within a given capacity, but they cannot substitute for adequate capacity. A perfectly triaged ED that is running at 95% utilization will still have long waits for non-critical patients. Triage optimizes the allocation of suffering; it does not reduce the total.


Abandonment and Balking: The Queue You Cannot See

Standard queueing models assume infinite patience — entities wait forever. Real patients do not. They leave the ED without being seen (LWBS). They fail to follow through on specialist referrals. They stop calling the scheduling line after being on hold for 20 minutes. They find an alternative provider or simply go without care.

The national average LWBS rate for US emergency departments is approximately 2%, but this figure obscures enormous variation. High-volume urban EDs report rates of 10-15%, and LWBS rates rise sharply with wait time. Studies show the inflection point around 60-90 minutes: patients with low-acuity needs who face waits exceeding this threshold abandon at dramatically higher rates.

In queueing theory, this is modeled as abandonment (leaving after joining the queue) and balking (refusing to join upon seeing the queue length). The Erlang-A model (A for abandonment) extends the Erlang C framework by adding a patience distribution, allowing the model to predict what fraction of demand is “served” versus “lost.”

Why this matters for measurement: Abandonment suppresses visible wait times. If your most delay-sensitive patients leave, the average wait time of those who remain looks better than the actual access problem. An ED that reports a 45-minute average wait time but has an 8% LWBS rate has a different — and worse — access reality than one with a 50-minute average wait and 1% LWBS. The queue you can see (patients still waiting) understates the queue you should care about (everyone who needed care).

LWBS is not a patient compliance problem. It is a queue overflow signal. Every LWBS event represents a failure of the system to serve demand within the population’s tolerance for delay.


The Limits of Queueing Models

Queueing theory is powerful, but its power comes from simplifying assumptions that real healthcare systems regularly violate:

Non-stationary arrivals. Standard models assume constant arrival rates. ED arrivals peak in the afternoon and trough at 4 AM. Clinic demand is seasonal. Grant processing has fiscal-year-end spikes. Time-varying arrival rates require modified models (the “Modified Offered Load” approach, or pointwise stationary approximations) that lose the clean closed-form results.

Complex routing. A patient in an ED does not pass through a single queue. They are triaged, placed in a bed, seen by a physician, sent to imaging, returned for results review, possibly admitted. This is a queueing network, not a single queue — and the interactions between queues (a backed-up imaging department delays physician throughput, which delays bed turnover) create cascading effects that single-queue models cannot capture.

State-dependent service rates. When an ED is overwhelmed, physicians speed up (shorter evaluations, faster dispositions) or slow down (cognitive overload, decision fatigue). Service rates are not independent of queue length — they are endogenous. This is where queueing theory intersects Human Factors Module 2: the same high-utilization state that produces long waits also degrades the quality of clinical decisions being made under time pressure.

Heterogeneous servers. Not all providers are interchangeable. A newly credentialed PA and a 20-year attending have different effective service rates, different case-mix capabilities, and different quality profiles. The M/M/c model assumes identical servers.

When to switch to simulation. When the system involves non-stationary arrivals, complex multi-stage routing, state-dependent service rates, or heterogeneous servers — which describes most real hospital operations — closed-form queueing models give directional insight but not operational precision. This is where discrete-event simulation (Module 6) takes over: you build the system computationally, with all its realistic messiness, and run thousands of replications to estimate performance. Queueing theory tells you which parameters matter and roughly how. Simulation tells you exactly what will happen with this specific configuration.

The relationship is not competitive — it is sequential. Use queueing theory first to understand the mechanism, identify the dominant drivers, and estimate the ballpark. Use simulation when you need to test a specific intervention against a realistic model of the actual system.


Integration Points

Human Factors Module 2: Fatigue and Decision Degradation. The utilization-delay curve has a twin that operates on clinicians rather than patients. A provider at 92% utilization is not just producing long patient waits — they are operating in a cognitive state where decision quality degrades, error rates increase, and burnout accelerates. The same rho/(1-rho) dynamic that predicts queue buildup predicts the erosion of cognitive margin. Staffing decisions that ignore this coupling optimize for throughput at the cost of safety. The queueing model says you need buffer to control waits; the human factors model says you need the same buffer to control errors. The argument for operational slack is doubly reinforced.

Workforce Module 1: Workforce as Capacity Infrastructure. In every queueing formula on this page, the service rate mu and server count c are set by staffing decisions. A vacancy does not just reduce headcount — it increases rho, pushing the system toward the steep region of the delay curve. If a 4-provider clinic loses one provider and does not proportionally reduce demand, utilization jumps from 0.75 to 1.0, and the queue becomes unstable (infinite expected wait). This is why single-provider vacancies at small sites produce access crises that appear disproportionate to the “one person” lost. The queueing model makes this arithmetic explicit: small absolute reductions in c produce large relative increases in rho at small-c sites.
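The vacancy arithmetic can be made explicit in a few lines. A sketch; the absolute rates below are hypothetical, and only their ratio (3 provider-equivalents of demand) matters:

```python
def utilization(lam: float, mu: float, c: int) -> float:
    """rho = lam / (c * mu): arrival rate over total service capacity."""
    return lam / (c * mu)

# A 4-provider clinic at rho = 0.75: e.g. 12 arrivals/hr, 4 visits/hr per provider
lam, mu = 12.0, 4.0
print(utilization(lam, mu, 4))  # 0.75 -- stable, moderate queues
print(utilization(lam, mu, 3))  # 1.0  -- at capacity: the queue grows without bound
```

Losing one of four providers is a 25% absolute reduction in c, but it carries the system across the stability boundary; the same vacancy at a 40-provider site would move rho from 0.75 to about 0.77.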


Product Owner Lens

What is the operational problem? Patients wait too long, staff are overloaded, and administrators lack the analytical framework to connect these symptoms to their root causes — utilization levels and variability patterns that are measurable but unmeasured.

What mechanism explains the system behavior? The nonlinear relationship between utilization and delay, amplified by variability in both arrivals and service times. The mechanism is mathematical: rho/(1-rho) with a variability multiplier.

What intervention levers exist?

  • Reduce utilization: add capacity (beds, providers, hours) or manage demand (diversion, load-leveling, demand smoothing)
  • Reduce variability: standardize service times (protocols, pre-visit planning), smooth arrivals (scheduling discipline, staggered appointments)
  • Pool resources: consolidate fragmented queues into shared pools (centralized triage, float pools, unified referral processing)
  • Improve queue discipline: implement or refine priority rules so the right patients are served first (better triage, acuity-based scheduling)

What should software surface?

  • Real-time utilization by resource (beds, providers, rooms, authorization reviewers) with color thresholds at 75%, 85%, 90%
  • Wait time distributions, not just averages — the 90th percentile wait is more actionable than the mean
  • LWBS and abandonment rates as queue overflow indicators, tracked hourly
  • Arrival rate vs. service rate trending to predict queue instability before it manifests as long waits
  • Coefficient of variation of service times by service line, to identify high-variability processes

What metric reveals degradation earliest? The ratio of arrival rate to departure rate over a rolling window (2-4 hours in an ED, daily in a clinic). When this ratio exceeds 1.0 persistently, a queue is building — even if current wait times are still acceptable. By the time wait times are visibly bad, the queue has already accumulated. The input-output ratio is the leading indicator; the wait time is the lagging one.
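One way such an indicator could be sketched in software. This is a hypothetical monitor, not an existing tool; it consumes per-interval arrival and departure counts (e.g. hourly) over a rolling window:

```python
from collections import deque

class FlowRatioMonitor:
    """Rolling arrival/departure ratio as a leading indicator of queue buildup.

    A sustained ratio above 1.0 means a queue is accumulating, even while
    current wait times still look acceptable.
    """
    def __init__(self, window: int = 4):
        self.arrivals = deque(maxlen=window)
        self.departures = deque(maxlen=window)

    def record(self, arrived: int, departed: int) -> float:
        self.arrivals.append(arrived)
        self.departures.append(departed)
        total_out = sum(self.departures)
        if total_out == 0:
            return float("inf")
        return sum(self.arrivals) / total_out

monitor = FlowRatioMonitor(window=4)
for arrived, departed in [(10, 10), (12, 10), (14, 11), (15, 11)]:
    ratio = monitor.record(arrived, departed)
print(f"rolling input/output ratio: {ratio:.2f}")  # > 1.0: a queue is building
```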


Summary

Queueing theory provides the mathematical language for the most universal problem in healthcare operations: waiting. Its core models — M/M/1, M/M/c, M/G/1 — are simplifications, but they reveal the mechanisms that govern delay in every setting from a rural clinic to a Level I trauma center. The central insight is not complicated, but it is profoundly counterintuitive to operators trained on averages: a system that looks like it has enough capacity on paper will produce unacceptable waits if utilization is high and variability is unmanaged. The relationship is not proportional. It is explosive. And the only way to manage it is to measure it.

The formulas on this page — Kingman’s approximation, the Erlang models, the Pollaczek-Khinchine result — are not academic exercises. They are the minimum viable toolkit for any operator who wants to understand why their system behaves the way it does, and what it would take to change it.