Resilience Engineering: From Preventing Failure to Enabling Recovery

Module 5: Human Error, Failure Modes, and Recovery | Depth: Application | Target: ~2,000 words

Thesis: Resilience engineering shifts the focus from preventing failure to enabling recovery — designing systems that detect, adapt, and recover when things go wrong, as they inevitably will.

The Operational Problem

A 25-bed critical access hospital in rural eastern Washington staffs two RNs for the overnight shift covering a 12-bed med-surg unit. At 3:14 AM, a 74-year-old post-surgical patient in bed 6 develops acute respiratory distress, with SpO2 falling from 94% to 82% over eight minutes. The protocol says: call the rapid response team. There is no rapid response team. The hospital’s physician is on call from home, twenty minutes away. The RT is cross-covering the ED. It is these two nurses, right now, with what they have.

Here is what actually happens. Nurse A recognizes the deterioration pattern from experience, initiates high-flow oxygen, repositions the patient, and begins a focused assessment. Nurse B redistributes monitoring tasks for the remaining eleven patients, pulls the crash cart to the hallway outside bed 6, and pages the on-call physician with a concise SBAR — situation, background, assessment, recommendation — that includes a specific request: “I need a verbal order for BiPAP and a chest X-ray, and I need you here if he doesn’t stabilize in fifteen minutes.” The physician, trusting the nurse’s judgment from years of working together, gives the orders by phone. The patient stabilizes. The incident never becomes a code. No harm report is filed. No one outside the unit ever knows it happened.

A traditional safety analysis — what Hollnagel calls Safety-I — would see this event through the lens of protocol deviation. The rapid response protocol was not followed. Documentation was incomplete. The verbal order process deviated from the standard. A compliance review might generate a corrective action plan requiring the hospital to “ensure adherence to the rapid response protocol” — a protocol designed for a 400-bed urban medical center with a 24/7 in-house team, applied without modification to a facility where no such team exists.

A Safety-II analysis sees something entirely different. It sees two experienced clinicians detecting a deterioration pattern early, improvising an effective response with available resources, redistributing workload in real time, communicating precisely under pressure, and making a sound judgment call about when to involve the physician. It sees expertise, coordination, and adaptation preventing a worse outcome. It sees resilience.

The question is not whether the protocol was followed. The question is why the patient survived.

Safety-I and Safety-II: Two Paradigms for Understanding Safety

Erik Hollnagel introduced the Safety-I/Safety-II distinction in his 2014 work Safety-I and Safety-II: The Past and Future of Safety Management. The distinction is not merely semantic — it represents a fundamental shift in what safety management tries to understand, measure, and improve.

Safety-I defines safety as the absence of adverse events. Its logic is straightforward: identify what goes wrong, find the cause, eliminate or contain it. The tools of Safety-I are root cause analysis, incident investigation, barrier models, and compliance protocols. The underlying assumption is that things go right because the system functions as designed, and things go wrong because something — a component, a person, a process — deviates from the design. Safety-I asks: What went wrong?

Safety-II defines safety as the presence of adaptive capacity. Its logic is different: things go right not because the system works as designed, but because people continuously adjust their performance to match the conditions — conditions that the design could not fully anticipate. Safety-II asks: What goes right, and how?

The distinction matters operationally because Safety-I and Safety-II lead to fundamentally different interventions. Safety-I produces more barriers, more protocols, more standardization, more compliance monitoring. Safety-II produces better situation awareness, more flexible tools, clearer escalation paths, and organizational permission to adapt. Safety-I treats deviation as a threat. Safety-II treats the capacity for deviation as the last line of defense.

Almost all safety work in healthcare today is Safety-I. Incident reporting systems, sentinel event reviews, The Joint Commission’s National Patient Safety Goals, root cause analysis requirements — all focus on what went wrong. This is not wrong. Preventing known failure modes is necessary. But it is incomplete, because complex systems generate more failure modes than any prevention program can enumerate. The safety that remains after every protocol has been written and every barrier has been installed comes from the people who adapt when the protocols do not fit and the barriers do not hold.

The Four Cornerstones of Resilience

Hollnagel’s resilience engineering framework (2009, 2011) identifies four capabilities that a resilient system must possess. These are not compliance items to be checked off. They are dynamic capacities that must be cultivated, maintained, and exercised.

Responding — knowing what to do now. The capacity to address the actual situation as it unfolds, not the situation that was planned for. This requires accurate perception of current conditions, a repertoire of responses, and the authority to act. In the overnight scenario, the nurses’ ability to respond depended on clinical expertise (recognizing the deterioration pattern), resource knowledge (knowing what equipment was available and where), and organizational permission (confidence that improvising would be supported, not punished).

Monitoring — knowing what to look for. The capacity to detect changes in conditions that could lead to disruption or opportunity. Monitoring is not passive data collection; it is active surveillance guided by a model of what matters. The nurse who noticed the SpO2 trend was not responding to an alarm — she was monitoring a patient she had flagged as higher-risk based on surgical history and age. Effective monitoring requires both the right information and the cognitive bandwidth to interpret it.

Learning — knowing what happened. The capacity to extract lessons from experience — both from failures and from successes. Safety-I organizations learn primarily from incidents. Resilient organizations also learn from everyday adaptations — the workarounds, improvisations, and judgment calls that keep the system functioning. If the overnight scenario is never discussed, never debriefed, never analyzed for what went right, the organization loses the opportunity to understand its own resilience. The expertise that prevented a code blue remains tacit, locked in two nurses who happen to work well together.

Anticipating — knowing what to expect. The capacity to foresee future threats and opportunities before they materialize. Anticipation goes beyond risk assessment (which catalogs known hazards) to include imagining novel failure modes and changed operating conditions. A hospital that anticipates the overnight staffing vulnerability — before the crisis — might pre-position equipment, establish standing orders for common deterioration scenarios, or cross-train respiratory therapy to respond from the ED. Anticipation converts future adaptation needs into present design decisions.

These four capabilities operate at different time horizons. Responding is immediate (seconds to minutes). Monitoring is continuous (minutes to hours). Learning is retrospective (days to months). Anticipating is prospective (months to years). A system that is strong on responding but weak on anticipating will survive individual crises but be surprised by systemic ones. A system that is strong on anticipating but weak on responding will have excellent plans that fail on contact with reality.
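The monitoring cornerstone's contrast between reacting to an alarm and watching a trend can be made concrete with a small sketch. It assumes the SpO2 fall in the opening scenario was linear and sampled once a minute (the source gives only the endpoints). It also uses a hypothetical alarm threshold of 88% and a hypothetical trend rule of a 3-point fall within 2 minutes. These values are placeholders for illustration, not clinical parameters.

```python
from typing import Optional, Sequence


def first_threshold_breach(readings: Sequence[float], threshold: float) -> Optional[int]:
    """Return the first minute at which a reading is below the alarm threshold."""
    for minute, value in enumerate(readings):
        if value < threshold:
            return minute
    return None


def first_trend_flag(readings: Sequence[float], window: int, min_drop: float) -> Optional[int]:
    """Return the first minute at which the reading has fallen by at least
    `min_drop` points compared with the reading `window` minutes earlier."""
    for minute in range(window, len(readings)):
        if readings[minute - window] - readings[minute] >= min_drop:
            return minute
    return None


# Assumed shape of the scenario: 94% falling to 82% over eight minutes,
# linear, one reading per minute.
readings = [94 - 1.5 * minute for minute in range(9)]

alarm_minute = first_threshold_breach(readings, threshold=88)    # hypothetical threshold
trend_minute = first_trend_flag(readings, window=2, min_drop=3)  # hypothetical trend rule

print(f"threshold alarm fires at minute {alarm_minute}")
print(f"trend rule fires at minute {trend_minute}")
```

Under these assumptions the trend rule fires at minute 2 and the threshold alarm at minute 5. That gap is the head start that model-guided monitoring provides over waiting for an alarm.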

Why Prevention Alone Fails

The prevention-only model assumes that safety can be achieved by identifying and eliminating failure modes. For simple systems with a finite number of failure modes, this works. For complex adaptive systems — which is what healthcare delivery is — it does not, and the reason is mathematical, not philosophical.

Complex systems have combinatorial failure spaces. A workflow with 20 steps, each with 3 possible states (normal, degraded, failed), has 3^20 — approximately 3.5 billion — possible system states. No prevention program can enumerate them all. No protocol library can address them all. No training program can prepare clinicians for them all. Woods (2006) described this as the “problem of the unanticipated” — the recognition that in complex systems, the most dangerous failures are precisely the ones that were not foreseen.
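The arithmetic behind that figure is short enough to check directly. The sketch below assumes the steps vary independently, which is what multiplying the per-step state counts implies. The function name and the one-state-per-second review rate are illustrative choices, not part of the source.

```python
def state_space_size(steps: int, states_per_step: int) -> int:
    """Count the combined states of a workflow whose steps vary independently."""
    return states_per_step ** steps


SECONDS_PER_YEAR = 60 * 60 * 24 * 365

size = state_space_size(steps=20, states_per_step=3)
years_to_review = size / SECONDS_PER_YEAR  # reviewing one state per second, nonstop

print(f"{size:,} possible states")  # 3,486,784,401 possible states
print(f"roughly {years_to_review:.1f} years to review at one state per second")
```

Each added step multiplies the total by three, so a 21-step workflow already exceeds ten billion states. The growth rate, more than any single total, is why enumeration cannot keep up.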

The remaining safety, the safety that exists after every known failure mode has a barrier in front of it, comes from human adaptation. Workarounds. Improvisations. Judgment calls. Escalations. These are not protocol deviations to be eliminated. They are the mechanism by which complex systems actually stay safe. Braithwaite et al. (2015) estimated that approximately 95% of healthcare encounters proceed without incident, and that this success rate is maintained not because protocols cover every contingency, but because clinicians continuously adjust their performance to bridge the gap between what the system provides and what the situation demands.

This creates a design imperative: if human adaptation is the primary safety mechanism in complex systems, then the system should be designed to support adaptation, not suppress it.

The Paradox of Safety Metrics

Traditional safety metrics count what goes wrong: incident reports, sentinel events, near-misses, medication errors, falls, hospital-acquired infections. These are Safety-I metrics. They are important. They are also structurally incomplete.

The paradox is this: safety is produced by the thousands of daily adaptations that prevent incidents from occurring. A nurse who catches a medication discrepancy during a scan. A pharmacist who calls to clarify a dose that is technically within range but unusual for this patient’s renal function. A charge nurse who redistributes assignments when she notices a new admit is about to overwhelm a less experienced colleague. None of these become data points. You cannot count what did not happen.

Hollnagel (2014) noted that organizations focused exclusively on Safety-I metrics develop a distorted picture of their own safety. A unit with zero incident reports might be genuinely safe — or it might be a unit where the reporting culture has collapsed, or where clinicians are performing heroic adaptations every shift to compensate for systemic deficiencies that never surface as reportable events. The zero tells you nothing about which reality you are living in.

This has direct implications for safety measurement. A resilience-oriented approach supplements failure metrics with capacity metrics: staffing ratios that permit adaptation (not just minimum coverage), equipment availability at the point of need, communication channel reliability, time available for clinical judgment (not consumed by documentation), and debriefing frequency. These do not measure safety directly. They measure the conditions that make adaptive safety possible.
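One way to picture the pairing of failure metrics with capacity metrics is a per-unit record that holds both. The sketch below is a minimal illustration: the field names, the floors, and the two example units are all invented for this example, and a real program would need locally validated measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitSafetyView:
    """Hypothetical per-unit record pairing a failure metric with capacity metrics."""
    unit: str
    incident_reports: int              # Safety-I failure metric
    nurses_above_minimum: int          # staffing slack beyond minimum coverage
    equipment_at_point_of_need: float  # fraction of checks where equipment was at hand
    debriefs_per_month: int


def capacity_concerns(view: UnitSafetyView) -> list[str]:
    """List capacity conditions that fall below illustrative (assumed) floors."""
    concerns = []
    if view.nurses_above_minimum < 1:
        concerns.append("no staffing slack")
    if view.equipment_at_point_of_need < 0.9:
        concerns.append("equipment often not at hand")
    if view.debriefs_per_month < 1:
        concerns.append("no debriefing")
    return concerns


# Two invented units with the same incident count and different capacity.
unit_a = UnitSafetyView("A", incident_reports=0, nurses_above_minimum=1,
                        equipment_at_point_of_need=0.97, debriefs_per_month=4)
unit_b = UnitSafetyView("B", incident_reports=0, nurses_above_minimum=0,
                        equipment_at_point_of_need=0.72, debriefs_per_month=0)

for view in (unit_a, unit_b):
    print(view.unit, view.incident_reports, capacity_concerns(view))
```

Both units report zero incidents, yet only the capacity fields separate the unit with margin from the one running on heroic adaptation.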

Work-as-Imagined Versus Work-as-Done

Hollnagel’s distinction between work-as-imagined (WAI) and work-as-done (WAD) is one of the most operationally useful concepts in resilience engineering. Work-as-imagined is how managers, regulators, and protocol designers believe work is performed. Work-as-done is how it actually happens.

The gap between WAI and WAD is not a compliance failure. It is an inevitable consequence of the fact that protocols are written for anticipated conditions, and real conditions vary continuously. Dekker (2006, 2011) documented extensively how the gap between prescribed procedures and actual practice is where safety expertise lives — practitioners adjust, improvise, and adapt precisely because the prescribed procedure does not fit the situation. Closing this gap by enforcing stricter compliance does not improve safety. It removes the adaptive capacity that was compensating for the protocol’s limitations.

The overnight staffing scenario illustrates the point precisely. The rapid response protocol imagines a world where a dedicated team exists. Work-as-done is two nurses creating an effective response from available resources. If management’s response to this gap is “follow the protocol,” the result is not improved safety. It is a requirement to activate a team that does not exist, which will produce either a workaround (calling it a rapid response when it is actually the same two nurses) or a delay (waiting for a response that never comes while the patient deteriorates).

The productive response is to study the gap. What are nurses actually doing at 3 AM when patients deteriorate? What resources do they actually use? What decisions do they actually make? What information do they need that they do not have? The answers to these questions — obtained by observing and debriefing work-as-done, not by auditing compliance with work-as-imagined — are the raw material for genuine safety improvement.

Designing for Resilience

If resilience depends on human adaptation, then system design should support the conditions under which adaptation succeeds. Hollnagel, Woods, and Dekker converge on several design principles:

Visible system state. Adaptation requires accurate perception of current conditions. When system state is hidden — patient status buried in chart tabs, staffing levels unknown to the charge nurse, equipment location uncertain — adaptation is delayed or misdirected. Design implication: real-time visibility into patient acuity, staffing, equipment, and workload at the unit level, not as a management dashboard but as a clinical tool.

Flexible tools. Rigid workflows that enforce a single path break when conditions deviate from assumptions. Flexible tools provide capability without prescribing sequence. A medication administration system that hard-stops on any deviation treats every departure as an error. A system that distinguishes hard stops (wrong patient, wrong drug) from soft alerts (unusual dose, unusual timing) preserves the clinician’s ability to adapt to non-standard situations while still catching genuine errors.

Clear escalation paths. When the situation exceeds the capacity of the immediate team, escalation must be fast, clear, and low-friction. In the overnight scenario, the nurse’s ability to reach the physician with a concise, structured communication — and the physician’s willingness to act on it — was the escalation path. Systems that require multiple approvals, hierarchical chains, or formal activation criteria before escalation can occur are designing delay into the escalation path.

Organizational freedom to deviate. The concept of “just culture,” as developed by Dekker (2007), holds that organizations must distinguish between acceptable and unacceptable behavior without defaulting to blame for every deviation. If clinicians believe that any protocol deviation will trigger a punitive response, they will either follow the protocol even when it does not fit (producing harm through rigid compliance) or deviate and hide the deviation (producing invisible workarounds that cannot be studied or improved). Neither outcome produces safety. A just culture signals that thoughtful adaptation is valued and that reckless deviation is not.
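The flexible-tools principle above, with its split between hard stops and soft alerts, can be sketched as a small classification rule. The type and field names are hypothetical, and the mapping uses only the four examples the text gives: wrong patient and wrong drug block administration, while unusual dose and unusual timing only raise an alert.

```python
from dataclasses import dataclass
from enum import Enum


class AlertLevel(Enum):
    NONE = "none"
    SOFT_ALERT = "soft alert"  # clinician can acknowledge and proceed
    HARD_STOP = "hard stop"    # administration is blocked


@dataclass(frozen=True)
class MedicationScan:
    """Hypothetical result of a bedside scan; field names are illustrative."""
    right_patient: bool
    right_drug: bool
    usual_dose: bool
    usual_timing: bool


def classify(scan: MedicationScan) -> AlertLevel:
    """Block only the errors the text names as hard stops; flag the rest."""
    if not (scan.right_patient and scan.right_drug):
        return AlertLevel.HARD_STOP
    if not (scan.usual_dose and scan.usual_timing):
        return AlertLevel.SOFT_ALERT
    return AlertLevel.NONE


unusual_dose = MedicationScan(True, True, usual_dose=False, usual_timing=True)
wrong_patient = MedicationScan(False, True, usual_dose=True, usual_timing=True)

print(classify(unusual_dose).value)   # soft alert
print(classify(wrong_patient).value)  # hard stop
```

Checking the hard-stop conditions first means a soft-alert condition never downgrades a genuine error. Everything short of a genuine error stays within the clinician's discretion.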

Warning Signs

Operators should watch for these indicators that resilience capacity is degrading:

Adaptation is invisible. No one asks how the overnight shift actually manages deteriorating patients. No debriefing occurs after successful saves. The workarounds that keep the system safe are unknown to management.
Compliance is the only metric. Protocol adherence is tracked, but adaptive capacity is not. A unit is judged safe because it follows procedures, with no assessment of whether those procedures fit the actual operating conditions.
Deviation is always punished. Any departure from protocol triggers corrective action, regardless of context or outcome. Clinicians learn to document compliance rather than practice adaptation.
Staffing is designed for average conditions. When staffing ratios assume normal patient acuity, normal census, and no simultaneous demands, there is no slack for adaptation when conditions deviate from normal — which they do routinely.
The gap between WAI and WAD is growing but unexamined. Procedures are updated without observing how work is actually done. Protocols are added without removing obsolete ones. The documentation burden increases while the time for clinical judgment decreases.

Integration Points

HF Module 7: Team Dynamics. The overnight scenario is not a story about two individuals — it is a story about a team. The nurses’ ability to redistribute tasks, communicate with the physician, and make joint decisions about escalation depends on shared mental models, practiced coordination, and the psychological safety to voice concerns and improvise without fear of blame. Crew Resource Management (CRM) principles from aviation — explicit role assignment, cross-monitoring, assertive communication — are the team-level mechanisms through which resilience operates. Without team-level competence, individual adaptive capacity is insufficient. A single nurse managing the deterioration alone, without a partner to redistribute load and provide backup, faces a fundamentally different — and far more dangerous — situation.

OR Module 6: Simulation. Simulation provides the controlled environment where resilience can be tested before it is needed. Tabletop exercises and in-situ simulations can model novel scenarios — the ones that fall outside existing protocols — and reveal whether the system has the adaptive capacity to respond. A simulation that introduces a deteriorating patient during a shift with reduced staffing tests not the protocol but the system’s ability to adapt when the protocol does not apply. Simulation also supports the learning cornerstone: debriefing after simulated scenarios makes adaptive expertise visible, shareable, and improvable. Without simulation, learning depends on waiting for real crises — a strategy that is both slow and dangerous.

Product Owner Lens

What is the human behavior problem? Healthcare systems invest almost exclusively in preventing known failures (Safety-I) while the adaptive behaviors that produce most of the actual safety (Safety-II) are invisible, unmeasured, and unsupported. When adaptive capacity degrades — through staffing cuts, rigid protocols, punitive culture, or tool inflexibility — the system becomes brittle without anyone noticing until a failure occurs that adaptation would have caught.

What cognitive mechanism explains it? Hollnagel’s resilience engineering framework: systems in complex environments maintain safety through four dynamic capabilities — responding, monitoring, learning, and anticipating. These capabilities require accurate situation awareness, flexible tools, organizational trust, and time for reflection. When any of these conditions is absent, adaptation fails and the system relies solely on barriers that cannot cover the full failure space.

What design lever improves it? Shift from pure compliance monitoring to resilience capacity monitoring. Support work-as-done by observing actual practice, not just auditing prescribed practice. Design tools that are flexible rather than rigid. Create organizational structures (debriefing, just culture, simulation programs) that make adaptive expertise visible and shareable.

What should software surface? Adaptation-condition indicators: real-time staffing relative to acuity (not just census), communication channel availability, equipment location and status, time-since-last-break for on-duty clinicians, and escalation path status (who is available, how fast can they respond). Debriefing prompts after high-acuity events — not incident reports, but structured reflection on what went right and what was difficult. Work-as-done capture: lightweight mechanisms for clinicians to note when and why they deviated from protocol, without triggering compliance review.

What metric reveals degradation earliest? The ratio of adaptive capacity to adaptive demand. Adaptive demand increases with patient acuity, census, simultaneous events, and unfamiliar situations. Adaptive capacity decreases with staffing reduction, fatigue, tool rigidity, and cultural fear of deviation. When demand consistently approaches capacity — even if no incidents occur — the system is operating without margin. The incidents that do not happen today are the ones that will happen when one more variable shifts. A system that tracks only incidents will see nothing until the margin is gone.
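A minimal sketch of how such a ratio might be tracked follows. It assumes capacity and demand have already been reduced to comparable scores per shift, which is a scoring problem the text leaves open. The scores, the floor of 1.2, and the three-shift rule are all invented for illustration.

```python
from typing import Sequence


def margin_ratio(capacity: float, demand: float) -> float:
    """Ratio of adaptive capacity to adaptive demand; higher means more margin."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    return capacity / demand


def shifts_without_margin(ratios: Sequence[float], floor: float) -> int:
    """Length of the run of most-recent consecutive shifts below `floor`."""
    run = 0
    for ratio in reversed(ratios):
        if ratio >= floor:
            break
        run += 1
    return run


# Invented (capacity, demand) scores per shift, oldest first.
shifts = [(10, 6), (10, 8), (9, 8.5), (8, 7.5), (8, 7.8)]
ratios = [margin_ratio(capacity, demand) for capacity, demand in shifts]

FLOOR = 1.2      # hypothetical: below this, the unit is treated as near its limit
ALERT_AFTER = 3  # hypothetical: consecutive shifts before raising a flag

run = shifts_without_margin(ratios, FLOOR)
print(f"{run} consecutive shifts below the margin floor; flag = {run >= ALERT_AFTER}")
```

In this invented series capacity exceeds demand on every shift, so an incident count would likely show nothing. The flag is still raised, because the last three shifts ran with almost no margin.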