Incentive Gaming: When Metrics Become Targets
Module 8: Adversarial and Malicious Behavior | Depth: Foundation | Target: ~2,500 words
Thesis: When a measure becomes a target, it ceases to be a good measure (Goodhart’s Law) — and healthcare is saturated with metrics that are being optimized at the expense of the outcomes they were designed to track.
The Operational Problem
In 2006, CMS began publicly reporting hospital door-to-balloon times for ST-elevation myocardial infarction (STEMI) patients — the interval between arrival at the emergency department and the opening of the blocked coronary artery via percutaneous coronary intervention. The clinical rationale was sound: every minute of delay increases myocardial damage and mortality. The metric was well-defined, clinically meaningful, and directly tied to patient outcomes. The target was 90 minutes.
Hospitals responded. Door-to-balloon times fell dramatically across the United States — a quality improvement success story cited in hundreds of conference presentations and journal articles. But something else happened that the metric did not capture. Emergency medical services began holding STEMI patients in ambulances in the hospital parking lot while the catheterization team was being mobilized. The clock did not start until the patient crossed the emergency department threshold. By managing the moment of “arrival,” hospitals could report shorter door-to-balloon times without actually treating patients faster. The patient’s total ischemic time — the thing that actually kills myocardium — was unchanged or, in some cases, longer. The metric improved. The outcome it was designed to track did not.
This is not fraud. No regulation was violated. The metric was defined as starting at ED arrival, and the hospitals controlled when ED arrival occurred. This is gaming — the rational exploitation of the gap between what a metric measures and what it was intended to represent. The distinction matters because gaming cannot be solved by enforcement. It can only be solved by better metric design, and better metric design requires understanding the behavioral mechanism that produces gaming in the first place.
The Formal Framework: Goodhart’s Law and Campbell’s Law
Two independently formulated laws describe the same phenomenon from different angles.
Goodhart’s Law, originally stated by the British economist Charles Goodhart (1975) in the context of monetary policy: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” Goodhart’s observation was that the Bank of England’s monetary targets became unreliable predictors of economic activity once those targets were used as policy instruments. The act of managing to the metric changed the relationship between the metric and the underlying phenomenon it was supposed to track. Strathern (1997) later restated this more crisply: “When a measure becomes a target, it ceases to be a good measure.”
Campbell’s Law, formulated by the social psychologist Donald Campbell (1979): “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.” Campbell was studying the effects of high-stakes testing in education, but the law is universal. The critical addition over Goodhart is the word corrupt — Campbell recognized that optimization pressure does not merely degrade the statistical relationship between the metric and the outcome. It actively distorts the underlying behavior, redirecting effort from the outcome toward the metric.
The mechanism connecting these laws is straightforward. When a metric carries consequences — reimbursement, public reputation, regulatory action, continued funding — every rational actor in the system faces an optimization problem: maximize the metric. If the metric perfectly captures the outcome, optimizing the metric and optimizing the outcome are identical. But no metric perfectly captures any outcome. There is always a gap between the proxy (what is measured) and the construct (what matters). Optimization pressure exploits that gap. Effort flows toward the measurable proxy and away from the unmeasured dimensions of the outcome. The greater the stakes attached to the metric, the wider the gap becomes, because the incentive to find and exploit the gap increases with the consequences.
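This mechanism can be made concrete with a toy optimization model. The sketch below is purely illustrative: the functional forms (`allocate_effort`, the square-root outcome function) are assumptions chosen to show the shape of the dynamic, not empirical estimates. An agent splits a fixed effort budget between a measured proxy and unmeasured work; as stakes rise, effort shifts toward the proxy, and the metric improves while the true outcome, which requires both kinds of work, decays.

```python
# Toy model of Goodhart's Law (all functional forms are illustrative
# assumptions, not empirical estimates): an agent splits a fixed effort
# budget between a measured proxy and unmeasured work. As the stakes
# attached to the metric rise, effort shifts toward the proxy; the
# metric improves while the true outcome decays.

def allocate_effort(stakes):
    """Share of effort directed at the measured proxy; grows with stakes."""
    return (1.0 + stakes) / (2.0 + stakes)  # 0.5 at zero stakes, -> 1.0

def metric(proxy_effort):
    """The metric observes only proxy-directed effort."""
    return proxy_effort

def outcome(proxy_effort):
    """The true outcome requires both measured and unmeasured effort."""
    return (proxy_effort * (1.0 - proxy_effort)) ** 0.5

for stakes in (0.0, 2.0, 8.0, 18.0):
    p = allocate_effort(stakes)
    print(f"stakes={stakes:>4}: metric={metric(p):.2f}  outcome={outcome(p):.2f}")
```

At zero stakes the agent splits effort evenly and the outcome is at its maximum; as stakes grow, the metric climbs monotonically while the outcome falls. That divergence is the proxy-construct gap widening under pressure.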
This is not a story about bad actors. It is a story about rational responses to measurement systems that confuse the proxy for the construct. The same clinicians and administrators who game metrics are, in most cases, genuinely committed to patient care. They game because the system has created a situation where the measurable proxy diverges from the actual outcome, and the consequences are attached to the proxy. Given that structure, gaming is the expected behavior. Expecting otherwise requires assuming that people will systematically ignore the incentive system they operate within — an assumption that no serious model of human behavior supports.
The Four Types of Gaming
Gaming is not monolithic. It manifests through at least four distinct mechanisms, each exploiting a different weakness in metric design.
Cherry-picking is the selective inclusion or exclusion of cases to improve the measured result. When surgical mortality rates are publicly reported, surgeons and hospitals face an incentive to avoid operating on the highest-risk patients — because those patients are most likely to die and worsen the reported metric. Werner and Asch (2005) documented this in cardiac surgery: hospitals with publicly reported mortality rates showed evidence of avoiding high-risk CABG patients who might have benefited from surgery. The metric improves — the reported mortality rate falls — but the unmeasured outcome is that high-risk patients who need surgery do not receive it, or are transferred to safety-net hospitals where their outcomes are worse and the risk is concentrated. Cherry-picking does not improve outcomes. It redistributes them, concentrating poor outcomes in institutions that cannot or will not select patients, while making selective institutions look better on paper.
Teaching to the test is the narrowing of effort to the specific activities captured by the metric, at the expense of related activities that are not measured. When CMS penalizes hospitals for 30-day readmission rates in heart failure, pneumonia, and acute myocardial infarction, the incentive is to invest heavily in post-discharge follow-up for those three conditions. This may improve outcomes for those conditions. But the resources — care coordinators, follow-up clinics, transitional care programs — are finite. The unmeasured conditions receive correspondingly less attention. The hospital optimizes on the measured set and neglects the unmeasured set, not because it does not care about other conditions, but because the penalty structure directs resources toward the penalized metrics. The system’s overall readmission performance may not improve; the composition shifts.
Threshold manipulation is the gaming of the boundary condition that determines whether a case counts. The readmission penalty applies only to patients who are formally admitted. Patients placed on “observation status” — even for stays of 24-48 hours involving identical clinical care — are not technically admitted and therefore cannot generate a readmission. The growth of observation stays in the years following the implementation of the Hospital Readmissions Reduction Program is well-documented: Feng et al. (2012) and subsequent analyses showed that observation stays increased substantially as readmission penalties took effect. The metric — 30-day readmission rate following admission — improved. Whether patients actually experienced fewer returns to acute care is a different question, because the definition of “admission” was being manipulated at the boundary.
Definitional gaming is the exploitation of ambiguity in how inputs to the metric are coded or classified. Risk-adjustment models for surgical mortality use diagnostic codes to estimate patient severity. When mortality rates carry consequences, hospitals have an incentive to code patients as sicker than they are — to inflate the denominator of the risk-adjustment equation, making the same number of deaths appear as a lower mortality rate. This is not fabricating diagnoses. It is systematically selecting the most severe plausible code from a set of legitimate options. The patient with borderline malnutrition is coded as “severe protein-calorie malnutrition.” The patient with mild chronic kidney disease is coded with the highest-stage code that the documentation can support. Each individual coding decision may be defensible. In aggregate, the pattern inflates risk scores, deflates risk-adjusted outcomes, and renders the metric unreliable for its intended purpose: comparing actual quality across institutions.
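The arithmetic of risk-coding inflation is worth making explicit. In the sketch below (patient counts and risk values are hypothetical), the observed death count is fixed; upcoding raises each patient's predicted risk, which inflates expected deaths and deflates the risk-adjusted observed-to-expected (O/E) ratio.

```python
# Arithmetic of risk-coding inflation (all numbers hypothetical): the
# observed death count is fixed, but upcoding raises each patient's
# predicted risk, inflating expected deaths and deflating the
# risk-adjusted observed-to-expected (O/E) ratio.

def oe_ratio(observed_deaths, predicted_risks):
    """Risk-adjusted mortality as an observed-to-expected ratio."""
    expected_deaths = sum(predicted_risks)
    return observed_deaths / expected_deaths

# 1,000 patients, 20 deaths, honest mean predicted risk of 2%.
honest_risks = [0.02] * 1000
# Aggressive-but-defensible coding pushes mean predicted risk to 2.5%.
upcoded_risks = [0.025] * 1000

print(f"honest O/E:  {oe_ratio(20, honest_risks):.2f}")   # 20/20 = 1.00
print(f"upcoded O/E: {oe_ratio(20, upcoded_risks):.2f}")  # 20/25 = 0.80
```

Same patients, same deaths, same care: the hospital's risk-adjusted mortality appears 20% better, purely through code selection at the margins of documentation.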
Why Healthcare Is Maximally Vulnerable
Not all systems are equally susceptible to gaming. Healthcare is maximally vulnerable because it combines four conditions that amplify Goodhart’s Law.
High-stakes metrics tied to money. CMS value-based purchasing, readmission penalties, and quality bonus programs attach millions of dollars in reimbursement to specific metrics. The Hospital Readmissions Reduction Program alone penalizes hospitals up to 3% of base Medicare payments. For a large hospital system, that can be $10-20 million annually. When the financial consequences are this large, the return on investment for gaming strategies — hiring consultants to optimize coding, restructuring observation protocols, investing in metric-targeted interventions rather than broadly effective ones — becomes substantial. The optimization pressure is proportional to the financial exposure.
Public reporting with reputational consequences. Hospital Compare, Leapfrog, and US News & World Report rankings create a second layer of incentive beyond direct reimbursement. Public reputation affects patient volume, physician recruitment, and board confidence. A hospital with a publicly visible high mortality rate or high readmission rate faces consequences that extend far beyond the CMS penalty. The reputational stakes amplify the financial stakes, creating dual pressure on the same metrics.
Regulatory action triggered by metric thresholds. State survey agencies, accreditation bodies, and CMS Conditions of Participation use quality metrics as triggers for regulatory scrutiny. Falling below a threshold can trigger a survey, a corrective action plan, or — in extreme cases — loss of Medicare certification. The consequences are existential: a hospital that loses Medicare certification will close. When the metric determines whether the organization survives, the optimization pressure is not merely financial. It is existential.
Complex, multi-dimensional outcomes measured by simple proxies. Healthcare outcomes are inherently complex. Quality of care involves clinical effectiveness, patient experience, safety, equity, access, and cost — dimensions that frequently trade off against each other. Reducing this complexity to a small number of reportable metrics creates large gaps between the proxy and the construct. Every gap is a gaming opportunity. The more complex the outcome and the simpler the metric, the more room there is for optimization that improves the metric without improving the outcome.
The Gaming-Fraud Distinction
Gaming is not fraud. The distinction is critical for both legal and operational purposes.
Fraud violates the rules as written. Billing for services not rendered, falsifying medical records, fabricating patient encounters — these are violations of law and regulation. Fraud is prosecutable. The intent is deception.
Gaming exploits the rules as written. Holding a patient in an ambulance to manage the door-to-balloon clock violates no regulation. Placing a returning patient on observation status instead of admitting them violates no rule — observation status is a legitimate clinical designation, and the clinician who assigns it may genuinely believe it is the appropriate classification. Coding a patient’s malnutrition as “severe” rather than “moderate” when the clinical evidence is ambiguous violates no coding regulation — the coder is selecting from a legitimate range of options.
The practical consequence of this distinction is that gaming cannot be eliminated by enforcement. You cannot prosecute a hospital for choosing observation status. You cannot fine a surgeon for declining to operate on a high-risk patient. You cannot penalize a coder for selecting the most severe defensible diagnosis code. Gaming lives in the discretionary space that every metric necessarily creates, and that discretionary space will always be exploited when the stakes are high enough. Bevan and Hood (2006), studying gaming in the UK National Health Service’s target regime, concluded that gaming is “a predictable and rational response to targets” and that the appropriate response is not stricter enforcement but better system design.
This does not mean gaming is harmless. Cherry-picking denies surgery to patients who need it. Observation status manipulation shifts costs to patients (who bear higher copays under observation). Risk-coding inflation degrades the data infrastructure that the entire quality measurement system depends on. Gaming produces real harm — but the harm flows from the metric design, not from the moral failings of the people responding to it.
Grant Milestone Gaming
The same dynamics operate outside clinical metrics. Grant-funded programs — including the healthcare transformation programs that CapabilityGraph is designed to support — are subject to milestone-based reporting that creates identical gaming incentives.
Consider a grant that requires the grantee to demonstrate "community engagement" by a specific reporting period. The intended outcome is meaningful partnership with community stakeholders that informs program design. The measured proxy is the number of community meetings held and the number of attendees documented. The gaming response: hold meetings designed to maximize attendance counts rather than meaningful engagement. Serve food. Choose convenient times. Invite existing organizational partners who will show up reliably. Document headcounts. The milestone is met. The underlying outcome — genuine community input shaping program direction — may or may not have occurred, and the metric cannot distinguish between the two.
This pattern recurs across grant deliverables: training completion rates that count seat-time without assessing competency, partnership agreements that are signed but never operationalized, sustainability plans that satisfy the reporting requirement but describe no plausible funding pathway. Muller (2018), in The Tyranny of Metrics, describes this as “metric fixation” — the institutional belief that measurement and accountability through metrics will improve performance, combined with the systematic failure to recognize that the metrics themselves are being gamed.
Metric Design Principles That Resist Gaming
If gaming is a rational response to metric design, then the defense against gaming is better metric design. Several principles reduce — though they can never eliminate — gaming vulnerability.
Outcome metrics over process metrics. Process metrics (door-to-balloon time, hand-hygiene compliance rates, screening completion rates) measure what is done. Outcome metrics (risk-adjusted mortality, patient-reported functional status, actual infection rates) measure what results. Process metrics are easier to game because they measure actions under the provider’s control. Outcome metrics are harder to game because they measure consequences that are less directly manipulable — though risk-coding inflation shows that no metric is immune. The principle is not that outcome metrics are ungameable, but that the effort required to game them is higher and the gap between the metric and the underlying outcome is smaller.
Composite metrics. A single metric invites single-dimensional optimization. A composite index that combines multiple measures — mortality, readmission, patient experience, process compliance, cost — is harder to game because improving one dimension at the expense of another worsens the composite score. CMS has moved in this direction with the Star Ratings and Total Performance Score, though the weighting of components within a composite creates its own gaming opportunities (optimizing the most heavily weighted component and neglecting the rest).
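The weighting problem can be sketched numerically. The weights and scores below are hypothetical, not CMS's actual Star Ratings formula: the point is that with uneven weights, gaming the heaviest component can lift the composite even while every other dimension slips.

```python
# Sketch of the weighting vulnerability in composite metrics (weights and
# scores are hypothetical, not any actual CMS formula): with uneven
# weights, improving only the heaviest component can raise the composite
# even as every other dimension declines.

def composite(scores, weights):
    """Weighted composite score; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(scores[k] * weights[k] for k in weights)

weights = {"mortality": 0.5, "readmission": 0.2,
           "experience": 0.2, "cost": 0.1}

baseline = {"mortality": 0.70, "readmission": 0.70,
            "experience": 0.70, "cost": 0.70}
# Pour resources into the 50%-weighted component; let the rest decay.
gamed = {"mortality": 0.90, "readmission": 0.65,
         "experience": 0.65, "cost": 0.65}

print(f"baseline composite: {composite(baseline, weights):.3f}")
print(f"gamed composite:    {composite(gamed, weights):.3f}")
```

The gamed profile scores higher overall despite being worse on three of four dimensions, which is exactly the single-component optimization the paragraph above warns about.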
Blinded measurement. When the entity being measured controls the data that generates the metric, gaming is structurally enabled. External measurement — chart audits by independent reviewers, patient surveys administered by third parties, clinical outcomes assessed by registries that the hospital does not control — reduces the opportunity to manipulate inputs. The Hospital Consumer Assessment of Healthcare Providers and Systems (HCAHPS) survey is administered by approved vendors, not by the hospitals being rated, specifically to limit response manipulation.
Audit mechanisms with unpredictable sampling. Regular, predictable audits become part of the optimization landscape — organizations prepare for audits the way students prepare for exams, concentrating performance around the audit period. Unpredictable sampling — random chart reviews, unannounced site visits, statistical outlier detection that triggers targeted investigation — raises the cost of gaming by making it impossible to predict when the gaming will be detected. The probability of detection does not need to be high to deter gaming; it needs to be nonzero and unpredictable.
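The "nonzero and unpredictable" claim follows from simple probability: a small per-audit detection chance compounds across independent random audit rounds. The per-audit probabilities below are illustrative.

```python
# Detection probability under unpredictable random sampling (illustrative
# probabilities): even a small per-audit detection chance compounds
# across independent audit rounds, so sustained gaming is eventually
# caught — provided the timing cannot be anticipated and prepared for.

def p_detect(per_audit_prob, n_audits):
    """P(gaming detected at least once) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_audit_prob) ** n_audits

for p in (0.05, 0.10):
    for n in (4, 12, 24):
        print(f"per-audit p={p:.2f}, audits={n}: P(detect) = {p_detect(p, n):.2f}")
```

Even at a 5% per-audit detection chance, two years of monthly random audits make eventual detection more likely than not — which is why unpredictability, not audit intensity, carries the deterrent weight.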
Metric rotation and evolution. A fixed metric set invites fixed gaming strategies. When the metrics change — new measures added, old measures retired, definitions updated — gaming strategies that were optimized for the old metrics become obsolete. CMS’s periodic updates to quality measure specifications serve this function, though the update cycle (typically annual) is slow enough that gaming strategies can adapt.
Red-teaming before deployment. Before any high-stakes metric is implemented, the organization should ask: “If a competent, motivated team were trying to maximize this metric without improving the underlying outcome, how would they do it?” This is adversarial design — the metric equivalent of penetration testing. If the red team can identify gaming strategies, the metric should be redesigned before it carries consequences. This step is almost never performed in healthcare quality measurement, and the gaming that follows is therefore predictable.
Warning Signs
These indicators suggest that gaming is occurring or that a metric system is vulnerable to it:
- Metric performance improves but correlated outcomes do not — door-to-balloon times fall but STEMI mortality does not; readmission rates fall but ED return visits rise
- Sudden improvement coinciding with policy change rather than clinical change — a metric jumps when a penalty is announced, not when a new clinical intervention is deployed
- Coding intensity increases without corresponding clinical change — average case-mix index rises without changes in patient population or clinical practice
- Observation stay volume increases as readmission penalties increase — the two metrics are mechanistically linked by threshold manipulation
- Milestone reports describe activities rather than outcomes — “we held 12 meetings” rather than “community input changed X about the program design”
- Frontline staff describe the metric as a compliance exercise — “we do this for the numbers” rather than “we do this because it helps patients”
- Gaming strategies are common knowledge but unacknowledged in official reporting — the gap between what staff describe informally and what the organization reports formally
Integration Points
Public Finance Module 3 (Compliance and Control). Gaming is the behavioral mechanism that undermines compliance systems from within. Compliance regimes assume that measurement generates accountability, and that accountability improves performance. Campbell’s Law shows that this chain breaks when optimization pressure is applied to the measurement itself. The compliance controls described in Public Finance M3 — audit, reporting, financial controls — work only when the metrics they rely on are resistant to gaming. When gaming corrupts the metrics, the compliance system is operating on distorted data and producing distorted conclusions. A compliance system that reports excellent performance on gamed metrics is not a functioning compliance system — it is a gaming amplifier, because it rewards the organizations that game most effectively. The gaming analysis in this module identifies where compliance controls will fail; the compliance framework in Public Finance M3 identifies which controls are needed. Neither is complete without the other.
Public Finance Module 7 (Policy and Incentives). Policy creates the incentive structures that generate gaming pressure. The decision to attach reimbursement to a metric, to publicly report a quality score, or to condition regulatory action on a threshold — these are policy choices that determine the magnitude of optimization pressure applied to each metric. Public Finance M7 addresses how policies are designed and their intended effects; this module addresses the unintended behavioral responses that policies produce. The connection is direct: every policy incentive described in M7 should be evaluated through the gaming lens described here. What are the four types of gaming that this incentive might produce? How large is the gap between the metric and the outcome? What is the cost-benefit ratio of gaming versus genuine improvement? If the answers suggest that gaming is easier than improvement, the policy needs redesign before implementation — not enforcement after gaming is discovered.
Product Owner Lens
What is the human behavior problem? When metrics carry high-stakes consequences — reimbursement, public reputation, regulatory action, continued funding — rational actors optimize the metric rather than the outcome. The gap between the proxy (what is measured) and the construct (what matters) becomes the target of optimization, and the metric becomes progressively less reliable as an indicator of actual performance.
What cognitive or social mechanism explains it? Goodhart’s Law (1975) and Campbell’s Law (1979) describe the mechanism: optimization pressure on a proxy metric distorts behavior away from the underlying outcome. The four gaming types — cherry-picking, teaching to the test, threshold manipulation, and definitional gaming — are the behavioral channels through which this distortion operates. The mechanism is not cognitive bias (people know what they are doing) but rational response to incentive structure (the system rewards metric performance, so metric performance is what is produced).
What design lever improves it? Metric design that minimizes the gap between proxy and construct: outcome metrics over process metrics, composite indices over single measures, blinded external measurement, unpredictable audit sampling, metric rotation, and adversarial red-teaming before deployment. No single lever eliminates gaming; the combination raises the cost of gaming relative to the cost of genuine improvement.
What should software surface? Divergence between correlated metrics — when a process metric improves but its associated outcome metric does not, the divergence is a gaming signal. Coding intensity trends overlaid with clinical volume trends — rising case-mix index without population change suggests definitional gaming. Observation-to-admission ratio trends correlated with penalty exposure. Milestone completion patterns — deliverables that cluster at reporting deadlines rather than distributing across the performance period suggest compliance-oriented rather than outcome-oriented work. Statistical outlier detection: institutions whose reported performance is statistically improbable given their patient mix and resources.
What metric reveals degradation earliest? The correlation coefficient between a process metric and its associated outcome metric over time. When this correlation weakens — when door-to-balloon time and STEMI mortality decouple, when hand-hygiene compliance rates and infection rates decouple, when readmission rates and ED return visit rates decouple — the process metric is being gamed. The weakening correlation is the earliest quantitative signal that the metric has become a target and is ceasing to be a good measure.