Simulation Validation

A simulation that reproduces an observed pattern has demonstrated something, but what it has demonstrated is weaker than most modelers believe. It has shown that a mechanism can produce the pattern. It has not shown that the mechanism is producing the pattern in the target system. The distance between these two claims is where most invalid emergence reasoning lives.

Every canonical model in this framework is a simulation. The question is not whether simulations are useful — they are indispensable. The question is what standard of evidence a simulation must meet before its output can be treated as an explanation rather than an illustration. This page specifies that standard.


The Explanatory vs. Illustrative Distinction

An illustrative simulation shows that a mechanism is sufficient to produce a pattern. You write down local rules, run them, and observe that the macro behavior resembles the phenomenon you are trying to understand. The Schelling model produces residential segregation. The Boids model produces flocking. The SIR model produces epidemic curves. In each case, the simulation demonstrates sufficiency: these rules can produce this output.

An explanatory simulation goes further. It demonstrates that the mechanism’s predictions match observed data not just under passive observation but under intervention — that perturbing the mechanism in the model produces the same change in behavior that perturbing the mechanism in the real system produces. An explanatory simulation makes predictions about data it has not been tuned to fit, and those predictions hold.

The distinction matters because sufficiency is cheap. Many structurally different mechanisms produce the same macro pattern. Power-law degree distributions arise from preferential attachment, from fitness-based attachment, from random copying, and from optimization pressure. An epidemic curve can be produced by contagion dynamics, by independent threshold adoption, or by seasonal forcing. Showing that your mechanism can produce the pattern does not distinguish it from other mechanisms that also can — this is the underdetermination problem that emergence reasoning must confront directly.
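The degree-distribution case is easy to make concrete. Below is a minimal sketch (all function names, seeds, and parameters are illustrative): one growth process attaches new nodes in proportion to degree, a global rule; the other only picks a uniformly random node and sometimes follows a single link from it, a purely local rule. Both yield heavy-tailed degree sequences, so the tail alone cannot arbitrate between them.

```python
import random

def preferential(n, seed=7):
    """Global rule: a new node attaches to an existing node with prob proportional to degree."""
    rng = random.Random(seed)
    deg = [1, 1]
    endpoints = [0, 1]               # each edge contributes both endpoints
    for v in range(2, n):
        u = rng.choice(endpoints)    # uniform over endpoints = degree-biased choice
        deg.append(1)
        deg[u] += 1
        endpoints += [v, u]
    return deg

def copying(n, seed=11):
    """Local rule: pick a random node; attach to it, or to its parent half the time."""
    rng = random.Random(seed)
    parent = [None, 0]               # tree structure: each node remembers whom it attached to
    deg = [1, 1]
    for v in range(2, n):
        w = rng.randrange(v)
        u = parent[w] if (parent[w] is not None and rng.random() < 0.5) else w
        parent.append(u)
        deg.append(1)
        deg[u] += 1
    return deg

pa, cp = preferential(3000), copying(3000)
# both mechanisms yield heavy-tailed degrees (mean ~2, max far above it),
# so the degree distribution alone cannot distinguish them
```

Following a random node's link is implicitly degree-biased, which is why the local copying rule reproduces the same heavy tail as explicit preferential attachment without ever consulting a degree.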

The failure to maintain this distinction is the single most common methodological error in published emergence work. A simulation is built, its output is visually compared to the target phenomenon, the resemblance is noted, and the mechanism is declared explanatory. This is not validation. It is pattern matching with extra computational steps. The Transfer Claim Checklist exists in part to catch this error before it propagates.


Calibration

A simulation has parameters. A calibrated simulation derives those parameter values from the target system — from measurement, from independent data, from physical constraints. An uncalibrated simulation derives its parameter values from tuning: adjusting until the output looks right.

Tuning is not calibration. When you adjust parameters until the simulation output matches one dataset, you have fit a curve. You have not validated a mechanism. The simulation will reproduce the dataset you tuned it to — that is tautological. The question is whether it reproduces datasets it has not seen.

This is the simulation equivalent of p-hacking. A researcher with enough free parameters and a single target dataset will always find a combination that produces a match. The match is an artifact of the search process, not evidence for the mechanism. The corrective is the same one used in statistics: out-of-sample prediction. A calibrated reaction-diffusion model of animal coat patterning must predict the pattern wavelengths of species it was not tuned against. A calibrated traffic model must predict congestion dynamics on roads it was not fitted to.

The practical standard: state where every parameter value came from. If a parameter was measured, cite the measurement. If a parameter was estimated from independent data, describe the estimation procedure. If a parameter was tuned to fit the target output, say so — and acknowledge that the model’s ability to reproduce that output is not evidence for the mechanism, only evidence that the parameter space contains a point that works.
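The out-of-sample protocol itself fits in a few lines. Everything in this sketch is illustrative, and the "observed" series is synthetic, generated from the model itself, so the held-out check succeeds by construction; with real data, this is exactly the step where a merely tuned model fails.

```python
import math

def logistic(t, r, cap=1000.0, x0=1.0):
    """Hypothetical mechanism model: logistic growth with rate r."""
    return cap / (1.0 + (cap / x0 - 1.0) * math.exp(-r * t))

# synthetic "observed" series (generated here with r = 0.30 for the demo)
observed = [logistic(t, 0.30) for t in range(20)]
train, held_out = observed[:10], observed[10:]

# tune r against the training window ONLY (grid search on squared error)
best_r = min((i / 100 for i in range(5, 100)),
             key=lambda r: sum((logistic(t, r) - y) ** 2
                               for t, y in enumerate(train)))

# the real test: prediction error on data the parameter was never fitted to
oos_error = max(abs(logistic(t + 10, best_r) - y)
                for t, y in enumerate(held_out))
```

The discipline is in the split: the parameter sees only `train`, and the model's standing rests entirely on `oos_error`.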


Sensitivity Analysis

A model with calibrated parameters still requires interrogation. The question is: which parameters matter?

Parametric sensitivity asks what happens when each parameter varies within its measurement uncertainty. If moving a parameter by five percent changes the model output qualitatively — the emergent property appears or disappears, the phase transition shifts, the pattern switches from stripes to spots — the model is fragile. Its predictions depend on knowing that parameter to a precision that the target system may not support. This does not invalidate the model, but it constrains how much confidence the model’s output deserves: the prediction is conditional on a measurement that may not be available.
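A minimal sketch of such a probe, using a deterministic SIR model poised near its epidemic threshold (the parameter values are hypothetical): moving the transmission rate by five percent in either direction flips the qualitative outcome.

```python
def final_size(beta, gamma=1.0, i0=1e-4, dt=0.02, t_max=4000.0):
    """Deterministic SIR, Euler integration; returns the final recovered fraction."""
    s, i, r = 1.0 - i0, i0, 0.0
    t = 0.0
    while i > 1e-9 and t < t_max:
        new_inf = beta * s * i * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        t += dt
    return r

base_beta = 1.0                        # hypothetical calibrated value: R0 = beta/gamma = 1.0
high = final_size(base_beta * 1.05)    # +5 percent: supercritical, macroscopic outbreak
low = final_size(base_beta * 0.95)     # -5 percent: subcritical, the seed infection fizzles
```

Near the threshold, `high` is a macroscopic attack rate while `low` is negligible: a five-percent measurement error in the transmission rate changes not the size of the prediction but its kind.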

Structural sensitivity asks a harder question: what happens when the model class changes? If replacing a linear interaction term with a saturating one, or changing the network topology from a lattice to a random graph, or switching from synchronous to asynchronous update destroys the emergent behavior, then the behavior depends on a structural assumption — not just a parameter value. Structural sensitivity is more dangerous than parametric sensitivity because it means the model’s predictions depend on assumptions that cannot be tested by better measurement. They can only be tested by building a different model and checking whether the predictions survive.
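The synchronous-versus-asynchronous case can be shown with a toy not drawn from this page: a one-dimensional ring automaton in which each cell copies its left neighbor. Under synchronous update, a lone active cell is a persistent traveling structure; the same local rule applied as a sequential sweep erases it, because each cell now reads its neighbor's already-updated state.

```python
def step_sync(cells):
    # every cell simultaneously copies its left neighbor:
    # a lone 1 is a stable traveling structure
    n = len(cells)
    return [cells[(i - 1) % n] for i in range(n)]

def step_sequential(cells):
    # the SAME local rule applied in a left-to-right sweep:
    # each cell reads its neighbor's already-updated value
    out = cells[:]
    for i in range(len(out)):
        out[i] = out[(i - 1) % len(out)]
    return out

ring = [0] * 20
ring[5] = 1

sync = ring
for _ in range(50):
    sync = step_sync(sync)
seq = step_sequential(ring)

print(sum(sync), sum(seq))  # 1 0: the structure survives sync update, vanishes in one sweep
```

Nothing about the local rule changed; only the update scheme did, and the emergent structure depends on it. No amount of parameter measurement would have revealed that dependence.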

The discipline: report both. State which parameters the model is sensitive to and what precision is required. State which structural assumptions the model depends on and whether those assumptions hold in the target system. A model that reports only its robust predictions and conceals its fragile ones is not being honest about what it knows.


Ablation Studies

Ablation is the simulation equivalent of a knockout experiment. You remove one mechanism at a time and observe what happens to the emergent behavior.

If the Boids model still produces coherent flocking after you remove the alignment rule, then alignment is not necessary for flocking in that model. The claim that flocking requires alignment would be falsified by the ablation — the behavior persists without the mechanism. Conversely, if removing alignment destroys flocking while removing separation or cohesion does not, then alignment is the load-bearing mechanism: it is necessary, and the other rules are contributing but not essential.

Ablation answers a question that observation alone cannot: is this mechanism necessary, or is it merely present? A simulation may contain five interacting mechanisms, any three of which are sufficient to produce the target behavior. Without ablation, you cannot determine which mechanisms are doing the causal work and which are passengers.

The procedure is straightforward. Start with the full model. Remove one mechanism. Run the simulation. Record whether the emergent property persists, degrades, or vanishes. Repeat for each mechanism. Then remove pairs to test for interactions between mechanisms — cases where neither mechanism alone is necessary but at least one of the pair must be present.
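The procedure can be sketched as a harness over a model with switchable mechanisms. Boids would require spatial bookkeeping, so this sketch uses a mean-field coupled-oscillator toy (Kuramoto-style) with three mechanisms (coupling, frequency heterogeneity, dynamical noise) and phase synchronization, measured by the order parameter r, as the emergent property. All names and parameter values are hypothetical.

```python
import itertools
import math
import random

def order_parameter(phases):
    """Magnitude of the mean phase vector: ~1 = synchronized, ~0 = incoherent."""
    x = sum(math.cos(p) for p in phases) / len(phases)
    y = sum(math.sin(p) for p in phases) / len(phases)
    return math.hypot(x, y)

def run(disabled=frozenset(), n=60, steps=2000, dt=0.05, seed=1):
    """Mean-field coupled oscillators with three switchable mechanisms."""
    rng = random.Random(seed)
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    omega = ([0.0] * n if "heterogeneity" in disabled
             else [rng.uniform(-0.5, 0.5) for _ in range(n)])
    coupling = 0.0 if "coupling" in disabled else 3.0
    noise = 0.0 if "noise" in disabled else 0.05
    for _ in range(steps):
        mx = sum(math.cos(t) for t in theta) / n
        my = sum(math.sin(t) for t in theta) / n
        r, psi = math.hypot(mx, my), math.atan2(my, mx)
        theta = [t + dt * (w + coupling * r * math.sin(psi - t))
                   + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
                 for t, w in zip(theta, omega)]
    return order_parameter(theta)

mechanisms = ["coupling", "heterogeneity", "noise"]
print(f"full model: r = {run():.2f}")
for m in mechanisms:                                   # single knockouts
    print(f"without {m}: r = {run(disabled=frozenset({m})):.2f}")
for pair in itertools.combinations(mechanisms, 2):     # pair knockouts
    print(f"without {pair}: r = {run(disabled=frozenset(pair)):.2f}")
```

With these values, only the knockouts that include coupling collapse r: in this toy, coupling is the load-bearing mechanism, and heterogeneity and noise are passengers.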

Ablation is not optional. A model that claims a mechanism is responsible for an emergent property but does not demonstrate that removing the mechanism removes the property has not supported the claim. It has described a correlation between the mechanism’s presence and the property’s presence, which is the starting point for investigation, not the conclusion.


When Models Are Honest About Their Limits

A validated model is not a model that has passed every test. It is a model that has stated clearly which tests it passes, which tests it fails, and which tests it has not yet been subjected to.

State the violated assumptions. Every model makes assumptions that are known to be false in the target system. The Ising model assumes a regular lattice; real ferromagnets have defects. The SIR model assumes homogeneous mixing; real populations have structured contact networks. The Schelling model assumes costless relocation; real housing markets have transaction costs, credit constraints, and institutional discrimination. These violations are not disqualifying — models are useful precisely because they abstract away details. But the violations must be named, because they define the boundary of the model’s applicability. A model applied beyond that boundary is not being used; it is being misused.

State the failed predictions. A model that reports only the data it matches is an advertisement. Every model gets something wrong: the tails of a distribution, the transient dynamics before equilibrium, the behavior near a boundary condition, the response to a perturbation that was not in the training set. Reporting these failures is not a sign of weakness. It is the information that tells the next researcher where the model needs improvement and where its explanatory reach ends.

State the untested predictions. A model makes more predictions than any single study can test. The predictions that have not been tested are the model’s open claims — they are where the model is most vulnerable and most valuable. Listing them explicitly converts a static model into a research program: here is what the model says will happen, here is where you should look to confirm or refute it.

A model that follows this discipline — honest calibration, sensitivity analysis, ablation, and transparent reporting of limits — has earned the right to call its output explanatory. A model that skips these steps has produced an illustration. The illustration may be vivid, may be published, may be widely cited. It is still not an explanation.


Further Reading