Emergence Inside Neural Networks

In December 2021, a team at Anthropic published “A Mathematical Framework for Transformer Circuits.” The paper was not about what transformers can do. It was about what they are: what internal structures form when a transformer is trained, and whether those structures can be identified and understood.

The central finding was that transformers develop identifiable functional circuits — small subgraphs of attention heads and MLP layers that implement specific computational primitives. One of the most clearly characterized is the induction head: a two-layer attention pattern where the first head copies context from previous positions, and the second head uses that context to predict what comes next in repeated sequences. Induction heads were found consistently across models of different sizes and training distributions. They were not in the architecture specification; they formed during training, as a consequence of the training objective, in a way that is reproducible and to some degree predictable.
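The pattern an induction head implements can be written out as plain code rather than attention weights. The sketch below is illustrative only (real induction heads realize this with two attention operations, not an explicit loop): at each position, look back for an earlier occurrence of the current token and predict the token that followed it.

```python
# Illustrative sketch of the induction-head *pattern*, not the
# learned mechanism: prefix-match backwards, then copy forward.
def induction_predict(tokens):
    predictions = {}
    for i, tok in enumerate(tokens):
        for j in range(i - 1, -1, -1):          # scan back for a prior match
            if tokens[j] == tok and j + 1 < len(tokens):
                predictions[i] = tokens[j + 1]  # predict what followed last time
                break
    return predictions

# "A B C ... A" → at the second A, predict B
print(induction_predict(["A", "B", "C", "A"]))  # → {3: 'B'}
```

On repeated sequences this strategy is exactly what makes in-context copying cheap, which is why it shows up so consistently across trained models.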

This is emergence in the strict sense: a collective property arising from the interaction of many components that is not a property of any individual component. No single weight is an induction head. The induction head is a pattern in how weights interact across two layers.


Grokking: Discontinuous Reorganization

In January 2022, Alethea Power and colleagues at OpenAI published “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” They trained small transformers on modular arithmetic tasks — computing (a + b) mod p for various values of a, b, and p. The training sets were small enough that the models could memorize them perfectly. And they did: training accuracy reached 100% quickly, while validation accuracy remained near chance. Standard early stopping would have ended training here.
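The experimental setup can be reproduced in a few lines. The modulus and train fraction below are illustrative choices, not the paper's exact configuration (the paper sweeps both):

```python
import random

# Hypothetical reconstruction of the grokking task setup: every
# pair (a, b) labeled with (a + b) mod p, with a train split small
# enough to memorize outright.
p = 97
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
random.Random(0).shuffle(pairs)

train_frac = 0.4                      # illustrative; the paper varies this
split = int(train_frac * len(pairs))
train, val = pairs[:split], pairs[split:]

print(len(pairs), len(train), len(val))  # → 9409 3763 5646
```

A few thousand training triples is well within the memorization capacity of even a tiny transformer, which is what makes the eventual jump on the held-out pairs so striking.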

But Power and colleagues kept training. After ten to a hundred times more gradient steps than were needed to achieve perfect memorization, validation accuracy suddenly jumped — not gradually improving, but transitioning from near-zero to near-perfect within a few thousand steps. The model had grokked: it had reorganized its internal representations from a memorization strategy to a genuine generalization strategy, long after any conventional training criterion would have stopped.

The internal mechanism was subsequently investigated by Neel Nanda and colleagues, who found that the grokked model had learned to implement modular arithmetic using a specific algorithmic circuit based on Fourier features — essentially representing numbers as points on a circle and computing addition geometrically. The memorizing model had no such structure. The transition from memorization to generalization was a reorganization of the internal representation, not a smooth refinement of it.
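The circle representation works because adding angles is the same as multiplying points on the unit circle. The single-frequency sketch below is a simplification (the learned circuit spreads the computation over several frequencies), but it shows why the geometry computes modular addition exactly:

```python
import math

# Represent each residue x mod p as a point on the unit circle;
# modular addition becomes rotation (complex multiplication).
def encode(x, p):
    theta = 2 * math.pi * x / p
    return (math.cos(theta), math.sin(theta))

def rotate(u, v):
    # complex multiplication adds the angles of u and v
    return (u[0] * v[0] - u[1] * v[1], u[0] * v[1] + u[1] * v[0])

def decode(u, p):
    theta = math.atan2(u[1], u[0]) % (2 * math.pi)
    return round(theta * p / (2 * math.pi)) % p

# the trick is exact for every pair of residues
p = 113
assert all(
    decode(rotate(encode(a, p), encode(b, p)), p) == (a + b) % p
    for a in range(p) for b in range(p)
)
```

The memorizing model has no access to this structure; the grokked model, in effect, discovers it.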

This is a phase transition in representation space. The model’s behavior is discontinuous with respect to training time even though the training loss is not: loss decreases smoothly while internal structure reorganizes abruptly. The phenomenon has no analog in classical cellular automata (CA), where phase transitions are typically visible in the observable output. In neural networks, the phase transition can be invisible in the output (loss) while being dramatic in the internal structure.


Emergent Abilities at Scale

In 2022, Jason Wei and colleagues surveyed the capabilities of large language models across training scales and documented a pattern they called emergent abilities: sharp capability transitions where a task shows near-zero performance below a certain model scale and near-ceiling performance above it, with no gradual improvement in between.

The examples included multi-step arithmetic, logical reasoning tasks, and word-in-context disambiguation. On these tasks, models below roughly 10^22 training FLOPs perform at chance. Models above a threshold that varies by task perform dramatically better. The transition is abrupt when measured at the task level; it does not correspond to a visible change in training loss.

A subsequent methodological critique (Schaeffer et al., 2023) argued that some apparent emergent abilities are artifacts of discontinuous evaluation metrics: if you measure performance with an all-or-nothing accuracy metric, a model that gets the final token wrong scores 0 on a multi-step problem even if it got every earlier step right. The apparent phase transition in this case is a consequence of the metric, not the model. Smooth metrics like log-probability produce smoother capability curves.
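The arithmetic of the critique is easy to demonstrate. Suppose per-token accuracy improves smoothly with scale (the logistic curve below is invented for illustration, not fit to any model); an all-or-nothing exact-match metric over many tokens then compounds those small per-token gains into what looks like a sharp threshold:

```python
import math

# Hypothetical smooth per-token accuracy as a function of log10 FLOPs
def per_token_accuracy(log10_flops):
    return 1 / (1 + math.exp(-(log10_flops - 22)))

# All-or-nothing metric: every token in an n-token answer must be right
def exact_match(log10_flops, n_tokens=30):
    return per_token_accuracy(log10_flops) ** n_tokens

# per-token accuracy rises gently; exact match looks like emergence
for f in (20, 22, 24, 26, 28):
    print(f, round(per_token_accuracy(f), 3), round(exact_match(f), 3))
```

The same underlying improvement, scored two ways, yields one smooth curve and one apparent phase transition.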

The critique is correct for some specific cases but does not eliminate the phenomenon. The grokking results use continuous loss curves and show genuine discontinuous transitions in internal structure. Induction heads form suddenly during training with a measurable kink in the loss curve. At minimum, the mechanistic evidence for phase transitions in neural network training is robust, even where the behavioral evidence can be contested.


Why This Is Emergence

The property that makes neural network circuits genuinely emergent — rather than merely complex — is that the circuit is a collective property of the weights, not a property of any individual weight. You cannot identify a single weight that “implements” an induction head, any more than you can identify a single cell that “implements” a glider in Conway’s Life.

Superposition, another phenomenon documented by the Anthropic interpretability group, illustrates this. A layer of N neurons can represent more than N features simultaneously by distributing each feature across multiple neurons in a way that allows approximate recovery even when many features are active. The representation is holographic: any given neuron contributes to many features, and any given feature is encoded by many neurons. There is no single neuron you can point to and say “this neuron represents the concept of royalty.” The concept is a collective property of the layer.

This is not just technically true; it is the organizing principle of the layer’s function. The layer encodes more features than it has dimensions precisely because each feature is distributed. That’s the efficiency gain. The cost is exactly the interpretability problem: you can’t read the representations off by inspecting individual weights.
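A toy model makes the mechanism concrete. The dimensions and sparsity level below are invented for the example, not taken from the Anthropic work: assign each of many features a random direction in a smaller neuron space. Random high-dimensional unit vectors are nearly orthogonal, so a sparse set of active features can be superposed and then approximately read back out by projection:

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 256, 1024     # four features per neuron

# each feature gets a random unit direction; near-orthogonality
# keeps interference between distinct features small
directions = rng.standard_normal((n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# activate a sparse set of features; the layer state is their sum
active = rng.choice(n_features, size=3, replace=False)
layer = directions[active].sum(axis=0)

# approximate readout: project the layer state onto every direction;
# active features score near 1, inactive ones near 0
scores = directions @ layer
recovered = np.argsort(scores)[-3:]
assert set(recovered) == set(active)
```

The recovery is approximate and depends on sparsity: activate too many features at once and the interference terms swamp the signal, which is exactly the trade-off superposition negotiates.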

The analogy to CA is direct. A glider in Life is a property of a configuration of cells, not any individual cell. Changing any one cell in a glider changes or destroys it. The glider is real and identifiable, but it lives at the level of configuration rather than component. Neural circuits live at the level of weight interactions rather than individual weights. Understanding them requires the same kind of shift in analysis — from components to patterns of interaction — that studying emergent systems generally requires.


What Interpretability Can and Cannot Do

The mechanistic interpretability toolkit includes activation patching (intervene on an intermediate computation and measure downstream effects), attention visualization (look at which input positions attend to which other positions), and sparse dictionary learning (fit a sparse autoencoder to find the features a layer represents). Each tool gives partial information.

Activation patching can identify which components are causally necessary for a behavior — remove the induction head, and in-context learning degrades. It can’t tell you why the induction head formed or whether it is the optimal circuit for the task. Attention visualization can identify that a head attends to certain patterns; it can’t tell you what the model does with those attended values. Sparse autoencoders can identify features that a layer represents; the completeness and faithfulness of those features relative to the full computation is hard to assess.
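The logic of activation patching fits in a few lines. The two-stage function below is a stand-in for a network, with invented names; the point is the procedure, not the model: run on a corrupted input, overwrite one intermediate activation with its value from a clean run, and see whether the clean output comes back.

```python
import numpy as np

# Toy two-stage computation standing in for a network
def stage1(x):
    return np.tanh(x * 2.0)   # the intermediate "feature"

def stage2(h):
    return h.sum()            # downstream readout

def run(x, patch=None):
    h = stage1(x)
    if patch is not None:
        h = patch             # intervene: overwrite the activation
    return stage2(h)

clean, corrupted = np.array([1.0, 1.0]), np.array([0.0, 0.0])
baseline = run(corrupted)
patched = run(corrupted, patch=stage1(clean))

# restoring the clean activation restores the clean output:
# evidence that stage1's activation is causally responsible
assert patched == run(clean)
assert patched != baseline
```

Note what the experiment establishes and what it doesn't: it shows the activation is causally necessary for the behavior, but says nothing about why the network computes it or whether another circuit could have done the job.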

The field is young and moving quickly, but it is also running into fundamental limits. Some of those limits are computational: the interaction space of a 70-billion-parameter model is too large to search exhaustively. Some are theoretical: we don’t have a framework that predicts which circuits will form under which training conditions.

What remains unsolved is a theory of circuit formation — not just that induction heads exist, but what property of the training distribution and architecture causes them to form, what conditions make a circuit stable versus transient, and whether the circuits that form are optimal (in the sense of maximizing some objective) or accidental (in the sense of being one of many solutions that would have worked equally well). Without that theory, mechanistic interpretability is an empirical science cataloging observations, waiting for the framework that will explain what it has found.


Further Reading