Automated Rule Discovery
In August 2024, Sakana AI released “The AI Scientist,” a fully automated research pipeline capable of generating novel hypotheses, writing code to test them, running experiments, analyzing results, and producing complete scientific papers — including the prose, figures, and references — without human intervention at any step. The system ran on a budget of roughly $15 per paper. Some of the outputs were genuinely novel. Some were wrong. The system itself could not tell the difference.
The AI Scientist represents the sharp end of a trend that has been building for several years: using large language models and simulation pipelines to explore the space of possible rules in emergent systems. The approach is technically productive and scientifically ambiguous in ways that are worth understanding precisely.
The Workflow
The architecture is consistent across implementations, whether the goal is generating CA rules, agent interaction policies, or physical parameters. A generative model — typically an LLM or an evolutionary algorithm — proposes a candidate rule set. A simulator evaluates it: the rule runs forward in time, and the system records what happens. A fitness function scores the outcome against some target property (stability, complexity, self-replication, novelty relative to previously seen patterns). The scored results feed back into the generative model, either as fine-tuning signal or as few-shot examples in the next prompt. The loop runs at scale, evaluating thousands of candidates per hour.
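As a concrete sketch, the loop above fits in a few dozen lines. The version below searches the tiny space of 256 elementary (one-dimensional, two-state) CA rules so that it is fully runnable; the function names and the distinct-row fitness are illustrative stand-ins, not any published system's API.

```python
import random

def step(row, rule):
    """One update of an elementary CA with periodic boundaries."""
    n = len(row)
    return [(rule >> (row[i - 1] << 2 | row[i] << 1 | row[(i + 1) % n])) & 1
            for i in range(n)]

def fitness(rule, width=64, steps=64):
    """Crude proxy score: number of distinct rows the rule generates
    from a single seed cell. Fixed points score low, rich dynamics high."""
    row = [0] * width
    row[width // 2] = 1
    seen = set()
    for _ in range(steps):
        seen.add(tuple(row))
        row = step(row, rule)
    return len(seen)

def propose(history):
    """Mutate the best rule seen so far, or sample a fresh one."""
    if history and random.random() < 0.7:
        best = max(history, key=history.get)
        return best ^ (1 << random.randrange(8))  # flip one table entry
    return random.randrange(256)

def search(budget=500):
    """The propose-simulate-score loop: scored results feed the proposer."""
    history = {}
    for _ in range(budget):
        rule = propose(history)
        history.setdefault(rule, fitness(rule))
    return max(history, key=history.get)
```

Swapping an LLM or evolutionary operator into `propose`, a real simulator into `fitness`, and a much larger rule encoding recovers the pipeline shape described above.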
Applied to cellular automata specifically, this means the system can explore rule spaces that are far too large for manual search. The outer totalistic rules that generalize Conway’s Life — where birth and survival can each be triggered by any subset of the nine possible neighbor counts — constitute 2^18 = 262,144 distinct rule tables. The full space of two-state, range-1 rules on a 2D grid, where each of the 2^9 neighborhood configurations maps independently to an output state, runs to 2^512. Going to three states and asymmetric neighborhoods produces spaces astronomically larger still. Exhaustive human exploration of any of these is not feasible. Systematic automated exploration is.
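The counts quoted above follow directly from the structure of the rule tables:

```python
# Life-like (outer totalistic) rules: birth and survival are each an
# arbitrary subset of the nine neighbor counts 0..8, giving 18 free bits.
life_like_rules = 2 ** (9 + 9)
assert life_like_rules == 262_144        # the 2^18 family

# Full two-state range-1 rules: each of the 2^9 = 512 possible 3x3
# neighborhood configurations maps independently to a 0 or 1 output.
full_range1_rules = 2 ** (2 ** 9)
assert full_range1_rules == 2 ** 512     # a 155-digit number of rules
```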
Schmidhuber’s curiosity-driven search provides one of the cleaner theoretical frames for how the fitness function can be designed. A “curious” agent maximizes the compression gain it can achieve by learning from new observations — it seeks states that are neither fully predictable (boring) nor fully random (incompressible), but somewhere between: states where a world model can improve. Applied to CA rule spaces, curiosity-driven fitness assigns high scores to rules where the simulator’s behavior model is most surprised, pushing the search toward edges of the known.
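A toy version of compression-gain scoring can be built from an off-the-shelf compressor standing in for the agent's world model. This is a caricature of the actual formulation, which involves an adaptive predictor improving over time, and both function names here are illustrative:

```python
import zlib

def conditional_size(corpus: bytes, obs: bytes) -> int:
    """Bytes needed to encode obs given corpus as shared context:
    a crude stand-in for conditional description length."""
    return len(zlib.compress(corpus + obs)) - len(zlib.compress(corpus))

def compression_gain(corpus: bytes, obs: bytes) -> int:
    """Curiosity proxy in the spirit of compression progress: how much
    does prior experience shrink the encoding of obs? Large for
    learnable structure, roughly zero for incompressible noise."""
    return len(zlib.compress(obs)) - conditional_size(corpus, obs)
```

Structure the "model" has already seen encodes almost for free, so it yields a large gain; noise costs nearly as much to encode with context as without, so its gain is small. That asymmetry is what the curiosity objective wants.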
What Automated Search Actually Finds
The results are not uniformly trivial. Automated searches have discovered Life-like rules supporting glider-class objects that were not previously documented, multi-state rules with self-replication properties, and parameter regimes in continuous CAs (Lenia and related systems) where organisms with stable internal structure emerge spontaneously. The search covers ground that would take a human researcher years to survey by hand.
The problem is that the system does not know what it has found.
When an automated pipeline discovers a glider in a previously unexplored rule table, the glider is recorded, scored (it moved, it persisted, it is novel relative to the training distribution), and filed. What the system cannot determine is whether this glider is mathematically interesting in the way that Life’s glider is interesting — whether it participates in universality, whether it can be composed to build logic gates, whether it is an isolated attractor or a representative of a broader class of self-propagating structures. Those judgments require domain knowledge that is not in the fitness function.
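The mechanical part of that record — it moved, it persisted — is straightforward to test: a pattern is glider-like if the grid state recurs, translated, after some period. A minimal detector for Conway's Life, with function names of my own invention:

```python
def life_step(grid):
    """One step of Conway's Life on a toroidal grid (list of 0/1 rows)."""
    h, w = len(grid), len(grid[0])
    nxt = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(grid[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            nxt[y][x] = 1 if n == 3 or (grid[y][x] and n == 2) else 0
    return nxt

def shift(grid, dy, dx):
    """Translate the grid contents by (dy, dx) with wraparound."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]

def find_displacement(grid, max_period=8):
    """Return (period, dy, dx) if the pattern recurs translated, else None."""
    if not any(cell for row in grid for cell in row):
        return None                       # empty grid: nothing to track
    history = [grid]
    for _ in range(max_period):
        history.append(life_step(history[-1]))
    for p in range(1, max_period + 1):
        for dy in range(-p, p + 1):
            for dx in range(-p, p + 1):
                if (dy, dx) != (0, 0) and history[p] == shift(grid, dy, dx):
                    return p, dy, dx
    return None
```

On the standard glider this returns period 4 with displacement (1, 1). Nothing in the test says whether those gliders can be composed into anything, which is exactly the gap described above.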
This is the “novelty without insight” problem stated concretely. Automated discovery is good at finding patterns that score well on measurable proxies. It is not good at identifying which patterns are windows onto something deeper. A researcher with domain knowledge can look at a glider in a new rule and ask: does this rule support glider collisions? Can those collisions be used for computation? Automated systems currently have no way to ask those questions because the questions require a conceptual framework that the fitness function doesn’t encode.
The Sakana AI Scientist ran into a version of this problem in a different domain: it could generate novel claims and test them empirically, but it could not evaluate whether a finding was significant in the context of a research field’s existing open questions. Some of its papers contained errors that a domain expert would have caught immediately. The pipeline produced outputs that looked like research papers; some were research papers in a meaningful sense; distinguishing one from the other required exactly the kind of judgment the system was designed to bypass.
Why the Search Space Is Hard
Langton’s lambda parameter — the fraction of rule-table transitions that lead to non-quiescent states — provides a rough map of the CA rule space. Near lambda = 0, most rules produce Class I behavior: blank grids, fixed points, immediate death. Near lambda = 1, most rules produce Class III behavior: chaotic activity with no persistent structure. The interesting rules — Wolfram’s Class IV, which produce complex localized structures and can support universal computation — are rare, clustered in a narrow band near the “edge of chaos.”
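Lambda itself is trivial to compute from a rule table; for elementary CAs the table has only eight entries. Rule 110, a known Class IV rule, lands above the midpoint:

```python
def langton_lambda(rule_table, quiescent=0):
    """Fraction of transitions whose output is not the quiescent state."""
    return sum(out != quiescent for out in rule_table) / len(rule_table)

# Elementary CA rule tables have 2^3 = 8 entries; unpack Wolfram's
# rule 110 from its number (bit i is the output for neighborhood i).
rule_110 = [(110 >> i) & 1 for i in range(8)]
print(langton_lambda(rule_110))   # 0.625
```

The point of the parameter is not the number for any single rule but the statistical map: sweeping lambda across a large rule sample traces the ordered-to-chaotic transition.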
This is why exhaustive search is not enough and why smart search is hard. The vast majority of the rule space is boring, and the fitness function has to distinguish “complex in an interesting way” from “random in a way that scores as complex.” Naive surprise-based signals saturate on randomness: random rules are always hard to predict, but they are not interesting. Compression-progress formulations avoid that failure in principle, though the progress signal is difficult to estimate reliably in practice. A raw complexity score such as Lempel-Ziv is high for a random pattern and can be high for Life’s glider gun as well, but for completely different reasons.
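The saturation problem is easy to demonstrate: a compressed-length proxy happily rates a chaotic Class III rule as "complex." Below, the space-time diagram of Wolfram's rule 30 (chaotic) compresses far worse than that of rule 250 (ordered), and nothing in the ratio says whether the complexity is interesting:

```python
import zlib

def eca_step(row, rule):
    """One step of an elementary CA with periodic boundaries."""
    n = len(row)
    return [(rule >> (row[i - 1] << 2 | row[i] << 1 | row[(i + 1) % n])) & 1
            for i in range(n)]

def compress_ratio(rule, width=256, steps=120):
    """Compressed size of the space-time diagram over its raw size."""
    row = [0] * width
    row[width // 2] = 1
    diagram = bytearray()
    for _ in range(steps):
        diagram.extend(row)
        row = eca_step(row, rule)
    return len(zlib.compress(bytes(diagram))) / len(diagram)

# Chaotic rule 30 resists compression more than ordered rule 250 does.
print(compress_ratio(30) > compress_ratio(250))
```

A fitness function built on this ratio alone would steer the search toward Class III noise, not Class IV structure.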
The Open Problem: Automated Significance Scoring
Kolmogorov complexity has been proposed as a proxy for interestingness in this setting. The idea: patterns that resist compression are surprising in a meaningful sense — they contain structure that a simple model cannot capture. A truly random pattern has high Kolmogorov complexity, but so can a pattern with rich internal structure. The difference is that a random pattern is complex because it has no structure at all, while an interesting pattern is complex because the structure it does have is irreducible.
In principle, Kolmogorov complexity could be used to distinguish these cases: a random pattern should have complexity close to its length, while an interesting pattern should have complexity somewhere in the middle — compressible relative to randomness, incompressible relative to simple periodic patterns. Compressed length measures (using real-world compressors like zstd or bz2 as approximations to Kolmogorov complexity) are computationally tractable and have been used in preliminary experiments.
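A sketch of that mid-band test, with bz2 from the standard library as the compressor. The band thresholds are arbitrary illustrative choices, and bits are packed into bytes so the compressor sees real entropy rather than one bit per byte:

```python
import bz2
import random

def pack(bits):
    """Pack a list of 0/1 values into bytes, eight bits per byte."""
    return bytes(sum(b << i for i, b in enumerate(bits[j:j + 8]))
                 for j in range(0, len(bits), 8))

def normalized_complexity(bits):
    """bz2 compressed size over raw size: a crude Kolmogorov stand-in."""
    raw = pack(bits)
    return len(bz2.compress(raw)) / len(raw)

def in_mid_band(bits, low=0.1, high=0.6):
    """True when complexity sits between 'periodic' and 'random', the
    regime the argument above says interesting patterns should occupy."""
    return low <= normalized_complexity(bits) <= high

periodic = [0, 1] * 4096                              # compresses to almost nothing
random.seed(1)
noise = [random.getrandbits(1) for _ in range(8192)]  # ratio near 1
```

Both extremes fall outside the band, as intended; the unsolved part is that plenty of uninteresting patterns land inside it.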
The difficulty is that these measures are noisy approximations, and the compression gains from structural interest tend to be small relative to the variance in the approximation. More fundamentally, Kolmogorov complexity measures descriptive complexity, not mathematical significance. A pattern can be irreducibly complex without being scientifically important.
What remains unsolved is a significance scoring function that does what a domain expert does: recognizes when a newly discovered pattern is a representative of a known class, when it is genuinely novel, and when the novelty is mathematically deep. That function would need to encode something like a research program — a set of open questions against which discoveries can be evaluated. Building that into an automated pipeline is the hard problem, and it has not been solved.