LLM-Driven Agents in Multi-Agent Simulation

In April 2023, Joon Sung Park and colleagues at Stanford published “Generative Agents: Interactive Simulacra of Human Behavior.” They populated a small virtual town — buildings, paths, a café, a park — with 25 agents, each running on GPT-3.5. Each agent had a name, a backstory, and a memory stream: a record of what it had done, observed, and inferred, which it used to plan its next action.

The researchers seeded the simulation with one instruction to one agent: Isabella Rodriguez was planning a Valentine’s Day party and wanted to invite people. No other agent was told about the party. Within two simulated days, agents who had never been told about the party were discussing it, inviting others, and adjusting their schedules. Information had propagated through the network the way it does in actual towns: through conversation, inference, and social planning. The party happened.

Nothing in the code produced this. The rumor spread because of how the agents reasoned, which followed from how their language model had been trained on text describing how humans in small communities reason. The behavior was not specified; it emerged. Whether that emergence is the same kind of thing as the emergence in Schelling’s segregation model is the question that makes this research area interesting and contested.


What LLM Agents Are

A traditional agent-based model gives each agent a simple behavioral rule. In Schelling’s segregation model, the rule is: move if fewer than X percent of your neighbors share your type. In Axelrod’s evolution of cooperation, the rule is: copy the strategy of your most successful neighbor. The rules are explicit, small, and directly inspectable. The interesting behavior — segregation, cooperation — emerges from the interaction of many such simple rules.
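The Schelling rule is small enough to state in a few lines. Below is a minimal sketch on a toroidal grid with two agent types; the grid size, vacancy rate, and 50 percent threshold are illustrative choices, not values from Schelling's paper:

```python
import random

def neighbors(grid, i, j):
    n = len(grid)
    return [grid[(i + di) % n][(j + dj) % n]
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

def unhappy(grid, i, j, threshold):
    # Schelling's rule: move if fewer than `threshold` of your
    # occupied neighbors share your type (0 marks an empty cell)
    me = grid[i][j]
    occ = [v for v in neighbors(grid, i, j) if v != 0]
    return me != 0 and bool(occ) and sum(v == me for v in occ) / len(occ) < threshold

def step(grid, threshold, rng):
    n = len(grid)
    movers = [(i, j) for i in range(n) for j in range(n)
              if unhappy(grid, i, j, threshold)]
    empties = [(i, j) for i in range(n) for j in range(n) if grid[i][j] == 0]
    for i, j in movers:
        ei, ej = empties.pop(rng.randrange(len(empties)))
        grid[ei][ej], grid[i][j] = grid[i][j], 0
        empties.append((i, j))  # the vacated cell becomes available

def similarity(grid):
    # mean fraction of same-type neighbors, a standard segregation measure
    fracs = []
    for i in range(len(grid)):
        for j in range(len(grid)):
            occ = [v for v in neighbors(grid, i, j) if v != 0]
            if grid[i][j] != 0 and occ:
                fracs.append(sum(v == grid[i][j] for v in occ) / len(occ))
    return sum(fracs) / len(fracs)

rng = random.Random(0)
cells = [1] * 150 + [2] * 150 + [0] * 100
rng.shuffle(cells)
grid = [cells[r * 20:(r + 1) * 20] for r in range(20)]
before = similarity(grid)
for _ in range(30):
    step(grid, 0.5, rng)
after = similarity(grid)  # segregation emerges: `after` ends well above `before`
```

The entire behavioral specification is the `unhappy` function; everything else is bookkeeping. That is the sense in which the rule is directly inspectable.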

An LLM agent replaces the explicit rule with a language model. The agent receives a prompt describing its role, its memories, and its current situation. It produces, in natural language, a description of what it will do next. A parser converts that description into actions in the simulation environment. The agent has no hand-coded rule for spreading rumors, forming friendships, or organizing events. It has a language model trained on text produced by humans who spread rumors, form friendships, and organize events.
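The loop is easy to sketch. Everything below is illustrative: the prompt format, the "ACTION:" convention, and the `llm` callable are assumptions standing in for a real model API, not the Generative Agents implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    backstory: str
    memories: list = field(default_factory=list)  # the "memory stream"

def build_prompt(agent, situation):
    # naive recency-only retrieval; the Generative Agents paper scores
    # memories by recency, importance, and relevance
    recent = "\n".join(agent.memories[-5:])
    return (f"You are {agent.name}. {agent.backstory}\n"
            f"Recent memories:\n{recent}\n"
            f"Current situation: {situation}\n"
            "Reply with one line: ACTION: <verb> <target>")

def parse_action(reply):
    # the parser that turns free text back into a simulation action
    for line in reply.splitlines():
        if line.strip().startswith("ACTION:"):
            verb, _, target = line.strip()[len("ACTION:"):].strip().partition(" ")
            return verb, target
    return "wait", ""  # fall back if the model ignores the format

def agent_step(agent, situation, llm):
    verb, target = parse_action(llm(build_prompt(agent, situation)))
    agent.memories.append(f"{situation} -> I chose to {verb} {target}")
    return verb, target

# stub model for demonstration; a real run would call an LLM here
stub = lambda prompt: "ACTION: walk_to cafe"
isabella = Agent("Isabella Rodriguez", "You run the local cafe.")
action = agent_step(isabella, "It is 9am on February 13.", stub)
```

Note where the behavior lives: there is no rule anywhere in this loop. The entire behavioral specification has moved into whatever `llm` returns.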

The behavioral repertoire is consequently much richer than any hand-coded rule can produce. LLM agents can reason about other agents’ intentions, update beliefs based on partial information, make plans that extend over multiple steps, and respond to novel situations with contextually appropriate behavior — because all of these are things that appear in the training data.

This is a genuine capability gain. Traditional ABMs struggle to model social behavior that requires language, symbolic reasoning, or planning more than one step ahead. Schelling’s model abstracts away the actual process of deciding to move; it gives you a threshold and nothing else. LLM agents can model that decision process in much more detail.


The Emergence Question

The Generative Agents simulation produced emergent social behavior in the sense that matters to ABM researchers: behavior that was not explicitly programmed, arising from local interactions. This is the formal criterion. But LLM-based emergence is different in character from CA-based or threshold-rule-based emergence, and the difference matters for what you can learn from the simulation.

In Conway’s Life, the glider is not in the rule. The B3/S23 rule says nothing about propagating configurations. The glider is a structural consequence of how the rule behaves — it is discovered in the rule space by running the system forward. The connection between rule and behavior is something you can derive.
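The point is checkable in a few lines. Here is a sparse-set implementation of B3/S23 (a dead cell is born with exactly 3 live neighbors; a live cell survives with 2 or 3). Nothing in the rule mentions motion, yet running it forward shows the glider reappearing translated:

```python
from collections import Counter

def life_step(live):
    # B3/S23: count live neighbors of every candidate cell
    counts = Counter((x + dx, y + dy) for (x, y) in live
                     for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                     if (dx, dy) != (0, 0))
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = life_step(state)

# after four steps: the same five-cell shape, shifted one cell
# diagonally, although nothing in life_step encodes "move"
assert state == {(x + 1, y + 1) for (x, y) in glider}
```

The glider's period and direction of travel are derivable facts about the rule, which is exactly what distinguishes this kind of emergence from pattern retrieval.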

In a Generative Agents simulation, the social behaviors that emerge are patterns from the training distribution. The model was trained on text describing how people behave in social settings. When it produces an agent who organizes a party, it is not discovering an emergent property of the rule; it is retrieving a pattern that was compressed into the model’s weights during training on human-generated text. The behavior is not derived from a simple rule; it is recovered from a learned approximation to human behavior.

This is not a flaw — it is what makes LLM agents useful for modeling complex social behavior. But it means you have to be careful about what the simulation is telling you.


The Training-Distribution Problem

When a simulation produces a result, you want to be able to say: this outcome is a consequence of the interaction structure, not an artifact of my modeling choices. In a threshold-rule ABM, this is mostly achievable. You can vary the threshold, vary the network topology, run ablations. The result is sensitive to specific parameters, and you can characterize that sensitivity.
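With an explicit rule, the sensitivity analysis is literally a loop over parameters. A toy illustration, assuming a threshold contagion rule on a ring network (the topology, population, and seed counts are illustrative): sweeping the threshold exposes a sharp transition, and locating that transition is what it means to characterize the sensitivity.

```python
import random

def cascade_size(n, n_seeds, threshold, rng):
    # threshold rule: a node adopts once at least `threshold` of its
    # two ring neighbors have adopted
    adopted = [False] * n
    for i in rng.sample(range(n), n_seeds):
        adopted[i] = True
    changed = True
    while changed:
        changed = False
        for i in range(n):
            if not adopted[i]:
                frac = (adopted[i - 1] + adopted[(i + 1) % n]) / 2
                if frac >= threshold:
                    adopted[i] = changed = True
    return sum(adopted) / n  # final adoption fraction

rng = random.Random(1)
sweep = {t: cascade_size(200, 10, t, rng) for t in (0.25, 0.5, 0.75)}
# thresholds up to 0.5 cascade to the whole ring (one adopted neighbor
# suffices); at 0.75 a node needs both neighbors, so spread stalls
```

There is no analogous loop for an LLM agent: no `threshold` argument exists to sweep.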

In an LLM agent simulation, the behavior is a consequence of both the interaction structure and the model’s training distribution. If you simulate a market with LLM agents and observe price coordination, you cannot immediately distinguish two hypotheses: (1) coordination is an emergent consequence of the market structure, and (2) coordination occurs because the language model was trained on text describing markets in which coordination occurs. The model has read Adam Smith, Keynes, and behavioral economics. Its agents will behave in ways consistent with the economics literature, regardless of whether the market structure you specified would actually produce that behavior.

This is the training-distribution problem stated precisely: the simulation reflects the model’s learned prior over human behavior, not necessarily the behavior of actual humans interacting in the specified structure.

The calibration problem is the methodological consequence. Traditional ABMs can be calibrated against empirical data by adjusting explicit parameters to match observed outcomes. LLM agents have no explicit behavioral parameters to adjust. You can change the prompt, but the prompt interacts with the training distribution in complex ways. You cannot systematically vary “tendency to cooperate” the way you can vary a cooperation threshold. The agent’s cooperation tendency is a function of the model and the prompt together, and the relationship between them is not transparent.


What This Changes About ABM

The practical value of LLM agents in simulation is real and distinct from the theoretical concerns. For generating realistic synthetic social data — conversations, decisions, sequences of daily activities — LLM agents are substantially more capable than hand-coded agents. For exploratory modeling of systems where the behavioral mechanisms are poorly understood, LLM agents can generate plausible-looking dynamics that suggest hypotheses for further investigation.

The concern is using LLM agent simulations as evidence for specific mechanistic claims. If you want to argue that a particular market structure produces coordination as an emergent consequence of the rules, you need a simulation whose agents’ behavior is governed by those rules — not by a prior over how market participants have been described in financial literature. The two are not the same, even when they produce similar-looking outputs.

The open problem is systematic ways to distinguish behavior that is genuinely emergent from the interaction structure from behavior that is retrieved from the training distribution. One approach: run the same structural scenario with agents trained on different corpora and check whether the emergent behavior is stable across training distributions. If the party still happens regardless of whether the agents were trained on fiction, news, or social science literature, that is mild evidence for structural rather than distributional causation. If the behavior changes substantially across corpora, the training distribution is doing load-bearing work.
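A minimal harness for that ablation might look as follows. The models here are stubs standing in for LLMs trained on controlled corpora (the expensive part); the harness itself just runs the same structural scenario under each model and compares an outcome metric:

```python
def run_scenario(model, seed):
    # placeholder: a real study would run the full multi-agent simulation
    # with `model` driving every agent, then return the event log
    return model(f"simulate the town, seed={seed}")

def outcome_stability(models, metric, runs=5):
    # small spread across corpora -> mild evidence for structural causation;
    # large spread -> the training distribution is doing load-bearing work
    means = {}
    for name, model in models.items():
        vals = [metric(run_scenario(model, s)) for s in range(runs)]
        means[name] = sum(vals) / len(vals)
    spread = max(means.values()) - min(means.values())
    return means, spread

# hypothetical stubs: the fiction- and news-trained models produce the
# party, the social-science-trained one does not
models = {
    "fiction": lambda prompt: {"party_happened": True},
    "news": lambda prompt: {"party_happened": True},
    "social_science": lambda prompt: {"party_happened": False},
}
means, spread = outcome_stability(models, lambda log: float(log["party_happened"]))
# a spread of 1.0 would say the corpus, not the structure, drives the outcome
```

The harness is trivial; the bottleneck is obtaining models whose training corpora differ in controlled ways, which is why the study is rarely done.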

This kind of ablation study is technically feasible but rarely done. It requires access to models trained on controlled corpora, which is expensive. Until it becomes standard practice, LLM agent simulations will remain powerful tools for generating hypotheses and limited tools for testing them.


Further Reading