What happens when you ask an LLM to simulate 6,000 American households answering questions about inflation? Recent papers find that large language models can replicate the average responses of major household surveys to within a percentage point (Zarifhonarvar, 2026). In 2020, the Survey of Consumer Expectations (SCE) reported a median one-year-ahead inflation expectation of about 3%. The median produced by an LLM prompted with realistic personas and a knowledge-cutoff instruction was also about 3%. That is close enough that LLMs have been pitched as a low-cost, high-frequency complement to the SCE, the Michigan Survey, and the Survey of Professional Forecasters.

In a recent paper, “Can LLMs Mimic Household Surveys?”, co-authored with Ami Dalloul from the University of Duisburg-Essen, we look at the second moment: the dispersion of responses, which tells you whether the model represents one opinion or a thousand. It is here that the apparent success of LLM-based surveys disappears. The same Llama-3 model that hits the SCE median to within a percentage point places 95% of its simulated respondents inside a two-percentage-point window. The real 2020 SCE responses range from roughly minus 25 to plus 27 percent. In short, the average is right, but the population behind it does not exist. Running a simulation with several thousand LLM personas therefore boils down to querying one representative agent.

Figure 1: Dispersion of Real-World and Synthetic Survey Populations

Note: The left panel plots the dispersion of individual 2020 SCE respondents around their mean. The diffuse scatter reflects heterogeneous beliefs across respondents. The middle panel applies the same construction to synthetic responses from a Llama-3.1-8B-Instruct model prompted with personas matching the SCE demographic distribution. The scatter collapses to a near-point: the model recovers the mean and discards everything else. The right panel uses the same Llama model after unlearning with gradient ascent (GA). The unlearned model achieves a more realistic dispersion and does not collapse around the mode.

Mode collapse

We benchmarked five LLMs (Llama-3-8B, Llama-3-70B, Claude-3.7-Sonnet, DeepSeek-V3, GPT-4o) against the SCE, Michigan Survey, and Survey of Professional Forecasters. In the human surveys, 44 to 70% of respondents give answers more than 3 percentage points away from the modal reply; in the LLM samples, that share is essentially zero.
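
The dispersion statistic used here can be sketched in a few lines. This is a minimal illustration, assuming replies are rounded to whole percentage points before the mode is taken (the rounding rule is an assumption, not something stated above); the function name and the toy samples are hypothetical and are not data from the paper.

```python
from collections import Counter

def share_beyond_mode(responses, threshold=3.0, decimals=0):
    """Percent of responses more than `threshold` points from the modal reply."""
    if not responses:
        raise ValueError("responses must not be empty")
    # Assumption: round before counting so near-identical replies share a bin.
    rounded = [round(r, decimals) for r in responses]
    mode = Counter(rounded).most_common(1)[0][0]
    far = sum(1 for r in responses if abs(r - mode) > threshold)
    return 100.0 * far / len(responses)

# Hypothetical toy samples (not data from the paper).
human_like = [-5.0, 0.0, 2.0, 3.0, 3.0, 3.0, 5.0, 8.0, 10.0, 20.0]
llm_like = [3.0, 3.0, 3.1, 2.9, 3.0, 3.0, 3.2, 3.0, 2.8, 3.0]

print(share_beyond_mode(human_like))  # dispersed sample
print(share_beyond_mode(llm_like))    # collapsed sample
```

On the toy data, the dispersed sample has a sizeable share of replies beyond the threshold, while the collapsed sample has none, which is the pattern described for the human and LLM samples.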

The standard remedies from the survey-simulation literature do not fix this problem. Census-derived personas with complex and varying characteristics, zero-shot knowledge-cutoff instructions (“you do not know events after June 2018”), and explicit “do not look up statistics” prompts all produce the same narrow distribution. The likely cause is that the LLMs have seen CPI tables, news coverage of FRBNY survey releases, and academic replications in their training corpora. Asked for a 2020 inflation expectation, the model retrieves memorized statistics rather than reasoning as the persona would. The weight of that training data overpowers whatever the prompt instructs it to do.

Unlearning the LLMs

If memorized statistics are the problem, a potential fix is to remove them from the weights rather than ask the model to look away. We applied two unlearning methods to Llama-3.1-8B-Instruct, an open-source model that allows us to modify its weights:

  • Gradient Ascent (GA) maximizes prediction loss on a forget set of CPI series and survey aggregates, with a retain loss on micro-survey reasoning so general capability survives.
  • Negative Preference Optimization (NPO) treats the forget set as dispreferred completions and minimizes a bounded preference loss against a reference model.
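
The two objectives in the list above can be sketched on per-sequence log-likelihoods. This is a minimal illustration, not the paper's training code: the NPO expression follows the form in which that loss is commonly stated, and `retain_weight` and `beta` are hypothetical hyperparameters whose values are not given here. In practice the log-probabilities would come from the fine-tuned model and a frozen reference copy.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def ga_with_retain_loss(forget_logps, retain_logps, retain_weight=1.0):
    """Gradient ascent on the forget set plus a standard loss on the retain set.

    Minimizing +mean(log p) on the forget set is the same as maximizing its
    negative log-likelihood; the retain term is an ordinary NLL.
    """
    forget_term = mean(forget_logps)    # pushed down during training
    retain_term = -mean(retain_logps)   # usual NLL, kept small
    return forget_term + retain_weight * retain_term

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def npo_loss(policy_logps, reference_logps, beta=0.1):
    """(2 / beta) * mean(log(1 + (pi_theta / pi_ref) ** beta)) on the forget set."""
    terms = [softplus(beta * (p - r))
             for p, r in zip(policy_logps, reference_logps)]
    return (2.0 / beta) * mean(terms)
```

The NPO loss is bounded below by zero and flattens out as the model's likelihood on the forget set falls relative to the reference model, whereas the gradient-ascent forget term has no lower bound, which is why it is paired with a retain loss.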

The data we ask the model to forget is the official inflation record itself: monthly CPI series and published mean inflation expectations from the FRBNY SCE and Michigan surveys. The unlearning effect on the response distribution is in Table 1.

Table 1: Tail Accuracy with Different Unlearning Strategies

Note: Unlearning strategies to mitigate mode collapse. Gradient ascent (GA) is a targeted unlearning method in which the model is fine-tuned to maximize loss on a dataset of official CPI statistics while minimizing loss on a retain (RT) dataset of micro-survey data. Negative preference optimization (NPO) treats official statistics as negative samples whose generation is penalized, while treating the retain (RT) samples as positive. The table reports synthetic survey replies on inflation expectations as percentage deviations from the mode and from the mean (in brackets), in bins of exact matches, ±1, and more than ±3 deviations. Tail Acc. measures closeness to the FRBNY tail dispersion benchmark (> ± 3.0 = 44.38).

The baseline Llama-3 (which already includes prompt-based unlearning instructions) produces an exact mode match on 92% of replies and no replies more than 3pp away. Its tail accuracy against the SCE benchmark of 44% is therefore zero. After GA, exact matches drop to 24% and 43% of replies move beyond ±3pp; tail accuracy reaches 97%. NPO is comparable, with 37% exact matches, 43% of replies beyond ±3pp, and 98% tail accuracy. In other words, both unlearning methods appear to recover a more realistic distribution.
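
The exact formula behind the tail accuracy figures is not spelled out here. One reading that is consistent with the reported numbers is one minus the relative gap between the model's share of replies beyond ±3pp and the SCE benchmark of 44.38. The sketch below implements that assumed definition; the function name is hypothetical and the paper's actual metric may differ.

```python
SCE_TAIL_BENCHMARK = 44.38  # share of SCE replies beyond ±3, from the Table 1 note

def tail_accuracy(model_tail_share, benchmark=SCE_TAIL_BENCHMARK):
    """Assumed definition: 100 * (1 - relative gap to the benchmark), floored at 0."""
    gap = abs(model_tail_share - benchmark) / benchmark
    return max(0.0, 100.0 * (1.0 - gap))

print(tail_accuracy(0.0))   # a model with no replies in the tails
print(tail_accuracy(43.0))  # a model with a 43% tail share
```

Under this assumption, a model with no replies in the tails scores zero and a tail share of 43% scores roughly 97, in line with the baseline and GA figures quoted above.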

Figure 2: Dispersion of LLMs vs. Unlearning Models

Note: The left-hand side plots kernel density estimates of 2020 inflation expectations from the FRBNY SCE and two Llama-3 variants trained with unlearning methods, gradient ascent (GA) and negative preference optimization (NPO). Both unlearning variants cover the range where FRBNY SCE places probability mass, though they still remain more concentrated than the human benchmark and slightly skewed to higher means. The right-hand side compares the KDEs of prompted LLM-generated expectations (GPT-4o, Llama-3, etc.) to FRBNY SCE in 2020. The LLM curves (left axis) are tightly clustered around a narrow region, while the FRBNY SCE curve remains much broader. The LLMs can match central tendency yet fail to reproduce the cross-sectional spread of survey micro-data. Bandwidth = 0.5 for all KDEs.

The kernel densities (Figure 2) show that off-the-shelf models pile probability mass into a thin spike near the mean. The unlearned variants spread mass across the range where the human respondents of the SCE put it.
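
A fixed-bandwidth Gaussian kernel density estimate of this kind can be sketched without external libraries. The example assumes that the bandwidth of 0.5 is the kernel's standard deviation in percentage points (it could instead be a scaling factor in the plotting library used); the samples are hypothetical toy values, not survey data.

```python
import math

def gaussian_kde(samples, grid, bandwidth=0.5):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

# Hypothetical toy samples (not data from the paper).
narrow = [2.9, 3.0, 3.0, 3.0, 3.1]                # collapsed, LLM-like replies
wide = [-10.0, -2.0, 1.0, 3.0, 6.0, 12.0, 25.0]   # dispersed, survey-like replies

step = 0.1
grid = [i * step for i in range(-400, 401)]  # -40 to 40 percentage points
narrow_density = gaussian_kde(narrow, grid)
wide_density = gaussian_kde(wide, grid)

print(max(narrow_density), max(wide_density))
```

Both curves integrate to one, so the collapsed sample necessarily shows up as a much taller and thinner spike than the dispersed one.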

Simulating a randomized controlled trial

A wider distribution is necessary but not sufficient for the application that motivated our paper: replicating survey RCTs with synthetic versions. RCTs are expensive. After data collection ends, a researcher cannot go back to test a theory that emerged later or vary a treatment. Synthetic agents would let us do exactly that, if their behavior matches what real respondents produce.

To test this, we replicate a real-world RCT by Coibion, Gorodnichenko, and Weber (2022). Respondents are randomly assigned to one of several groups: a control group sees no information, several treatment groups each receive a different economic piece of information (the actual past inflation rate, the Fed’s 2% target, etc.), and a placebo group is shown content unrelated to inflation. All respondents first report a prior inflation expectation, then see whatever their group is assigned, and then report a new posterior expectation. The difference between posterior and prior is the respondent’s revision.

A treatment works if its revisions differ visibly from the control group’s, and if the direction of the shift matches what economic theory expects: downward revisions from FOMC communication, upward revisions from news of higher gasoline prices. The check for our synthetic agents is whether their revisions separate the same way the human respondents did.
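
The comparison described above can be sketched as a difference in mean revisions between a treatment arm and the control arm. This is a simplified illustration with a Welch-type standard error; the original study and our paper may estimate effects differently (for example, by regression with controls), and the numbers below are hypothetical toy values.

```python
import math
from statistics import mean, variance

def revisions(priors, posteriors):
    """Revision = posterior expectation minus prior expectation."""
    return [post - prior for prior, post in zip(priors, posteriors)]

def treatment_effect(treated_rev, control_rev):
    """Difference in mean revisions and its Welch standard error."""
    effect = mean(treated_rev) - mean(control_rev)
    se = math.sqrt(
        variance(treated_rev) / len(treated_rev)
        + variance(control_rev) / len(control_rev)
    )
    return effect, se

# Hypothetical toy data: priors and posteriors in percent.
control = revisions([4.0, 5.0, 3.0, 6.0], [4.0, 5.5, 3.0, 5.5])
fed_target = revisions([4.0, 5.0, 3.0, 6.0], [3.0, 3.5, 2.5, 4.0])

effect, se = treatment_effect(fed_target, control)
print(effect, se)
```

A negative effect that is large relative to its standard error is the pattern one would expect from a treatment such as the Fed's 2% target; an effect near zero in every arm is what a collapsed model produces.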

We built 30,000 synthetic personas with Census-derived demographics and estimated the average treatment effect for each of three models: the baseline Llama-3 and its two unlearned variants. The first check is on the priors themselves: the inflation expectations agents report before they see any information. Figure 3 plots the mean and standard deviation of these priors across demographic subgroups for the human benchmark and the three models. One unlearned model (Llama-GA) comes close to the human aggregate in both level and dispersion; the other (Llama-NPO) does not. Unlearning may therefore not be a one-size-fits-all remedy.

Figure 3: Model Estimates of Perceived Inflation

Note: Each panel plots by demographic subgroup for the human benchmark (Coibion et al., 2022), the baseline Llama-3, and its two unlearned variants (GA, NPO). The dashed line marks the human “All” value. Left-hand side: Llama-3 and Llama-NPO are essentially flat across demographic characteristics; Llama-GA tracks the human level on average but does not reproduce the within-demographic ordering (e.g. predicting the highest mean for “college or more” and “Inc T3,” contrary to the human pattern). Right-hand side: the unlearned GA model recovers most of the dispersion collapsed by the base model.

The next check is on how the priors get updated after the information treatment. In the baseline Llama-3 and Llama-NPO models, revisions are essentially identical across every treatment and the models do not register a treatment effect at all. Llama-GA is the only one where the treatments separate, and within its largest subgroup of agents (80% of the sample) the four monetary-policy treatments (past inflation, Fed target, FOMC forecast, FOMC statement) produce negative and significant revisions of the same sign and rough magnitude as the human respondents in Coibion et al.

What to take from this

For researchers and practitioners deciding whether to use LLMs to conduct surveys, the summary is:

  • Off-the-shelf LLMs are unable to imitate different personas. Simulating a survey comes down to one agent answering the same question thousands of times and landing very close to the mean every time, sometimes to four decimal places.
  • Targeted unlearning recovers most of the dispersion and a respectable share of the treatment effects in an RCT with human respondents. However, unlearning methods achieve different levels of success.
  • The gap between mean accuracy and distributional accuracy is large enough that any paper using synthetic respondents should report distributional accuracy alongside the mean.

Future work should treat distributional accuracy and data leakage as joint constraints rather than secondary concerns. Progress will depend on methods that account for both what models know and how their outputs are evaluated, with greater attention paid to dispersion, tails, and belief updating rather than averages alone.

References

Coibion, O., Y. Gorodnichenko, and M. Weber (2022). Monetary policy communications and their effects on household inflation expectations. Journal of Political Economy 130(6), 1537–1584.

Dalloul, A., and M. Pfeifer (2026). Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions. SSRN preprint.

Zarifhonarvar, A. (2026). Generating inflation expectations with large language models. Journal of Monetary Economics 157, 103859.

Replication Data

Dalloul, A., and M. Pfeifer (2026). Replication Data for: “Can LLMs Mimic Household Surveys?: From Representative Agents to Population Distributions”. Harvard Dataverse, V1. https://doi.org/10.7910/DVN/CRIRVJ


