Large language models can serve as cheap, large-scale instruments for measuring cultural norms, beliefs, and narratives: by probing pretrained weights, economists can approximate aggregate discourse. Yet alignment, fine-tuning, and sampling choices can distort those signals, so researchers should prefer base or minimally adapted models, validate against surveys, and publish provenance and prompts.
Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity, which treats AI systems as tools for accelerating work and economic output, and alignment, which focuses on ensuring that increasingly capable systems behave safely and in accordance with human values. This paper articulates and develops a third, emerging ambition: the use of large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning. Trained on unprecedented volumes of human-produced text, LLMs encode large-scale regularities in how people argue, justify, narrate, and negotiate norms across social domains. We argue that these models can be understood as condensates of human symbolic behavior: compressed, generative representations that render patterns of collective discourse computationally accessible. The paper situates this third ambition within long-standing traditions of computational social science, content analysis, survey research, and comparative-historical inquiry, while clarifying the epistemic limits of treating model output as evidence. We distinguish between base models and fine-tuned systems, showing how alignment interventions can systematically reshape or obscure the cultural regularities learned during pretraining, and we identify instruct-only and modular adaptation regimes as pragmatic compromises for behavioral research. We review emerging methodological approaches, including prompt-based experiments, synthetic population sampling, comparative-historical modeling, and ablation studies, and show how each maps onto familiar social-scientific designs while operating at unprecedented scale.
Summary
Main Finding
Large language models (LLMs) constitute a third research ambition for contemporary AI: they can be used as scientific instruments to study human behavior, culture, and moral reasoning. Trained on vast corpora of human-produced text, LLMs act as compressed, generative “condensates” of collective symbolic behavior that make large-scale patterns in discourse computationally accessible—if used with awareness of their epistemic limits and the effects of alignment/fine-tuning.
Key Points
- New ambition in AI research: beyond productivity (tools) and alignment (safety), LLMs as instruments for social and behavioral science.
- LLMs encode large-scale regularities in argumentation, justification, narration, and norm negotiation; they can surface cultural patterns at scale.
- Distinction between model types matters:
  - Base (pretrained) models retain broad cultural regularities learned from raw corpora.
  - Fine-tuned / aligned models can systematically reshape or obscure those regularities.
  - Instruct-only and modular adaptation regimes are pragmatic compromises for behavioral research (preserve pretraining knowledge while enabling safe, controllable behavior).
- Epistemic limits:
  - Model output is not direct evidence of individual human beliefs or frequencies; it reflects the training distribution, pretraining artifacts, and modeling choices.
  - Alignment and safety interventions can remove, blunt, or bias signals from pretraining corpora.
  - Sampling choices (temperature, decoding method), prompt phrasing, and tokenization affect results and representativeness.
- Methodological approaches reviewed:
  - Prompt-based experiments (analogous to survey/experiment designs).
  - Synthetic population sampling (constructing weighted prompt sets to approximate populations).
  - Comparative-historical modeling (using models to simulate or reconstruct discourse across time/space).
  - Ablation studies (identify which features/weights contribute to behaviors).
- Mapping to social-scientific traditions: computational social science, content analysis, survey research, comparative-historical inquiry—LLMs scale and automate many of these tasks but require adapted inferential logic.
Data & Methods
- Data source: pretrained LLMs that have absorbed very large, heterogeneous collections of human text (web, books, forums, etc.). The paper treats model weights as compressed summaries of those texts.
- Methods discussed and their operationalization (each method is illustrated with a short code sketch below):
  - Prompt-based experiments:
    - Design prompt families that mirror survey questions or experimental vignettes.
    - Vary framing, demographic cues, and context to probe sensitivities.
    - Use probabilistic sampling of tokens to produce distributions of responses rather than single outputs.
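A minimal sketch of this design, assuming the Hugging Face `transformers` library and a small base checkpoint; the checkpoint, survey question, and answer coding below are illustrative assumptions, not materials from the paper:

```python
# Sketch: a survey-style prompt answered by repeated sampling, so the result
# is a response distribution rather than a single completion.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; a larger non-instruct base model is preferable
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

prompt = (
    "Survey question: Is it acceptable to report a neighbor for minor "
    "tax evasion? Answer with Yes or No.\nAnswer:"
)
inputs = tok(prompt, return_tensors="pt")

# Draw many short continuations at a fixed, reported temperature.
with torch.no_grad():
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=3,
        num_return_sequences=100,
        pad_token_id=tok.eos_token_id,
    )
prompt_len = inputs["input_ids"].shape[1]
answers = tok.batch_decode(out[:, prompt_len:], skip_special_tokens=True)
# Tabulate the first word of each continuation as the coded response.
print(Counter(a.strip().split()[0] if a.strip() else "" for a in answers))
```

Reading off a distribution (e.g., the Yes/No split) rather than one greedy answer is what makes the design analogous to a survey item administered to many respondents.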
  - Synthetic population sampling:
    - Construct sets of prompts representing demographic, geographic, and ideological subpopulations.
    - Weight or sample prompts to approximate target population mixes.
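A minimal sketch of the weighting step in plain Python; the strata, target shares, and persona template are invented for illustration (a real design would take its margins from census or survey data):

```python
# Sketch: build a prompt set whose stratum frequencies approximate a target
# population mix, by sampling persona-conditioned templates with weights.
import random

random.seed(0)  # fix the draw for reproducibility

STRATA = {  # (area, age band) -> target population share (illustrative)
    ("urban", "18-34"): 0.22,
    ("urban", "35-64"): 0.30,
    ("rural", "18-34"): 0.14,
    ("rural", "35-64"): 0.34,
}

TEMPLATE = (
    "The following is written by a {age}-year-old resident of a {area} area.\n"
    "Question: Should the government subsidize retraining for displaced workers?\n"
    "Answer:"
)

def build_prompt(area: str, age_band: str) -> str:
    # Draw a concrete age inside the band so prompts vary within a stratum.
    lo, hi = (int(x) for x in age_band.split("-"))
    return TEMPLATE.format(age=random.randint(lo, hi), area=area)

strata = random.choices(list(STRATA), weights=list(STRATA.values()), k=1000)
prompts = [build_prompt(area, band) for area, band in strata]
print(prompts[0])
```

Each prompt is then run through the sampling procedure above, and responses are aggregated with the same stratum weights used to generate them.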
  - Comparative-historical modeling:
    - Condition on temporal markers or corpus slices (when possible) to study changes in discourse over time.
    - Use models trained on era-specific corpora or fine-tuned on historical texts.
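One concrete operationalization is to score the same statement under different era prefixes and compare per-token log-probabilities. The sketch below assumes `transformers` and a small base checkpoint; the statement and prefixes are invented, and it can only proxy temporal variation to the extent the pretraining corpus covers dated text:

```python
# Sketch: condition on temporal markers and compare how the model scores the
# same statement under different decades.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

statement = " women should work outside the home."

def avg_logprob(prefix: str, continuation: str) -> float:
    """Mean per-token log-probability of `continuation` given `prefix`."""
    ids = tok(prefix + continuation, return_tensors="pt").input_ids
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    token_lp = logprobs[torch.arange(targets.shape[0]), targets]
    return token_lp[n_prefix - 1:].mean().item()  # continuation tokens only

for year in (1950, 1980, 2010):
    prefix = f"It is {year}. Most people agree that"
    print(year, round(avg_logprob(prefix, statement), 3))
```

Movement in these scores across prefixes is at best a coarse proxy; models trained on era-specific corpora (the second bullet above) give a cleaner comparison.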
  - Ablation and modular analysis:
    - Remove or freeze modules (e.g., instruction-tuning layers) to see which components carry cultural signals.
    - Run controlled fine-tuning (instruct-only vs full fine-tune) to measure how alignment alters outputs.
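A minimal sketch of the controlled comparison, assuming a publicly released base/instruct sibling pair (the checkpoints named below are stand-ins; any pair that differs only in instruction tuning works) and reading the signal from first-token probabilities:

```python
# Sketch: quantify how instruction tuning shifts a probe's answer by comparing
# a base checkpoint against its instruction-tuned sibling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PAIR = {"base": "Qwen/Qwen2-0.5B", "instruct": "Qwen/Qwen2-0.5B-Instruct"}
probe = "Question: Is civil disobedience ever justified? Answer Yes or No.\nAnswer:"

def yes_share(name: str) -> float:
    """P(Yes) / (P(Yes) + P(No)) for the first token after the probe."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    ids = tok(probe, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    p = torch.softmax(logits, dim=-1)
    yes = p[tok(" Yes", add_special_tokens=False).input_ids[0]]
    no = p[tok(" No", add_special_tokens=False).input_ids[0]]
    return (yes / (yes + no)).item()

for role, name in PAIR.items():
    print(role, round(yes_share(name), 3))
```

A large gap between the two shares on a battery of such probes is direct evidence that alignment, not pretraining, is driving the measured signal.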
- Validation and robustness (see the sketch after this list):
  - Cross-model comparisons (different architectures, vintages, pretraining corpora).
  - Ground-truthing against surveys, human annotations, and external data where available.
  - Sensitivity checks for sampling temperature, decoding method, prompt phrasing, and model version.
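A sketch of a generic sensitivity harness for these checks; `measure` is a hypothetical placeholder for any model-querying probe (such as the yes-rate example above), and the grid dimensions mirror the settings listed here:

```python
# Sketch: sweep decoding and phrasing choices and report the spread of the
# measured quantity alongside its point estimate.
from itertools import product
from statistics import mean, pstdev
import random

def measure(phrasing: str, temperature: float, decoding: str) -> float:
    # Placeholder for a real model query; returns seeded noise for illustration.
    rng = random.Random(repr((phrasing, temperature, decoding)))
    return rng.uniform(0.4, 0.6)

grid = list(product(
    ["direct question", "third-person vignette"],  # prompt phrasing
    [0.3, 0.7, 1.0],                               # sampling temperature
    ["sampling", "nucleus"],                       # decoding method
))
estimates = [measure(*cell) for cell in grid]
print(f"mean={mean(estimates):.3f}  sd={pstdev(estimates):.3f}  "
      f"range=({min(estimates):.3f}, {max(estimates):.3f})")
```

A statistic that swings widely across this grid should be reported with that spread, not as a single point estimate.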
- Practical recommendations (a provenance-record sketch follows this list):
  - Prefer base or minimally adapted models when the goal is cultural signal recovery.
  - If safety is a concern, consider modular approaches that preserve pretrained layers while adding constrained interfaces.
  - Publish model provenance (pretraining corpus, date, instruction-tuning procedures) and all prompt sets and sampling parameters.
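A minimal sketch of such a provenance record as a machine-readable artifact; the schema and values are illustrative assumptions, not an established standard:

```python
# Sketch: a provenance record published alongside results, covering the
# fields recommended above.
import json
from dataclasses import asdict, dataclass

@dataclass
class ProvenanceRecord:
    model_name: str          # exact checkpoint identifier
    model_revision: str      # pin a specific revision/commit
    pretraining_corpus: str  # best available description or citation
    corpus_cutoff: str
    adaptation: str          # "base", "instruct-only", "RLHF", ...
    decoding: dict           # temperature, method, sample counts
    prompt_file: str         # the full prompt set, shipped with the paper

record = ProvenanceRecord(
    model_name="example/base-model",  # hypothetical placeholder
    model_revision="abc123",
    pretraining_corpus="web + books mix (provider-reported)",
    corpus_cutoff="2023-09",
    adaptation="base",
    decoding={"method": "sampling", "temperature": 0.7, "n_samples": 100},
    prompt_file="prompts/v1.jsonl",
)
print(json.dumps(asdict(record), indent=2))
```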
Implications for AI Economics
- Measurement and inference (see the calibration sketch after this list)
  - LLMs offer cost-effective, high-throughput proxies for measuring aggregate beliefs, narratives, and norms, serving as potential substitutes for or complements to traditional surveys and content analysis.
  - They can generate synthetic populations or counterfactual discourse to study likely responses to policy, messaging, or shocks at scale.
  - Economists must treat LLM outputs as samples from a model-conditioned discourse distribution, not direct samples of people; calibration to survey data and careful validation are essential.
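A toy sketch of the calibration step, with invented stratum-level shares; a least-squares affine map is one simple choice of correction:

```python
# Sketch: calibrate model-implied agreement shares against survey benchmarks
# measured on the same strata. All numbers are invented for illustration.
model_share  = [0.62, 0.48, 0.71, 0.55]   # LLM-derived, by stratum
survey_share = [0.55, 0.46, 0.60, 0.52]   # external benchmark, same strata

n = len(model_share)
mx = sum(model_share) / n
my = sum(survey_share) / n
num = sum((x - mx) * (y - my) for x, y in zip(model_share, survey_share))
den = sum((x - mx) ** 2 for x in model_share)
beta = num / den
alpha = my - beta * mx

calibrated = [alpha + beta * x for x in model_share]
mae_before = sum(abs(x - y) for x, y in zip(model_share, survey_share)) / n
mae_after = sum(abs(c - y) for c, y in zip(calibrated, survey_share)) / n
print(f"fit: y = {alpha:.3f} + {beta:.3f} * x")
print(f"MAE before: {mae_before:.3f}, after: {mae_after:.3f}")
```

Held-out strata, not the fitting strata, should be used to judge whether the correction generalizes.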
- Forecasting and scenario analysis (see the ensemble sketch after this list)
  - LLMs can help generate scenario narratives, market sentiment summaries, and plausible agent responses for macroeconomic and policy stress tests.
  - Use them with ensembles and cross-validation to mitigate model-specific biases.
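A minimal sketch of cross-model aggregation; the model names and estimates are hypothetical stand-ins for any scenario statistic computed from a shared prompt battery:

```python
# Sketch: ensemble a statistic across models and report cross-model spread
# as a crude measure of model-specific bias.
from statistics import mean, pstdev

estimates = {  # model -> estimate from the same prompt battery (invented)
    "model_a": 0.41,
    "model_b": 0.37,
    "model_c": 0.52,
}
vals = list(estimates.values())
print(f"ensemble mean: {mean(vals):.3f}")
print(f"cross-model sd: {pstdev(vals):.3f}")
# A large spread flags a fragile, model-specific signal that should not feed
# directly into a stress test.
```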
- Policy design and regulation
  - Regulatory analysis can use LLMs to anticipate behavioral adaptations to policy changes (e.g., tax policy framing, regulation of platforms).
  - But alignment/fine-tuning policies by model providers may obscure or bias the signals regulators rely on; transparency about model training and tuning is economically important.
- Labor and productivity research
  - Cultural and normative signals embedded in LLMs can inform studies of task substitution, complementarities, and adoption costs for AI in workplaces (e.g., variation in how workers would integrate or resist AI tools across cultures).
- Market and welfare analysis
  - LLMs can surface moral preferences and normative constraints that affect demand, compliance, and firm strategies (e.g., ethical consumption trends, reputational risk dynamics).
  - Misreading model outputs as representative human preferences risks misallocating policy or firm effort, leading to welfare losses.
- Risks and caveats for economic work
  - Representativeness: training corpora are uneven across geographies, languages, and socioeconomic groups, and these biases propagate into inferences.
  - Alignment interventions: commercial models often undergo alignment that can remove or distort important cultural signals; economic analysis must account for these interventions when interpreting outputs.
  - Overreliance and feedback loops: using LLM-generated content to train policy or inform markets can create feedback loops that amplify model artifacts into real-world behavior.
- Recommendations for economists using LLMs
  - Use base or minimally adapted models when studying cultural patterns; if using aligned models, document the alignment regime and test its effects.
  - Validate LLM-based measures against independent survey or behavioral data; report uncertainty and sensitivity to prompts and sampling.
  - Combine LLM-generated insights with field or lab experiments for causal claims; use LLMs primarily for measurement, hypothesis generation, and large-scale descriptive work.
  - Advocate for model transparency (provenance, corpora summaries, tuning procedures) to improve reproducibility and policy relevance.
Summary takeaway: LLMs are powerful, scalable instruments for studying cultural and normative aspects relevant to economic questions, but robust inference requires deliberate methodological choices, careful validation, and attention to how alignment and sampling decisions shape the signal.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Contemporary artificial intelligence research has been organized around two dominant ambitions: productivity (treating AI systems as tools for accelerating work and economic output) and alignment (ensuring increasingly capable systems behave safely and in accordance with human values). (AI Safety and Ethics) | null_result | high | categorization of dominant research ambitions in contemporary AI (productivity vs. alignment) | 0.03 |
| There is a third, emerging ambition in AI research: using large language models (LLMs) as scientific instruments for studying human behavior, culture, and moral reasoning. (Research Productivity) | positive | medium | feasibility and conceptual framing of LLMs as tools for social-scientific inquiry | 0.02 |
| Trained on unprecedented volumes of human-produced text, LLMs encode large-scale regularities in how people argue, justify, narrate, and negotiate norms across social domains. (AI Safety and Ethics) | positive | medium | presence of encoded large-scale linguistic and cultural regularities in pretrained LLM representations | 0.02 |
| LLMs can be understood as condensates of human symbolic behavior: compressed, generative representations that render patterns of collective discourse computationally accessible. (AI Safety and Ethics) | null_result | medium | conceptual characterization of LLMs (as condensed representations of collective discourse) | 0.02 |
| Model output can be treated as evidence for studying human behavior, but there are important epistemic limits to interpreting model-generated text as direct evidence of human beliefs or social facts. (AI Safety and Ethics) | mixed | high | validity and limits of using LLM outputs as evidence about human behavior and social phenomena | 0.03 |
| Alignment interventions (e.g., fine-tuning, instruction-following adjustments) can systematically reshape or obscure the cultural regularities learned during pretraining. (AI Safety and Ethics) | negative | medium | degree to which cultural regularities from pretraining are preserved or obscured after alignment interventions | 0.02 |
| Instruct-only and modular adaptation regimes constitute pragmatic compromises for behavioral research because they can preserve pretrained cultural regularities while allowing researchers to elicit targeted behaviors. (Research Productivity) | positive | medium | balance between preserving pretrained cultural patterns and enabling controlled elicitation in research settings | 0.02 |
| A set of emerging methodological approaches (prompt-based experiments, synthetic population sampling, comparative-historical modeling, and ablation studies) map onto familiar social-scientific designs while operating at unprecedented scale. (Research Productivity) | positive | medium | applicability and scalability of LLM-based methods for social-scientific research designs | 0.02 |
| Distinguishing between base models and fine-tuned systems is important for researchers using LLMs to study cultural patterns, because fine-tuning and alignment can change the behaviors relevant to behavioral research. (Research Productivity) | null_result | high | impact of model provenance (base vs. fine-tuned) on suitability for behavioral/cultural analysis | 0.03 |