Closed AI models often undermine scientific inference by hiding construction and deployment details, making results hard to interpret, replicate, or generalize; researchers should explicitly document threats to inference, the mitigation steps taken, and why a given (closed or open) model was chosen.
How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.
Summary
Main Finding
The paper argues that the increasing use of closed (hosted, black-box) language models in scientific research undermines reliable scientific inference. It identifies three core inferential threats (the versioning problem, the credit-assignment problem, and the insufficient-information problem) and argues that providing open model weights (and related artifacts) is the most direct and effective way to resolve or mitigate them. The authors recommend that researchers systematically identify inferential threats when using models, justify model choices, and document mitigations; model providers could also reduce harms by offering persistent versioning, probability outputs, and greater transparency.
Key Points
- Definitions
  - CLOSED MODEL: hosted/black-box system accessible via chat/API; internal components (system prompts, filters, search, wrappers) are hidden.
  - OPEN-WEIGHT MODEL: full model weights, tokenizer, architecture, and runnable code are provided so users can compute the model’s full output distribution.
- Three principal inferential threats from closed models:
  - Versioning problem: hosted models change over time (undocumented updates, deprecations), so outputs from different executions are not reliably comparable and older versions often become unavailable.
  - Credit-assignment problem: closed systems bundle language models with prompts, post-processing, APIs, guardrails, and other components, so observed behavior cannot be attributed to the core model or to particular design features.
  - Insufficient-information problem: closed models expose only sampled text outputs (not the full probability distribution or internal states), which limits evaluation (e.g., uncertainty estimates, low-probability alternatives) and makes many evaluation methods brittle or uninformative.
- Open weights largely resolve these threats by enabling:
  - Stable, reproducible runs of a known artifact (mitigating versioning).
  - Isolation and testing of the core model apart from system wrappers (resolving credit assignment).
  - Access to probability distributions, logits, and internal states for richer, less brittle evaluation (addressing insufficient information).
- Open weights are necessary but not always sufficient: full reproducibility and interpretability require openness of other pipeline components (tokenizers, decoding strategy, system prompts, post-processing, hardware/software metadata).
- Practical mitigation short of full openness includes: providers offering explicit versioning and persistent access, exposing decoding/probability outputs, and allowing reproducible configuration choices. Researchers should at minimum report model version, execution date, and settings used.
- The paper focuses on three research goals affected by openness: model evaluation, model comparison, and model interpretability.
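To make the insufficient-information problem concrete, here is a minimal sketch that is not taken from the paper. It uses two made-up logit vectors over a hypothetical four-token vocabulary. Both produce the same greedy output, so a text-only interface cannot tell them apart, while access to the distribution reveals very different levels of uncertainty. The function names, vocabulary, and numbers are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; higher values mean more uncertainty."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def greedy_token(vocab, probs):
    """Token a text-only interface would return under greedy decoding."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best_index]


# Hypothetical vocabulary and logits, chosen only for illustration.
VOCAB = ["yes", "no", "maybe", "unsure"]
confident = softmax([6.0, 0.0, 0.0, 0.0])   # mass concentrated on "yes"
hesitant = softmax([1.1, 1.0, 1.0, 1.0])    # nearly uniform, "yes" barely ahead

if __name__ == "__main__":
    for name, probs in [("confident", confident), ("hesitant", hesitant)]:
        # Same visible token in both cases; only the entropy differs.
        print(name, greedy_token(VOCAB, probs), round(entropy_bits(probs), 3))
```

With open weights, a researcher can compute quantities like `entropy_bits` directly; with a closed interface that returns only text, both cases look like the same answer.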
Data & Methods
- Type of contribution: conceptual analysis and structured argument (literature synthesis), not an original empirical experiment.
- Sources: extensive literature review and example citations from benchmark studies, model-release notes, and prior empirical findings (e.g., documented performance shifts across GPT versions, prompt sensitivity studies, evaluation metric critiques).
- Illustrative examples: documented differences across GPT-3.5/GPT-4 versions; compound-system behaviors (e.g., models augmented with calculators/APIs); sensitivity to system prompts and decoding strategy.
- Analytical approach: taxonomy of openness (closed vs open-weight), mapping of inferential goals to specific threats, and evaluation of what information (weights, tokenizers, prompts, decoding details, probability outputs, versioning guarantees) would mitigate each threat. Appendix A (not reproduced here) reportedly summarizes these mappings.
- Limitations: normative/conceptual focus rather than new empirical measurement; applicability demonstrated through examples and citations rather than systematic measurement across many commercial systems.
Implications for AI Economics
- Reproducibility and measurement: Closed models impede replication of empirical work that uses language models (e.g., as instruments, measurement tools, or simulated agents). For economists who rely on reproducible measures or cross-study comparisons (meta-analyses, long-run studies), the lack of model versioning and of probability outputs undermines the validity of results and their comparability over time.
- Inference about causal mechanisms and design features: Studies that compare architectures, training data regimes, or economic effects of deploying models require credit assignment. Closed compound systems hide what drives observed behavior, making it hard to infer causal links between model design and outcomes — weakening research on productivity effects, labor displacement modeling, or policy-counterfactual analyses.
- Market structure and competition analysis: The economic concentration of model-providing firms, combined with opaqueness, raises challenges for market-monitoring and antitrust analysis. If researchers and regulators cannot independently measure capability differences or audit behavior, it becomes harder to detect anti-competitive strategies, verify vendor claims, or assess entry barriers and platform externalities.
- Policy and regulation: Regulators aiming to assess risk, safety, or compliance need access to stable artifacts and meaningful diagnostics (e.g., uncertainty estimates). Closed models make robust auditing and certification difficult; open-weight access or mandated disclosure (versioning, access to probability outputs, logs of updates) would materially enhance regulatory capacity.
- Innovation and public goods: Open-weight models lower the cost of entry for academics and startups, enabling reproducible baseline research, method development, and public-good benchmarks. Conversely, closed models centralize innovation and knowledge, possibly slowing community-driven advances in measurement, verification, and econometric uses of LMs.
- Practical guidance for economists using LMs:
- Prefer open-weight models when studies aim to produce generalizable claims, comparisons, or mechanistic explanations.
- If using closed models, explicitly treat findings as tied to specific executions; document model name, apparent version, execution date, interface settings, and any observable post-processing. Where possible, obtain provider assurances (versioning, persistent access) or restrict claims accordingly.
- Avoid relying on closed LMs for inference that depends on calibrated confidence/probabilities unless the provider exposes those probabilities or unless the research is explicitly about the product-as-operated (not the underlying model).
- When studying market or policy questions about providers, stress-test conclusions given the nonstationarity and opaqueness of deployed systems.
- Trade-offs and policy levers: The paper highlights trade-offs between commercial incentives (protecting IP/competitive advantage, safety concerns) and scientific reproducibility/public-interest needs. For economic policy, this suggests a role for standards or incentives (e.g., procurement requirements, public funding conditionality, or disclosure mandates) to ensure critical models or those used in public-impact research offer sufficient transparency for reliable inference.
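The documentation items listed under the practical guidance above can be sketched as a small record type. This is a minimal illustration, not a format prescribed by the paper: the class name `ModelRunRecord`, its field names, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ModelRunRecord:
    """Hypothetical record of one execution of a (closed or open) model."""
    model_name: str
    reported_version: str            # as stated by the provider; may be "unknown"
    execution_date: str              # ISO 8601 date of the run
    interface: str                   # e.g. "api" or "chat"
    settings: dict = field(default_factory=dict)   # decoding and interface settings
    observed_post_processing: str = ""             # anything visible to the user

    def to_json(self):
        """Serialize with sorted keys so equal records give equal strings."""
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self):
        """Short, stable hash of the reported configuration."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]


# Example values are placeholders, not a real model or run.
record = ModelRunRecord(
    model_name="example-closed-model",
    reported_version="unknown",
    execution_date="2024-01-15",
    interface="api",
    settings={"temperature": 0.0, "max_tokens": 256},
)

if __name__ == "__main__":
    print(record.fingerprint())
    print(record.to_json())
```

Storing the fingerprint alongside each result makes it easy to check whether two sets of outputs came from identically described executions. It does not guarantee that the hosted model itself was unchanged between them, which is the versioning problem the paper describes.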
Assessment
Claims (4)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Restrictions on information about model construction and deployment threaten reliable inference in research that involves those models. Research Productivity | negative | high | reliable inference / scientific inference | 0.02 |
| Current closed models are generally ill-suited for scientific purposes (with some notable exceptions). Research Productivity | negative | high | suitability of models for scientific research / quality of scientific inference | 0.02 |
| The inferential issues that closed models present can be resolved or mitigated by certain measures. Research Productivity | positive | high | reliability of inference after mitigation | 0.02 |
| When models are used in research, potential threats to inference should be systematically identified alongside the steps taken to mitigate them, and specific justifications for model selection should be provided. Research Productivity | positive | high | transparency and robustness of research inferences / research practices | 0.02 |