Closed AI models often undermine scientific inference by hiding construction and deployment details, making results hard to interpret, replicate, or generalize; researchers should explicitly document threats to inference, the mitigation steps taken, and why a given (closed or open) model was chosen.

How Open Must Language Models be to Enable Reliable Scientific Inference?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman · March 27, 2026

Source: arXiv · Type: theoretical · Evidence: n/a · Relevance: 7/10

The paper argues that closed AI models often prevent reliable scientific inference by obscuring model construction and deployment details, and recommends systematic identification of inference threats, mitigation steps, and explicit justifications for model choice in research.

How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

Summary

Main Finding

The paper argues that the increasing use of closed (hosted, black-box) language models in scientific research severely undermines reliable scientific inference. Three core inferential threats are identified — the versioning problem, the credit-assignment problem, and the insufficient-information problem — and the authors show that providing open model weights (and related artifacts) is the most direct and effective way to resolve or mitigate these threats. They recommend that researchers systematically identify inferential threats when using models, justify model choices, and document mitigations; model providers could also reduce harms by offering persistent versioning, probability outputs, and greater transparency.

Key Points

Definitions
- CLOSED MODEL: hosted/black-box system accessible via chat/API; internal components (system prompts, filters, search, wrappers) are hidden.
- OPEN-WEIGHT MODEL: full model weights, tokenizer, architecture and runnable code provided so users can compute the model’s full output distribution.
Three principal inferential threats from closed models:
- Versioning problem: hosted models change over time (undocumented updates, deprecations), so outputs from different executions are not reliably comparable and older versions often become unavailable.
- Credit-assignment problem: closed systems bundle language models with prompts, post-processing, APIs, guardrails and other components, so observed behavior cannot be attributed to the core model or to particular design features.
- Insufficient-information problem: closed models only expose sample text outputs (not the full probability distribution, not internal states), which limits evaluation (e.g., uncertainty estimates, low-probability alternatives) and makes many evaluation methods brittle or uninformative.
Open weights largely resolve these threats by enabling:
- Stable, reproducible runs of a known artifact (mitigating versioning).
- Isolation and testing of the core model apart from system wrappers (resolving credit assignment).
- Access to probability distributions, logits, and internal states for richer, less brittle evaluation (addressing insufficient information).
Open weights are necessary but not always sufficient: full reproducibility and interpretability require openness of other pipeline components (tokenizers, decoding strategy, system prompts, post-processing, hardware/software metadata).
Practical mitigation short of full openness includes: providers offering explicit versioning and persistent access, exposing decoding/probability outputs, and allowing reproducible configuration choices. Researchers should at minimum report model version, execution date, and settings used.
The paper focuses on three research goals affected by openness: model evaluation, model comparison, and model interpretability.
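To make the insufficient-information problem concrete, the following is a minimal sketch, not taken from the paper, of what access to logits adds over a single sampled output. The vocabulary, the logit values, and the helper names (`softmax`, `entropy_bits`, `summarize`) are hypothetical and chosen only for illustration. In both cases the most likely token is the same, so an interface that returns only that token would make the two cases indistinguishable, even though the underlying distributions differ sharply in uncertainty.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits; higher values mean a flatter distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def summarize(vocab, logits):
    """Return the top token plus the information a text-only interface hides."""
    probs = softmax(logits)
    top = max(range(len(vocab)), key=lambda i: probs[i])
    return {
        "top_token": vocab[top],
        "top_prob": probs[top],
        "entropy_bits": entropy_bits(probs),
        "probs": dict(zip(vocab, probs)),
    }


# Hypothetical toy vocabulary and logits (assumptions, not real model output).
VOCAB = ["yes", "no", "maybe"]
confident = summarize(VOCAB, [5.0, 0.0, 0.0])
uncertain = summarize(VOCAB, [1.0, 0.9, 0.8])

if __name__ == "__main__":
    for name, result in (("confident", confident), ("uncertain", uncertain)):
        print(name, result["top_token"],
              round(result["top_prob"], 3), round(result["entropy_bits"], 3))
```

With an open-weight model, the logits would come from running the model locally; here they are hard-coded so the example stays self-contained.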

Data & Methods

Type of contribution: conceptual analysis and structured argument (literature synthesis), not an original empirical experiment.
Sources: extensive literature review and example citations from benchmark studies, model-release notes, and prior empirical findings (e.g., documented performance shifts across GPT versions, prompt sensitivity studies, evaluation metric critiques).
Illustrative examples: documented differences across GPT-3.5/GPT-4 versions; compound-system behaviors (e.g., models augmented with calculators/APIs); sensitivity to system prompts and decoding strategy.
Analytical approach: taxonomy of openness (closed vs open-weight), mapping of inferential goals to specific threats, and evaluation of what information (weights, tokenizers, prompts, decoding details, probability outputs, versioning guarantees) would mitigate each threat. Appendix A (not reproduced here) reportedly summarizes these mappings.
Limitations: normative/conceptual focus rather than new empirical measurement; applicability demonstrated through examples and citations rather than systematic measurement across many commercial systems.

Implications for AI Economics

Reproducibility and measurement: Closed models impede replication of empirical work that uses language models (e.g., as instruments, measurement tools, or simulated agents). For economists who rely on reproducible measures or cross-study comparisons (meta-analyses, long-run studies), lack of model versioning and missing probability outputs undermines validity of results and comparability across time.
Inference about causal mechanisms and design features: Studies that compare architectures, training data regimes, or economic effects of deploying models require credit assignment. Closed compound systems hide what drives observed behavior, making it hard to infer causal links between model design and outcomes — weakening research on productivity effects, labor displacement modeling, or policy-counterfactual analyses.
Market structure and competition analysis: The economic concentration of model-providing firms, combined with opaqueness, raises challenges for market-monitoring and antitrust analysis. If researchers and regulators cannot independently measure capability differences or audit behavior, it becomes harder to detect anti-competitive strategies, verify vendor claims, or assess entry barriers and platform externalities.
Policy and regulation: Regulators aiming to assess risk, safety, or compliance need access to stable artifacts and meaningful diagnostics (e.g., uncertainty estimates). Closed models make robust auditing and certification difficult; open-weight access or mandated disclosure (versioning, access to probability outputs, logs of updates) would materially enhance regulatory capacity.
Innovation and public goods: Open-weight models lower the cost of entry for academics and startups, enabling reproducible baseline research, method development, and public-good benchmarks. Conversely, closed models centralize innovation and knowledge, possibly slowing community-driven advances in measurement, verification, and econometric uses of LMs.
Practical guidance for economists using LMs:
- Prefer open-weight models when studies aim to produce generalizable claims, comparisons, or mechanistic explanations.
- If using closed models, explicitly treat findings as tied to specific executions; document model name, apparent version, execution date, interface settings, and any observable post-processing. Where possible, obtain provider assurances (versioning, persistent access) or restrict claims accordingly.
- Avoid relying on closed LMs for inference that depends on calibrated confidence/probabilities unless the provider exposes those probabilities or unless the research is explicitly about the product-as-operated (not the underlying model).
- When studying market or policy questions about providers, stress-test conclusions given the nonstationarity and opaqueness of deployed systems.
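The reporting items in the guidance above (model name, apparent version, execution date, interface settings, observable post-processing) could be captured in a small machine-readable record stored alongside each result. The sketch below is one possible layout under stated assumptions: the class `ModelRunRecord`, its field names, the helper functions, and the example values are all hypothetical and are not prescribed by the paper.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ModelRunRecord:
    """Hypothetical per-execution record; field names are illustrative only."""
    model_name: str
    model_version: str            # for closed models, the *apparent* version
    execution_date: str           # ISO 8601 date of the run
    access: str                   # e.g. "open-weight" or "closed-api"
    settings: dict = field(default_factory=dict)  # temperature, max tokens, ...
    observed_post_processing: str = ""
    weights_sha256: Optional[str] = None          # only knowable with open weights
    justification: str = ""       # why this model was chosen


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a local weights file so the exact artifact can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def missing_fields(record):
    """List the minimum reporting fields that are empty in this record."""
    required = ["model_name", "model_version", "execution_date", "settings"]
    return [name for name in required if not getattr(record, name)]


def to_json(record):
    """Serialize the record deterministically for inclusion in supplements."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Example values are placeholders, not a real model or run.
record = ModelRunRecord(
    model_name="example-model",
    model_version="2026-01-snapshot",
    execution_date="2026-01-15",
    access="closed-api",
    settings={"temperature": 0.0, "max_tokens": 256},
    observed_post_processing="refusal message on some prompts",
    justification="study concerns the product as operated",
)

if __name__ == "__main__":
    print(to_json(record))
    print("missing:", missing_fields(record))
```

A check such as `missing_fields` can be run before results are archived, so that incomplete records are caught while the execution details are still recoverable.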
Trade-offs and policy levers: The paper highlights trade-offs between commercial incentives (protecting IP/competitive advantage, safety concerns) and scientific reproducibility/public-interest needs. For economic policy, this suggests a role for standards or incentives (e.g., procurement requirements, public funding conditionality, or disclosure mandates) to ensure critical models or those used in public-impact research offer sufficient transparency for reliable inference.


Assessment

Paper Type: theoretical
Evidence Strength: n/a. The paper is a conceptual and normative analysis rather than an empirical study; it offers arguments, examples, and recommendations but does not present primary data or causal identification tests that could be graded for empirical strength.
Methods Rigor: n/a. Methods consist of conceptual argumentation, literature-driven examples, and normative recommendations rather than formal empirical or experimental methods; rigor therefore rests on clarity, logical coherence, and completeness of the argument rather than statistical or identification procedures.
Sample: No primary sample or dataset; the paper analyzes general classes of AI models (open vs closed), points to illustrative examples and documented practices, and reviews conceptual threats to inference and possible mitigation strategies.
Themes: governance, innovation
Generalizability:
- Conclusions are conceptual and not empirically validated across domains, so practical relevance may vary by field (e.g., NLP vs computer vision vs structured-data models).
- Effects depend on access modality (open-source code/weights vs API-only access) and the specific contractual/technical constraints imposed by model providers.
- Recommendations assume researchers have resources and expertise to implement suggested mitigations (e.g., model auditing, synthetic data experiments), which may not hold in all settings.
- Regulatory or commercial contexts (e.g., safety constraints, IP protection) can limit the feasibility of openness or the recommended transparency measures.
- Some specific exceptions (e.g., well-documented closed models, benchmarked APIs) may permit stronger inferences than the general critique implies.

Claims (4)

Claim	Direction	Confidence	Outcome	Details
Restrictions on information about model construction and deployment threaten reliable inference in research that involves those models. [Research Productivity]	negative	high	reliable inference / scientific inference	0.02
Current closed models are generally ill-suited for scientific purposes (with some notable exceptions). [Research Productivity]	negative	high	suitability of models for scientific research / quality of scientific inference	0.02
The inferential issues that closed models present can be resolved or mitigated by certain measures. [Research Productivity]	positive	high	reliability of inference after mitigation	0.02
When models are used in research, potential threats to inference should be systematically identified alongside the steps taken to mitigate them, and specific justifications for model selection should be provided. [Research Productivity]	positive	high	transparency and robustness of research inferences / research practices	0.02