Evidence (7448 claims)

Claim counts by topic:

- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
The study is qualitative and law-focused, using Vietnam as a single case study without collecting primary quantitative field data.
Explicit Data & Methods statement in the paper indicating doctrinal legal analysis, comparative institutional analysis, and normative framework development; no primary quantitative sample.
The study recommends empirical metrics for future evaluation of reforms, including processing time per case, reversal rates on appeal, administrative litigation frequency, compliance and procurement costs, investment flows into public-sector AI, and changes in labor composition and wages in administrative agencies.
Methodological recommendation arising from the paper's normative and comparative analysis.
The paper's argument is principally theoretical and prescriptive and requires empirical validation across domains and at scale.
Author-stated limitation in the Data & Methods section noting that the work is primarily conceptual and that empirical validation is needed.
Operationalizing DSS requires building domain ontologies/knowledge graphs, designing synthetic curricula, training compact domain models, benchmarking against monolithic LLMs, and measuring total cost-of-ownership (energy, latency, bandwidth, infrastructure).
Paper's recommended experimental and measurement agenda (procedural/methodological prescriptions); this is a proposed research plan rather than an empirical result.
Analysis compared responses across 16 predefined dimension pairs (ethical dimensions or response axes) and used repeated measures and qualitative coding to characterize system behavior.
Methods and Analysis sections reporting use of 16 dimension-pair comparisons, repeated-measures tests for delta between blind and declared administrations, and qualitative coding to derive D3 failure taxonomy.
Probe administration included operational controls: runs were administered by two human raters across three machines to ensure operational consistency.
Methods statement describing administration by two human raters on three machines.
The ceiling discrimination probe used Gemini Pro (Google) and Copilot Pro (Microsoft) as independent judges.
Methods: reported use of Gemini Pro and Copilot Pro as independent judges for the ceiling probe.
Primary blind scoring was performed by Claude (Anthropic) used as an LLM judge.
Methods: primary blind scoring explicitly performed by Claude.
Re-administration under declared conditions produced zero delta across all 16 dimension-pair comparisons (no measurable change when declaration status changed).
Reported repeated-measures comparisons across 16 predefined dimension pairs between blind and declared administrations, with reported zero delta.
Series 2 consisted of local and API open-source systems (n = 6) administered blind and declared, with four systems re-administered under declared conditions.
Methods description detailing Series 2 composition, modes (blind and declared), and that four systems were re-tested under declared conditions.
Series 1 consisted of frontier commercial systems administered blind (n = 7).
Methods description specifying Series 1 composition and blind administration.
The study employed 24 experimental conditions spanning 13 distinct LLM systems across two series.
Study design reported in Methods: Series 1 (frontier commercial, blind, n=7), Series 2 (local/API open-source, blind and declared, n=6), plus re-administered declared runs and ceiling-probe runs summing to 24 conditions.
The paper does not claim proprietary deployment metrics beyond qualitative field observations; experimental formalizations are provided for reproducible evaluation instead.
Authors explicitly note they document how to reproduce experiments but do not claim proprietary deployment metrics beyond qualitative field observations.
The paper recommends tracking specific operational and economic metrics: MTTR for tool failures, per-invocation latency variance, per-interaction operational cost, frequency of identity-related incidents, human remediation hours per 1,000 incidents, and SLA breach rates.
Explicit list of recommended metrics in the implications and metrics-to-track sections of the paper.
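As a sketch, two of the recommended operational metrics (MTTR for tool failures and SLA breach rate) could be computed from an incident log as below; the record layout and numbers are illustrative assumptions, not data from the paper.

```python
# Hypothetical incident records: (detected_at, resolved_at, sla_minutes).
incidents = [
    (0, 30, 45),
    (100, 190, 45),
    (300, 340, 45),
]

def mttr(incidents):
    """Mean time to repair, in the log's time unit (minutes here)."""
    durations = [resolved - detected for detected, resolved, _ in incidents]
    return sum(durations) / len(durations)

def sla_breach_rate(incidents):
    """Fraction of incidents whose repair time exceeded the SLA."""
    breaches = sum(1 for d, r, sla in incidents if (r - d) > sla)
    return breaches / len(incidents)
```

Per-invocation latency variance and remediation hours per 1,000 incidents would follow the same pattern of aggregating over log records.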
The paper provides a production-readiness checklist and instructions for reproducible evaluation alongside the proposed mechanisms.
Deliverables enumerated in the paper include a production-readiness checklist and reproducible experimental methodology.
All three proposed mechanisms (CABP, ATBA, SERF) are formalized as testable hypotheses with reproducible experimental methodology (benchmarks, latency/error models, broker pipeline semantics).
Paper includes formal descriptions and reproducible evaluation instructions and benchmarks; authors state methods to reproduce experiments are provided.
The paper organizes production failure modes across five dimensions—server contracts, user context, timeouts, errors, and observability—and provides concrete failure vignettes from an enterprise deployment.
Taxonomy and failure vignettes are listed as design artifacts and deliverables in the paper; derived from observational analysis of production logs and incidents.
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), measured dispersion (e.g., IQR) across agents showed little change following the peer-review stage across the six hypotheses and agent pool (n=150).
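The cross-agent dispersion measurement described above can be sketched as follows; the effect estimates are synthetic placeholders, not the study's SPY TAQ results.

```python
import numpy as np

def iqr(estimates):
    """Interquartile range: a robust measure of cross-agent dispersion."""
    q75, q25 = np.percentile(estimates, [75, 25])
    return q75 - q25

rng = np.random.default_rng(0)

# Hypothetical effect estimates from 150 agents, before and after peer review.
stage1 = rng.normal(loc=0.5, scale=0.2, size=150)    # independent analyses
stage2 = stage1 + rng.normal(0, 0.02, size=150)      # after AI peer review

# A near-unchanged IQR between stages would mirror the reported minimal
# reduction in dispersion following the peer-review stage.
```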
The work is qualitative and exploratory — presenting naturalistic phenomena rather than causal empirical estimates, and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
Future empirical work should measure calibration (user trust vs. model accuracy), hallucination rate, user comprehension of capability limits, and behavioral dependence on system recommendations.
Explicit methodological recommendations and suggested metrics in the paper; these are proposed future measurements rather than reported findings.
Conversational AI differs from interpersonal conversation: it has no true beliefs/intentions or accountability and produces probabilistic, sometimes inconsistent outputs with opaque training/data provenance.
Analytical/distinctive claim based on properties of LLMs and machine learning models discussed in the paper; conceptual analysis, no empirical testing.
CoMAI is a modular, four-agent interview-assessment framework coordinated by a centralized finite-state machine.
System design and implementation described in the paper: a pipeline of four specialized agents (question generation, security/validation, scoring by rubric, summarization/reporting) with a centralized finite-state machine enforcing workflow and information flow constraints.
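A centralized finite-state machine coordinating a four-agent pipeline could be sketched as below; the state names and agent stubs are illustrative assumptions, not CoMAI's actual implementation.

```python
# Each agent mutates shared interview state and returns the next FSM state.
# Agent bodies are placeholders standing in for LLM-backed components.

def question_agent(state):
    state["question"] = "Describe a project you led."
    return "VALIDATE"

def security_agent(state):
    state["valid"] = True  # placeholder security/validation check
    return "SCORE" if state["valid"] else "QUESTION"

def scoring_agent(state):
    state["score"] = 4  # placeholder rubric-based score
    return "SUMMARIZE"

def summary_agent(state):
    state["report"] = f"Score: {state['score']}/5"
    return "DONE"

TRANSITIONS = {
    "QUESTION": question_agent,
    "VALIDATE": security_agent,
    "SCORE": scoring_agent,
    "SUMMARIZE": summary_agent,
}

def run_interview():
    state, current = {}, "QUESTION"
    while current != "DONE":
        current = TRANSITIONS[current](state)  # FSM enforces workflow order
    return state
```

Centralizing transitions in one table is what lets the FSM enforce both workflow order and which information each agent may touch.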
Field experiments (A/B testing) and willingness-to-pay experiments are necessary to quantify monetary benefits, adoption curves, and optimal pricing for alignment capabilities.
Paper explicitly recommends these empirical approaches in the recommendations for economists and product teams; this is a methodological recommendation rather than an empirical finding.
Recommended evaluation directions include automatic metrics (embedding similarity, task success, turn counts), human evaluation (satisfaction, perceived collaboration), and A/B testing in deployed settings (latency, compute, retention).
Paper's explicit evaluation proposals and recommended metrics listed in the Data & Methods and Evaluation Directions sections; these are prescriptive recommendations rather than executed experiments.
The paper focuses on architecture and conceptual arguments rather than reporting large-scale empirical datasets or results.
Data & Methods section and overall document framing emphasize architecture description and proposed evaluations; explicitly notes absence of large-scale empirical results in the provided summary.
Alignment verification can be implemented using semantic embeddings (cosine similarity) or learned classifiers with threshold-based decision branching.
Paper describes these as recommended implementation approaches for the alignment verification component; no empirical benchmark comparing methods is reported.
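A minimal sketch of the embedding-based variant, with threshold-based branching: the embedding vectors and the 0.8 threshold are illustrative assumptions, since the paper reports no benchmark fixing them.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_aligned(response_emb, intent_emb, threshold=0.8):
    """Branch on whether the response embedding is close enough to the intent."""
    return cosine_similarity(response_emb, intent_emb) >= threshold
```

A learned classifier would replace `cosine_similarity` with a model score, keeping the same threshold-based branching.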
Temporal decay in the retrieval component can be modeled with functions such as exponential decay and a tunable half-life parameter applied to dialogue-turn embeddings.
Methodological description in the paper specifying temporal decay modeling options (exponential decay example) and tunable parameters; descriptive claim about intended implementation (no empirical comparison of decay functions provided).
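The exponential-decay option with a tunable half-life can be sketched as follows; the half-life default and the combination with a similarity score are illustrative assumptions.

```python
def decay_weight(turns_ago, half_life=10.0):
    """Exponential decay weight for a dialogue turn `turns_ago` turns in the past.

    half_life: number of turns after which a turn's weight halves (tunable).
    """
    return 0.5 ** (turns_ago / half_life)

def retrieval_score(similarity, turns_ago, half_life=10.0):
    """Down-weight an embedding-similarity score by the turn's age."""
    return similarity * decay_weight(turns_ago, half_life)
```

Other decay functions (e.g., linear or step decay) would slot into `decay_weight` without changing the retrieval interface.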
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
Sample sizes reported: human–AI experiment n = 126; human–human benchmark n = 108.
Study's Data & Methods section reporting sample sizes for the human–AI experiment (n = 126) and citing the human–human benchmark (Dvorak & Fehrler 2024, n = 108).
Experimental design: subjects played an indefinitely repeated Prisoner’s Dilemma in supergames with two between-subjects treatments varying chat timing (chat only before first round of each supergame vs chat before every round); the AI partner was GPT-5.2.
Methods description of the lab experiment reported in the paper: indefinitely repeated PD in supergames, two chat-frequency between-subjects treatments, AI implemented as GPT-5.2; human–AI sample n = 126.
Allowing repeated pre-play communication (chat before every round) has no detectable effect on cooperation rates when the partner is an AI.
Between-subjects manipulation within the human–AI experiment comparing chat-before-first-round vs chat-before-every-round treatments (human–AI n = 126 total); statistical comparison of cooperation rates across the two chat-frequency treatments showed no detectable difference.
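A comparison of cooperation rates across the two chat-frequency treatments could be run as a two-proportion z-test, sketched below; the test choice and the counts in the usage note are hypothetical, not the paper's analysis.

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

With similar cooperation counts in both treatments (e.g., `two_proportion_ztest(40, 63, 41, 63)`), the test returns a large p-value, consistent with "no detectable difference."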
Initial cooperation rates against the AI (GPT-5.2) are high and comparable to initial cooperation in human–human pairs.
Laboratory experiment with human subjects playing an indefinitely repeated Prisoner’s Dilemma against an AI chatbot (GPT-5.2); human–AI sample n = 126; human–human benchmark taken from Dvorak & Fehrler (2024) with n = 108; comparison of initial-round / early-round cooperation rates across conditions.
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation should follow via suggested experiments.
Explicit statement in the paper about nature of contribution.
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLM models.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
A data-driven taxonomy was derived mapping common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic and physically rich than prior datasets).
Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.