Evidence (11677 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
There is a persistent female disadvantage in work intensity.
Analysis of EWCTS 2021 with IFR robot-exposure measures, using weighted logit models controlling for individual and job covariates and fixed effects; gender-specific patterns examined via interaction terms.
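A minimal sketch of this design, assuming hypothetical variable names (`work_intensity`, `robot_exposure`, `female`, `survey_weight`) and a placeholder file path rather than the actual EWCTS/IFR extract:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical worker-level extract: binary work-intensity outcome,
# IFR-based robot-exposure measure, survey weights (placeholder path).
df = pd.read_csv("ewcts_2021_ifr.csv")

# Weighted logit with a gender interaction; occupation and country
# fixed effects enter as categorical dummies. Survey weights are passed
# as variance weights (one common choice for probability weights in GLM).
model = smf.glm(
    "work_intensity ~ robot_exposure * female"
    " + age + tenure + C(occupation) + C(country)",
    data=df,
    family=sm.families.Binomial(),
    var_weights=df["survey_weight"],
)
res = model.fit()

# The robot_exposure:female coefficient carries the gender-specific pattern.
print(res.summary())
```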
Monthly operational cost of running the system is approximately USD 4,000.
Full-scale performance characterization reports a monthly cost estimate of approximately USD 4,000.
Breach externalities expand the range of environments in which deployment is socially constrained.
Analytical model extension/discussion: inclusion of breach externalities increases the set of parameter values where socially optimal deployment is limited.
Optimal deployment falls below the no-risk benchmark, and this shortfall widens with breach-loss magnitude and with the authority exposure attached to more capable systems.
Analytical comparative-statics results from the model showing optimal deployment relative to a no-risk benchmark and sensitivity to breach-loss magnitude and authority exposure.
Central result (the 'deployment paradox'): in high-loss environments, more capable AI can lead a firm to deploy less of it, because under weak governance greater capability is deployed through broader authority exposure.
Analytical result derived from the paper's theoretical model (no empirical sample; comparative statics in the model demonstrate this effect).
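A toy comparative-statics sketch of this kind of result, under an assumed quadratic-loss payoff that is not the paper's actual model: deployment d yields benefit c*d (capability c) and expected breach loss p*L*a*d^2 (breach probability p, loss L, authority exposure a).

```python
import sympy as sp

d, c, L, a, p = sp.symbols("d c L a p", positive=True)

# Illustrative payoff (an assumption, not the paper's specification).
V = c * d - p * L * a * d**2

d_star = sp.solve(sp.diff(V, d), d)[0]
print(d_star)                      # c/(2*L*a*p): optimal deployment level

# Optimal deployment falls with breach loss and with authority exposure.
print(sp.diff(d_star, L))          # negative
print(sp.diff(d_star, a))          # negative

# Deployment-paradox flavor: if capability raises authority exposure fast
# enough (here a = c**2), then d* = 1/(2*p*L*c) *decreases* in capability.
d_star_coupled = sp.simplify(d_star.subs(a, c**2))
print(sp.diff(d_star_coupled, c))  # negative
```

The coupling `a = c**2` is the illustrative assumption doing the work: without it, d* increases in capability.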
The supply of AI-literate workers attenuates wage inequality effects.
Presented in the article as a distributional mechanism informed by synthesized theoretical and empirical findings; no concrete empirical methods or sample sizes are provided in the excerpt.
Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary.
Literature-review style claim made in the paper asserting a gap in prior research emphasis (novel cooperative architectures) versus investigation of training modality necessity.
The coordination gap advantage (between joint and modular training) diminishes in bottleneck environments, particularly under severe transport and processing constraints.
Results from a sensitivity analysis varying resource scarcity and temporal dominance, showing that the relative performance gap shrinks under bottleneck conditions with tight transport and processing constraints. Details of the experimental scenarios are not provided in the abstract.
These gaps are structural; more engineering effort alone will not close them.
Authors' argument/conclusion based on their analytical comparison and gap analysis (normative/assertive claim).
We identify five critical gaps (semantic intent verification, recursive delegation accountability, agent identity integrity, governance opacity and enforcement, and operational sustainability) that no current technology or regulatory instrument resolves.
Gap analysis synthesized from the structured survey of industry trends, standards, and literature; presented as findings in the paper.
An evaluation of current technical and regulatory documents against the identity requirements of autonomous agents finds that none adequately address the challenge of governing nondeterministic, boundary-crossing entities.
Document review / evaluation reported in the abstract (structured survey of technical and regulatory documents); specific documents and number reviewed are not specified in the abstract.
A structural comparison of human and AI identity across four dimensions (substrate, persistence, verifiability, and legal standing) shows that the asymmetry is fundamental and that extending human frameworks to agents without structural modification produces systematic failures.
Authors' structural comparison (analytical/theoretical method) across four dimensions, reported as a core contribution of the paper.
This creates a problem no current infrastructure is equipped to solve: how do you identify, verify, and hold accountable an entity with no body, no persistent memory, and no legal standing?
Authors' gap analysis informed by a structured survey of industry trends, emerging standards, and technical literature; presented as a synthesized conclusion from that survey.
Before the AI transition, editors should tighten acceptance standards to curb rent-dissipating author polishing.
Optimal policy characterization in the model for the regime where AI capability is below the critical threshold; derived analytically under model assumptions.
When AI capability crosses a critical threshold, reviewer effort collapses discontinuously.
Analytical result proved within the paper's three-sided equilibrium model; threshold and collapse derived theoretically (no empirical sample).
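One standard way such a discontinuity can arise (a stylized sketch, not necessarily the paper's mechanism): let reviewer effort $e$ earn $b(\alpha)\,v(e) - c(e)$, where the informativeness of review $b(\alpha)$ falls continuously as author-side AI capability $\alpha$ rises, and the reviewer's outside option is 0. Then

$$
e^*(\alpha)=
\begin{cases}
\arg\max_{e>0}\big[b(\alpha)\,v(e)-c(e)\big] & \text{if } \max_{e>0}\big[b(\alpha)\,v(e)-c(e)\big]\ge 0,\\
0 & \text{otherwise.}
\end{cases}
$$

The interior optimum stays bounded away from zero while the maximized value falls continuously through zero at some $\bar\alpha$, so effort drops discontinuously to zero there.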
Generative AI acts as a disruptive technological shock to evaluative organizations.
Stated as the motivating premise and developed throughout via a theoretical three-sided equilibrium model in the paper; no empirical sample reported (the claim is supported by model construction and analysis).
The framework addresses emerging tensions captured in the Creativity Paradox, whereby GenAI may weaken intrinsic motivation, conceptual risk-taking, and evaluative depth.
Theoretical extension of paradox theory and conceptual discussion of potential negative effects; presented as conceptual risks rather than empirically demonstrated outcomes.
Making AI usable can thus make procedures easier for future governments to learn and exploit.
Synthesis concluding claim based on the paper's formal model and argumentation (theoretical; no empirical testing reported).
The model shows why expansions in AI use may be difficult to unwind.
Analytical conclusion from the paper's formal model (theoretical argument without empirical sample).
The model explains why reforms that initially improve oversight can later increase that vulnerability.
Analytical/theoretical result from the paper's formal model (presented as an explanation; no empirical data).
The model shows when these systems become vulnerable to strategic use from within government.
Analytical result derived from the paper's formal theoretical model (no empirical validation reported).
The compliance layer can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration.
Stated conclusion/insight from the paper's formal argument and conceptual framing (theoretical, no empirical sample).
Manual tools like mind maps support structure creation but lack intelligent (AI) assistance.
Paper's comparison of manual tools versus AI-augmented tools (background/related-work discussion; no empirical evaluation reported for this claim).
Current LLM-based systems let users query information but do not let users shape how knowledge is organized.
Paper's analysis of existing tools and limitations (literature/feature comparison described in introduction; no new empirical test reported).
Knowledge workers face increasing challenges in synthesizing information from multiple documents into structured conceptual understanding.
Statement in paper's introduction/motivation; conceptual observation (no empirical data reported here).
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
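A sketch of how a subgroup-driven pattern like this can be checked with rank correlations; the data below are synthetic stand-ins, not the paper's eleven agents:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic n=11 subset: a closed-source high-capability subgroup whose
# adoption falls in capability, mixed with agents where it rises.
closed_high = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
capability = rng.uniform(0.2, 0.9, size=11)
adoption = np.where(closed_high, 1 - capability, capability)
adoption = adoption + rng.normal(0, 0.05, size=11)

df = pd.DataFrame({"capability": capability, "adoption": adoption,
                   "closed_high": closed_high})

# The pooled rank correlation can sit near zero even though the subgroup
# correlation is strongly negative, which is enough to reshuffle rankings.
print(spearmanr(df["capability"], df["adoption"]))
sub = df[df["closed_high"]]
print(spearmanr(sub["capability"], sub["adoption"]))
```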
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
Self-assessment is a key bottleneck for market-style coordination of AI agents.
Conclusion drawn from empirical results (miscalibration findings, auction divergence, modest improvement from prior-information intervention) reported in the paper.
Auctions built from these self-reports diverge from a full-information allocation.
Simulation or empirical auction experiments using self-reported signals from the six LLMs on the 93 tasks, compared to a full-information allocation benchmark (method described in paper).
These LLMs are miscalibrated on both success probability and token usage.
Empirical evaluation of six LLMs on 93 SWE-bench Lite tasks assessing calibration of predicted success probabilities and token usage (as reported in the paper).
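A minimal sketch of the allocation comparison, with synthetic success probabilities standing in for the six LLMs' self-reports on the 93 tasks (the optimistic-bias term is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, n_tasks = 6, 93  # mirrors the paper's setup: six LLMs, 93 tasks

# Synthetic ground truth and optimistically biased self-reports.
true_p = rng.uniform(0.1, 0.9, size=(n_agents, n_tasks))
reported_p = np.clip(true_p + rng.normal(0.15, 0.10, size=true_p.shape), 0, 1)

# Assign each task to the agent with the highest reported vs true probability.
alloc_reported = reported_p.argmax(axis=0)
alloc_full = true_p.argmax(axis=0)

# Score both allocations against the truth.
tasks = np.arange(n_tasks)
print("share of tasks allocated differently:",
      (alloc_reported != alloc_full).mean())
print(f"expected success, self-report allocation: "
      f"{true_p[alloc_reported, tasks].mean():.3f}")
print(f"expected success, full-information allocation: "
      f"{true_p[alloc_full, tasks].mean():.3f}")
```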
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect (a phase completes only when its slowest call returns) that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
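A simulation sketch of that maximum effect, assuming lognormal per-call latencies (an assumption; the paper instruments real deployments):

```python
import numpy as np

rng = np.random.default_rng(2)

def phase_latency(n_samples, n_parallel):
    """A parallel phase finishes only when its slowest call returns."""
    draws = rng.lognormal(mean=0.0, sigma=0.6, size=(n_samples, n_parallel))
    return draws.max(axis=1)

single = phase_latency(100_000, 1)   # single-agent baseline
multi = phase_latency(100_000, 4)    # e.g. a four-agent parallel phase

# The max of k draws has a heavier right tail than one draw, so tail
# latency degrades faster than mean latency as concurrency grows.
for name, x in [("single call", single), ("4 parallel calls", multi)]:
    print(f"{name}: mean={x.mean():.2f}, p95={np.quantile(x, 0.95):.2f}")
```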
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical claim / conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
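A back-of-envelope sketch of why input tokens dominate: a multi-turn agent re-sends its whole context each turn, so cumulative input tokens grow roughly quadratically in the number of turns while output tokens grow linearly. All numbers below (turn counts, token sizes, per-token prices) are assumptions, not the paper's measurements:

```python
# Assumed per-turn accounting for an agentic coding loop (illustrative only).
system_prompt = 2_000      # tokens in the fixed prompt
per_turn_output = 500      # tokens the model generates each turn
per_turn_tool_obs = 1_500  # tokens of tool output appended each turn
turns = 40

input_tokens = 0
context = system_prompt
for _ in range(turns):
    input_tokens += context                        # whole context re-sent
    context += per_turn_output + per_turn_tool_obs

output_tokens = per_turn_output * turns

# Hypothetical pricing with the usual input/output asymmetry (USD per token).
price_in, price_out = 3e-6, 15e-6
print(f"input:  {input_tokens:,} tokens -> ${input_tokens * price_in:.2f}")
print(f"output: {output_tokens:,} tokens -> ${output_tokens * price_out:.2f}")
```

With these assumed numbers the input side costs roughly $4.92 against $0.30 for output, even though output tokens are priced five times higher.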
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes.
Background statement in the paper's introduction; general literature/field observation (no new primary data reported for this claim in the abstract).
Under our definition, contestants with types below a certain threshold (low types) always engage in benchmark hacking, whereas those above the threshold do not.
Theoretical result (characterization/theorem) derived from the contest model showing threshold behavior in equilibrium across contestant types.
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
A vulnerability class is characterised for expected-utility maximisers, making them susceptible to adversarial gambles.
Formal characterization/definition and analytical derivation in the paper describing which expected-utility maximisers are vulnerable to adversarial (Pascal-type) offers; theoretical examples provided rather than empirical tests.
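A numeric sketch of the Pascal-type hole, assuming linear (unbounded) utility; the function and numbers are illustrative, not the paper's construction:

```python
def accepts(stake, eps, payoff):
    """An expected-utility maximiser with u(x) = x accepts iff EU > 0."""
    return eps * payoff - stake > 0

stake = 1_000_000   # what the adversary asks for up front
eps = 1e-12         # the agent's credence in the adversary's promise
payoff = 1e20       # the promised prize; the adversary can name any number

print(accepts(stake, eps, payoff))  # True: 1e-12 * 1e20 = 1e8 > 1e6
# Because utility is unbounded, the adversary can always name a payoff
# large enough to clear any stake at any positive credence.
```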
Ungoverned coupling between humans and AI can produce fragility, lock-in, polarization, and domination basins.
Theoretical/modeling analysis showing destabilizing dynamics and multiple basins of attraction when governance regularization is absent or weak; no empirical sample.
Classical robot ethics framed around obedience (e.g. Asimov's laws) is too narrow for contemporary AI systems.
Literature synthesis and conceptual argument drawing on developments in adaptive, generative, embodied, and embedded AI; no empirical sample reported.
Industry digital maturity weakens the effect of the peer leader on a focal firm’s AI adoption.
Interaction/heterogeneity analysis in fixed-effects regression models on panel data of publicly listed Chinese firms (2012–2023), using an industry digital maturity moderator.
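A sketch of such a moderation test, with hypothetical variable and file names standing in for the paper's firm-year panel:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical 2012-2023 firm-year panel of listed firms (placeholder path).
df = pd.read_csv("firm_panel.csv")

# Two-way fixed effects with the moderation term. A negative coefficient on
# peer_leader_ai:industry_digital_maturity would indicate that industry
# digital maturity weakens the peer-leader effect on focal-firm AI adoption.
model = smf.ols(
    "ai_adoption ~ peer_leader_ai * industry_digital_maturity"
    " + size + leverage + roa + C(firm_id) + C(year)",
    data=df,
)
res = model.fit(cov_type="cluster", cov_kwds={"groups": df["firm_id"]})
print(res.summary())
```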
Current evaluation proxies are insufficient for predicting downstream human impact.
Empirical results in the paper showing decoupling between standard quantitative proxies (e.g., sparsity, faithfulness) and human outcomes (clarity, decision utility, confidence) across datasets and analyst reviews.
A highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents.
Constructive worst-case examples and theoretical bounds in the paper demonstrating arbitrarily large performance degradation when applying sophisticated-optimal policies to naive agents.
Optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings.
Theoretical complexity results and proofs in the paper showing hardness of the optimization problem under the sophisticated-agent model; no sample/calibration required (formal/algorithmic analysis).