Evidence (13870 claims)
Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Changes in skill demand in online labour markets are an outcome of introducing platform-embedded GenAI.
Synthesis of the study's empirical findings (difference-in-differences results showing increased skill diversity in logo jobs after the logo-AI launch, plus mediation evidence via competition), which supports the broader conclusion that platform-embedded GenAI can change skill demand on online labour platforms.
Stronger competition among freelancers partially mediates the effect of the platform-embedded logo-AI on higher skill diversity in logo jobs.
Mediation analysis within the difference-in-differences framework linking measures of freelancer competition to changes in requested skill diversity after the logo-AI launch. Specific mediation estimation details and sample size not provided in the abstract.
Logo jobs exhibit higher skill diversity than other design jobs after the platform introduced logo-AI.
Difference-in-differences comparison of skill-diversity metrics extracted via the authors' LLM-based skill extraction and embedding framework on EPWK job posts for logo design (treatment) versus other design jobs (control), pre- and post-introduction of the platform-embedded logo-AI tool. Sample size not reported in the abstract.
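The difference-in-differences comparison described above can be sketched in a few lines. This is a minimal two-group, two-period illustration, not the authors' estimation code: the function name `did_estimate` is hypothetical and the skill-diversity scores are invented for the example.

```python
from statistics import mean

def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Two-group, two-period difference-in-differences on group means."""
    treat_change = mean(treat_post) - mean(treat_pre)
    ctrl_change = mean(ctrl_post) - mean(ctrl_pre)
    # The control group's change proxies what would have happened to the
    # treated group without the intervention (parallel-trends assumption).
    return treat_change - ctrl_change

# Hypothetical skill-diversity scores per job post (illustrative only).
logo_pre = [2.0, 2.2, 1.8]     # logo jobs before the logo-AI launch
logo_post = [3.0, 3.2, 2.8]    # logo jobs after the launch
other_pre = [2.1, 1.9, 2.0]    # other design jobs before
other_post = [2.3, 2.1, 2.2]   # other design jobs after

effect = did_estimate(logo_pre, logo_post, other_pre, other_post)
print(round(effect, 2))
```

A positive estimate means skill diversity rose more in the treated category than in the control category over the same period. A real analysis would add regression controls and standard errors, which the abstract does not detail.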
Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.
Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.
Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.
Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.
Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.
Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.
A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.
Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.
The open-source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.
Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.
The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.
Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.
The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices.
Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.
Tools that can automate aspects of modeling practice must complement human expertise, not replace it.
Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.
AI tools to support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable.
Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.
The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.
Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.
These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.
Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.
The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.
Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.
The system preserves human agency via override mechanisms.
Design description of the collaborative forecasting system that explicitly includes override controls for human users.
The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).
Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.
Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.
Qualitative evaluation consisting of two think-aloud sessions reported in the paper.
Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.
Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).
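A relative error reduction over a naive baseline, as in the benchmarking claim above, can be computed as follows. This is a sketch under stated assumptions: the excerpt names neither the error metric nor the baseline, so mean absolute error and a last-value naive forecast are assumed here, and all function names and data are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def naive_forecast(series):
    """Last-value baseline (assumed): predict each point with the previous one."""
    return series[:-1]

def error_reduction(actual, model_forecast):
    """Fractional MAE reduction of the model relative to the naive baseline."""
    targets = actual[1:]                       # first point has no naive forecast
    baseline = mae(targets, naive_forecast(actual))
    model = mae(targets, model_forecast)
    return (baseline - model) / baseline

# Illustrative demand series and model forecasts for points 2..5 (invented data).
actual = [100, 110, 105, 120, 115]
model_forecast = [107, 108, 113, 118]

print(f"{error_reduction(actual, model_forecast):.1%}")
```

A value of 0.30 from `error_reduction` would correspond to the 30% reduction reported in the claim; the invented data here is not meant to reproduce that figure.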
Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.
Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.
Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.
Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.
Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.
Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.
Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.
Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.
RADAR reduces median diff review wall time by 35%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.
RADAR reduces median time to close by over 330%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.
The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.
Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.
Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.
Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.
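How a percentile cut-off on a risk score translates into the share of eligible diffs can be illustrated with a small sketch. The excerpt does not describe how RADAR applies the Diff Risk Score, so the rule below (a diff is eligible when its score is at or below the cut-off) is an assumption, and `eligible_share` and the scores are hypothetical. The sketch does not reproduce the reported 60.31% approve rate, which depends on details not given in the excerpt.

```python
from statistics import quantiles

def eligible_share(scores, percentile):
    """Share of diffs whose risk score is at or below the percentile cut-off."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles 1..99.
    cuts = quantiles(scores, n=100, method="inclusive")
    threshold = cuts[percentile - 1]
    return sum(s <= threshold for s in scores) / len(scores)

# Hypothetical risk scores, one per diff (illustrative only).
scores = list(range(1, 101))

print(eligible_share(scores, 25))  # stricter threshold
print(eligible_share(scores, 50))  # relaxed threshold
```

Relaxing the cut-off from the 25th to the 50th percentile doubles the eligible share in this toy distribution; in a deployed system the resulting approve rate also depends on what happens to diffs after they pass the threshold.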
RADAR has reviewed 535K+ diffs and landed 331K+ changes.
System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'
Agentic AI was responsible for over 80% of the growth in code volume.
Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.
Per-developer diff volume rose 51% (year over year) at Meta.
Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.
At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.
Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.
The concordance has many relevant applications in research and policy analyses of innovation.
Claim about the utility and applicability of the concordance stated by the authors; no enumeration of specific applications or empirical demonstrations included in excerpt.
The concordance can be used to track the diffusion of patented technologies at the technology, firm, region, or country level.
Stated intended applications of the concordance in the paper; excerpt does not present empirical case studies or performance metrics.
We develop, validate and share a novel concordance between technology classes in patent records and market classes in trademark records.
Primary methodological contribution reported by the authors (development, validation, and sharing of a concordance); excerpt does not include validation method details or sample size.
Patent and trademark data can be combined to link given technologies to specific markets.
Conceptual/methodological claim in paper proposing combination of patent and trademark records to map technologies to markets; excerpt does not include empirical validation details.
Trademark filings that accompany the market introduction of new goods and services are a data source that can reveal the market introduction of technologies.
Descriptive claim in paper noting trademarks as a complementary data source to patents; no sample size or validation details in excerpt.
Patent data is the preferred source of information for tracking technological change.
Statement in paper (introductory claim); no empirical sample or method reported in excerpt.
Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
Policy/recommendation proposed by the authors based on their findings (argument that independent verification is necessary).
Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold.
Experimental result reported in the paper showing over-reporting due solely to tokenizer ambiguity when reasoning string is visible (no sample size in excerpt).
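Tokenization ambiguity means the same visible string can be segmented into different numbers of tokens, so a visible reasoning string does not pin down a single token count. The sketch below uses a toy vocabulary to compute the smallest and largest token counts over all valid segmentations. The vocabulary and the function `token_count_range` are hypothetical, and the toy does not reproduce the 50.85% figure, which comes from the paper's experiments with real tokenizers and detection thresholds.

```python
from functools import lru_cache

# Toy vocabulary (an assumption for illustration, not a real tokenizer's).
VOCAB = {"t", "h", "e", "n", "th", "he", "en", "the", "then"}

def token_count_range(text, vocab=VOCAB):
    """Return (min, max) token counts over all valid segmentations of text,
    or None if the text cannot be segmented with the vocabulary."""
    @lru_cache(maxsize=None)
    def best(i):
        if i == len(text):
            return (0, 0)
        options = []
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in vocab:
                rest = best(j)
                if rest is not None:
                    options.append(rest)
        if not options:
            return None
        return (1 + min(lo for lo, _ in options),
                1 + max(hi for _, hi in options))
    return best(0)

lo, hi = token_count_range("then")
print(lo, hi)
```

Here the shortest encoding uses one token and the longest uses four, so a count anywhere in that range is consistent with the visible text. An auditor who sees only the string cannot tell which segmentation was actually produced.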
At current frontier reasoning prices, an average inflation of 1,469% turns a $100 honest bill into roughly a $1,569 bill on the same query.
Numerical example/price calculation based on the reported inflation (uses current frontier reasoning prices; calculation given by the authors).
In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection.
Experimental/adversarial evaluation reported in the paper showing average inflation in a permissive audit setting (no sample size for queries provided in excerpt).
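The billing figure follows directly from the inflation percentage. As a minimal sketch, assuming the whole bill scales with the hidden reasoning token count at a flat per-token price (an assumption, since the excerpt does not break the bill down), with the hypothetical helper `inflated_bill`:

```python
def inflated_bill(honest_bill, inflation_pct):
    """Bill after token counts are inflated by inflation_pct percent."""
    return honest_bill * (1 + inflation_pct / 100)

# A 1,469% inflation adds 14.69 times the honest amount on top of it.
print(round(inflated_bill(100, 1469)))  # 1569
```

An inflation of 1,469% therefore multiplies the honest bill by 15.69, which matches the $100 to roughly $1,569 example.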
We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts.
Empirical/analytical evaluation of three token-auditing frameworks studied by the authors; adversarial provider simulation/experiment (paper states three frameworks were studied).
Per-token billing is now the standard pricing model for commercial large language models (LLMs).
Author assertion about prevailing commercial pricing practices (no empirical sample or citation provided in excerpt).
We discuss implications for Information Systems (IS) design and propose future field evaluations.
Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.
The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.
Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.
Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.
Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.
The multi-agent system reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.
Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).
Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.
Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.