The Commonplace

Evidence (13870 claims)

Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                    Positive Negative    Mixed     Null    Total
Other                           749      196       98      892     1984
Governance & Regulation         817      394      188      121     1544
Organizational Efficiency       771      189      124       83     1177
Technology Adoption Rate        627      233      123       96     1088
Research Productivity           411      123       56      332      933
Output Quality                  467      178       59       47      751
Decision Quality                320      174       75       42      618
Firm Productivity               435       55       88       20      604
AI Safety & Ethics              214      276       65       33      593
Market Structure                178      167      122       24      496
Task Allocation                 207       64       71       32      379
Skill Acquisition               165       59       60       17      301
Innovation Output               203       27       43       18      292
Employment Level                105       52      107       13      279
Fiscal & Macroeconomic          131       69       43       26      276
Consumer Welfare                116       63       42       11      232
Firm Revenue                    150       48       26        3      227
Inequality Measures              44      122       49        6      221
Task Completion Time            169       29        8       12      219
Worker Satisfaction              89       63       20       12      184
Error Rate                       69       92       10        2      173
Regulatory Compliance            76       68       14        5      163
Training Effectiveness           93       21       13       19      148
Wages & Compensation             77       36       25        6      144
Automation Exposure              51       54       22       12      142
Team Performance                 86       17       27        9      140
Developer Productivity           94       17       14        6      132
Job Displacement                 12       80       20        1      113
Hiring & Recruitment             51        7        8        3       69
Creative Output                  31       17        7        3       59
Skill Obsolescence                5       46        6        1       58
Social Protection                27       16        8        2       53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
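As a rough illustration of how the matrix can be read, the sketch below computes the share of claims in each direction for three rows copied from the table. The function names and the dictionary layout are illustrative assumptions, not part of the dashboard. Shares are taken over the four listed directions, because the Total column does not always equal their sum (for Other, the four counts add to 1935 against a listed total of 1984).

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (749, 196, 98, 892, 1984),
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
}

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(row):
    """Return each direction's share of the four listed counts."""
    counts = row[:4]
    listed = sum(counts)
    return {name: count / listed for name, count in zip(DIRECTIONS, counts)}


def unlisted(row):
    """Claims counted in Total but in none of the four listed columns."""
    return row[4] - sum(row[:4])


for outcome, row in ROWS.items():
    shares = {name: round(share, 3) for name, share in direction_shares(row).items()}
    print(outcome, shares, "unlisted:", unlisted(row))
```

Reading the rows this way makes contrasts such as Firm Productivity (mostly positive) against Job Displacement (mostly negative) easier to compare than raw counts.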
Changes in skill demand in online labour markets are an outcome of introducing platform-embedded GenAI.
Synthesis of the study's empirical findings (difference-in-differences results showing increased skill diversity in logo jobs post-logo-AI and mediation evidence via competition) leading to the broader conclusion that platform-embedded GenAI can change skill demand on online labour platforms.
high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... skill demand (changes in requested skills on online labour platforms)
Stronger competition among freelancers partially mediates the effect of the platform-embedded logo-AI on higher skill diversity in logo jobs.
Mediation analysis within the difference-in-differences framework linking measures of freelancer competition to changes in requested skill diversity after the logo-AI launch. Specific mediation estimation details and sample size not provided in the abstract.
high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... skill diversity (mediated by freelancer competition)
Logo jobs exhibit higher skill diversity than other design jobs after the platform introduced logo-AI.
Difference-in-differences comparison of skill-diversity metrics extracted via the authors' LLM-based skill extraction and embedding framework on EPWK job posts for logo design (treatment) versus other design jobs (control), pre- and post-introduction of the platform-embedded logo-AI tool. Sample size not reported in the abstract.
high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... requested skill diversity in job posts
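The difference-in-differences comparison described above can be sketched in its simplest two-group, two-period form. This is a minimal illustration, not the study's estimator: the function name is hypothetical and the skill-diversity scores are made-up numbers, since the abstract reports neither the data nor the sample size.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change


# Hypothetical skill-diversity scores per job post (NOT data from the study).
logo_pre = [2.0, 2.2, 1.8]     # logo jobs before the logo-AI launch
logo_post = [2.9, 3.1, 3.0]    # logo jobs after the launch
other_pre = [2.1, 1.9, 2.0]    # other design jobs before the launch
other_post = [2.3, 2.2, 2.1]   # other design jobs after the launch

effect = did_estimate(logo_pre, logo_post, other_pre, other_post)
print(round(effect, 2))
```

Subtracting the control group's change removes trends shared by all design jobs, so a positive estimate is attributed to the logo-AI launch only under the usual parallel-trends assumption.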
Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.
Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... planned incorporation of bias-aware benchmarks and human-centered use case consi...
Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.
Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... types/categories of tests implemented
Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.
Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and application of implemented evaluation tests across types of modeli...
A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.
Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... organizational roles for benchmark prioritization and implementation
The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.
Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... transparency and breadth of contributions enabled by the open source sd ai proje...
The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.
Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... use of open infrastructure for collaborative evaluation
The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices.
Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and purpose of the BEAMS Initiative (benchmarking for responsible/ethi...
Tools that can automate aspects of modeling practice must complement human expertise, not replace it.
Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... relationship between automated modeling tools and human expertise (complementari...
AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them interpretable.
Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... ability of AI tools to build interpretable simulation models that inform recomme...
The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.
Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... policy/leadership influence on welfare distribution of AI-driven innovation
These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.
Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... productivity trajectory (J-curve) following complementary investments
The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.
Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... magnitude/importance of disruption arising from Reconstruction-stage changes
The system preserves human agency via override mechanisms.
Design description of the collaborative forecasting system that explicitly includes override controls for human users.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... preservation of human agency (ability to override algorithmic forecasts)
The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).
Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... guidance/blueprint for operational planning tool design
Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.
Qualitative evaluation consisting of two think-aloud sessions reported in the paper.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... importance/role of human judgment in handling high-uncertainty forecasting event...
Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.
Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).
Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.
high positive Dissociative Identity: Language Model Agents Lack Grounding ... governance effectiveness of observability-based, ex ante protocol mechanisms
Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.
Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.
high positive Dissociative Identity: Language Model Agents Lack Grounding ... trustworthy behavior (sustaining equilibrium of trust)
Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.
Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.
high positive Teaming Up with Artificial Agents in Non-routine Analytical ... conversational pattern similarity to human–human dialogue
Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.
Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.
high positive Teaming Up with Artificial Agents in Non-routine Analytical ... task completion / success rate
Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... reduction in review bottlenecks and preservation of production safety
RADAR reduces median diff review wall time by 35%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median diff review wall time
RADAR reduces median time to close by over 330%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median time to close for diffs
The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.
Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... production incident rate (RADAR vs non-RADAR)
The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.
Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... diff revert rate (RADAR vs non-RADAR)
Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.
Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... approve rate of diffs under RADAR as a function of Diff Risk Score threshold
RADAR has reviewed 535K+ diffs and landed 331K+ changes.
System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... number of diffs reviewed and diffs landed by RADAR
Agentic AI was responsible for over 80% of that growth in code volume.
Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... share of growth in code/diff volume attributable to agentic AI
Per-developer diff volume rose 51% (year over year) at Meta.
Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... per-developer diff volume (year-over-year change)
At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.
Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... lines of code per human-landed diff (year-over-year growth)
The concordance has many relevant applications in research and policy analyses of innovation.
Claim about the utility and applicability of the concordance stated by the authors; no enumeration of specific applications or empirical demonstrations included in excerpt.
high positive A concordance between patent and trademark classes to link t... potential applicability of the concordance for research and policy
The concordance can be used to track the diffusion of patented technologies at the technology, firm, region, or country level.
Stated intended applications of the concordance in the paper; excerpt does not present empirical case studies or performance metrics.
high positive A concordance between patent and trademark classes to link t... ability to track diffusion of patented technologies across multiple aggregation ...
We develop, validate and share a novel concordance between technology classes in patent records and market classes in trademark records.
Primary methodological contribution reported by the authors (development, validation, and sharing of a concordance); excerpt does not include validation method details or sample size.
high positive A concordance between patent and trademark classes to link t... existence and release of a concordance mapping patent technology classes to trad...
Patent and trademark data can be combined to link given technologies to specific markets.
Conceptual/methodological claim in paper proposing combination of patent and trademark records to map technologies to markets; excerpt does not include empirical validation details.
high positive A concordance between patent and trademark classes to link t... linkage between technologies (patents) and markets (trademarks)
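One way to picture how such a concordance might be used is as a weighted mapping from technology classes to market classes. The sketch below is an assumption-laden illustration: the class labels, the weights, and the function name are placeholders, and the excerpt does not say whether the published concordance uses weights at all.

```python
from collections import defaultdict

# Hypothetical concordance: patent technology class -> {trademark market class: weight}.
# Labels and weights are placeholders, NOT values from the published concordance.
CONCORDANCE = {
    "TECH_A": {"MARKET_1": 0.7, "MARKET_2": 0.3},
    "TECH_B": {"MARKET_2": 1.0},
}


def patents_to_markets(patent_counts, concordance):
    """Distribute patent counts per technology class across linked market classes."""
    market_totals = defaultdict(float)
    for tech_class, count in patent_counts.items():
        # Technology classes missing from the concordance contribute nothing.
        for market_class, weight in concordance.get(tech_class, {}).items():
            market_totals[market_class] += count * weight
    return dict(market_totals)


print(patents_to_markets({"TECH_A": 10, "TECH_B": 5}, CONCORDANCE))
```

Running the same aggregation per firm, region, or country would give the kind of diffusion tracking the claims above describe.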
Trademark filings that accompany the market introduction of new goods and services are a data source that can reveal the market introduction of technologies.
Descriptive claim in paper noting trademarks as a complementary data source to patents; no sample size or validation details in excerpt.
high positive A concordance between patent and trademark classes to link t... ability to detect market introduction of goods/services via trademark filings
Patent data is the preferred source of information for tracking technological change.
Statement in paper (introductory claim); no empirical sample or method reported in excerpt.
high positive A concordance between patent and trademark classes to link t... usefulness of patent data for tracking technological change
Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
Policy/recommendation proposed by the authors based on their findings (argument that independent verification is necessary).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... requirements for restoring honest billing (types of verification needed)
Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold.
Experimental result reported in the paper showing over-reporting due solely to tokenizer ambiguity when reasoning string is visible (no sample size in excerpt).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... percent over-reporting of billed tokens due to tokenization ambiguity
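Tokenization ambiguity means the same visible string can be split into different numbers of tokens. The sketch below uses a toy vocabulary (an assumption, not a real tokenizer) and a hypothetical helper to find the fewest and most tokens any valid segmentation can produce. It does not model the paper's auditing frameworks or their detection threshold.

```python
def token_count_range(text, vocab):
    """Return (fewest, most) tokens over all segmentations of text into vocab entries."""
    n = len(text)
    inf = float("inf")
    fewest = [0] + [inf] * n     # fewest[i]: minimum tokens covering text[:i]
    most = [0] + [-inf] * n      # most[i]: maximum tokens covering text[:i]
    for end in range(1, n + 1):
        for start in range(end):
            # Extend only prefixes that are themselves segmentable.
            if text[start:end] in vocab and fewest[start] != inf:
                fewest[end] = min(fewest[end], fewest[start] + 1)
                most[end] = max(most[end], most[start] + 1)
    if fewest[n] == inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    return fewest[n], most[n]


# Toy vocabulary for illustration only.
VOCAB = {"t", "o", "k", "e", "n", "to", "ken", "token"}

low, high = token_count_range("token", VOCAB)
print(low, high)
```

With this vocabulary the word can be counted as one token or as five single-character tokens, so a count taken from the visible text alone does not pin down what was actually generated.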
At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query.
Numerical example/price calculation based on the reported inflation (uses current frontier reasoning prices; calculation given by the authors).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... billed dollar amount for same query
In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection.
Experimental/adversarial evaluation reported in the paper showing average inflation in a permissive audit setting (no sample size for queries provided in excerpt).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... percent over-reporting of hidden reasoning token usage
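The dollar figure above follows directly from the inflation percentage if the bill is assumed to scale linearly with the reported token count. The helper below is a hypothetical sketch of that arithmetic. Applying it to the 50.85% figure is an extrapolation for illustration, not a number stated in the excerpt.

```python
def inflated_bill(honest_bill, over_report_pct):
    """Bill after token counts are over-reported by over_report_pct percent.

    Assumes cost is proportional to the reported token count.
    """
    return honest_bill * (1 + over_report_pct / 100)


# 1,469% average inflation applied to a $100 honest bill.
print(round(inflated_bill(100, 1469), 2))

# 50.85% over-reporting from tokenization ambiguity alone (illustrative extrapolation).
print(round(inflated_bill(100, 50.85), 2))
```

An inflation of 1,469% multiplies the bill by 15.69, which is how a $100 charge becomes roughly $1,569.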
We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts.
Empirical/analytical evaluation of three token-auditing frameworks studied by the authors; adversarial provider simulation/experiment (paper states three frameworks were studied).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... ability to inflate billed token counts (systematic over-reporting)
Per-token billing is now the standard pricing model for commercial large language models (LLMs).
Author assertion about prevailing commercial pricing practices (no empirical sample or citation provided in excerpt).
high positive Token Inflation: How Dishonest Providers Can Overcharge for ... pricing model (per-token adoption)
We discuss implications for Information Systems (IS) design and propose future field evaluations.
Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... proposed implications and future research directions
The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.
Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... statistical rigour, traceability, and decision quality in experimentation (Perse...
Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.
Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... feature-level observability/traceability of experimentation activities
The Multi Agent System reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.
Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... time-to-validated-learning (and preservation of statistical rigour, traceability...
Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.
Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... comparison of agentic vs manual B-M-L cycles (experimentation performance metric...