Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
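
One way to read the matrix is as per-outcome shares of each direction. The sketch below uses four rows copied from the table and is illustrative only. For some rows the Total column exceeds the sum of the four direction columns (for example, Other lists 1984 against a sum of 1935); the page does not explain the difference, so shares are computed here against the sum of the listed directions, and a dash is treated as zero. Both choices are assumptions.

```python
# Rows copied from the matrix above: (positive, negative, mixed, null, total).
# A dash in the table is treated as zero, which is an assumption.
ROWS = {
    "Other": (749, 196, 98, 892, 1984),
    "AI Safety & Ethics": (214, 276, 65, 33, 593),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Labor Share of Income": (17, 17, 17, 0, 51),
}

def direction_shares(counts):
    """Return each direction's share of the four listed direction counts."""
    listed = sum(counts[:4])
    names = ("positive", "negative", "mixed", "null")
    # zip stops after the four direction counts, so the total is ignored here.
    return {name: value / listed for name, value in zip(names, counts)}

def unlisted(counts):
    """Claims in the Total column not covered by the four direction columns."""
    return counts[4] - sum(counts[:4])

for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(f"{outcome}: positive {shares['positive']:.1%}, "
          f"negative {shares['negative']:.1%}, unlisted {unlisted(counts)}")
```

Computing shares against the listed directions keeps the four shares summing to one even where the Total column is larger.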

Changes in skill demand in online labour markets are an outcome of introducing platform-embedded GenAI.

Synthesis of the study's empirical findings (difference-in-differences results showing increased skill diversity in logo jobs post-logo-AI and mediation evidence via competition) leading to the broader conclusion that platform-embedded GenAI can change skill demand on online labour platforms.

high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... skill demand (changes in requested skills on online labour platforms)

Stronger competition among freelancers partially mediates the effect of the platform-embedded logo-AI on higher skill diversity in logo jobs.

Mediation analysis within the difference-in-differences framework linking measures of freelancer competition to changes in requested skill diversity after the logo-AI launch. Specific mediation estimation details and sample size not provided in the abstract.

high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... skill diversity (mediated by freelancer competition)

Logo jobs exhibit higher skill diversity than other design jobs after the platform introduced logo-AI.

Difference-in-differences comparison of skill-diversity metrics extracted via the authors' LLM-based skill extraction and embedding framework on EPWK job posts for logo design (treatment) versus other design jobs (control), pre- and post-introduction of the platform-embedded logo-AI tool. Sample size not reported in the abstract.

high positive Exploring The Effect Of Platform-Embedded Generative Ai On S... requested skill diversity in job posts
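
The difference-in-differences comparison described above can be sketched with group means. This is a minimal illustration of the general technique, not the authors' estimation procedure. The function name and all scores are hypothetical, since the abstract reports no values or sample sizes.

```python
from statistics import mean

def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Difference-in-differences on group means:
    (treated post - treated pre) - (control post - control pre)."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical skill-diversity scores per job post (illustrative only).
logo_pre = [2.0, 2.2, 1.8]     # logo jobs before the logo-AI launch
logo_post = [2.9, 3.1, 3.0]    # logo jobs after the launch
other_pre = [2.1, 1.9, 2.0]    # other design jobs before
other_post = [2.3, 2.1, 2.2]   # other design jobs after

effect = did_estimate(logo_pre, logo_post, other_pre, other_post)
print(f"DiD estimate: {effect:.2f}")
```

Subtracting the control group's change removes any shift that affected all design jobs over the same period, under the usual parallel-trends assumption.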

Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human centered use cases.

Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... planned incorporation of bias-aware benchmarks and human-centered use case consi...

Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.

Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... types/categories of tests implemented

Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.

Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and application of implemented evaluation tests across types of modeli...

A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.

Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... organizational roles for benchmark prioritization and implementation

The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.

Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... transparency and breadth of contributions enabled by the open source sd ai proje...

The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.

Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... use of open infrastructure for collaborative evaluation

The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human centered modeling and simulation practices.

Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and purpose of the BEAMS Initiative (benchmarking for responsible/ethi...

Tools that can automate aspects of modeling practice must complement human expertise, not replace it.

Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... relationship between automated modeling tools and human expertise (complementari...

AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them interpretable.

Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... ability of AI tools to build interpretable simulation models that inform recomme...

The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.

Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... policy/leadership influence on welfare distribution of AI-driven innovation

These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.

Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... productivity trajectory (J-curve) following complementary investments

The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.

Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... magnitude/importance of disruption arising from Reconstruction-stage changes

The system preserves human agency via override mechanisms.

Design description of the collaborative forecasting system that explicitly includes override controls for human users.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... preservation of human agency (ability to override algorithmic forecasts)

The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).

Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... guidance/blueprint for operational planning tool design

Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.

Qualitative evaluation consisting of two think-aloud sessions reported in the paper.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... importance/role of human judgment in handling high-uncertainty forecasting event...

Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.

Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... forecast error
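
A relative error reduction of this kind can be computed as follows. The sketch assumes mean absolute error as the metric and a "repeat the previous value" naive baseline; the excerpt does not say which metric or baseline the paper used. The demand series and model forecasts are invented for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def relative_error_reduction(actual, model_forecast, baseline_forecast):
    """Fraction by which the model's MAE is lower than the baseline's MAE."""
    baseline_error = mae(actual, baseline_forecast)
    if baseline_error == 0:
        raise ValueError("baseline error is zero; reduction is undefined")
    return (baseline_error - mae(actual, model_forecast)) / baseline_error

# Hypothetical daily demand; the naive baseline repeats the previous day's value.
demand = [100, 120, 90, 110, 130]
actual = demand[1:]
naive = demand[:-1]
model = [106, 106, 94, 113]  # hypothetical model forecasts for the same days

reduction = relative_error_reduction(actual, model, naive)
print(f"Error reduction vs naive baseline: {reduction:.0%}")
```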

Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.

high positive Dissociative Identity: Language Model Agents Lack Grounding ... governance effectiveness of observability-based, ex ante protocol mechanisms

Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.

Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.

high positive Dissociative Identity: Language Model Agents Lack Grounding ... trustworthy behavior (sustaining equilibrium of trust)

Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.

Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.

high positive Teaming Up with Artificial Agents in Non-routine Analytical ... conversational pattern similarity to human–human dialogue

Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.

Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.

high positive Teaming Up with Artificial Agents in Non-routine Analytical ... task completion / success rate

Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... reduction in review bottlenecks and preservation of production safety

RADAR reduces median diff review wall time by 35%.

Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median diff review wall time

RADAR reduces median time to close by over 330%.

Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median time to close for diffs

The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.

Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... production incident rate (RADAR vs non-RADAR)

The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.

Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... diff revert rate (RADAR vs non-RADAR)

Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.

Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... approve rate of diffs under RADAR as a function of Diff Risk Score threshold
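
To illustrate what moving a percentile threshold does, the sketch below gates a set of risk scores at the 25th and then the 50th percentile. It is a generic nearest-rank percentile cutoff, not RADAR's implementation: the excerpt does not describe how the Diff Risk Score is computed or applied, and the share of diffs under the cutoff is not the same quantity as the reported approve rate. All scores and function names are hypothetical.

```python
import math

def percentile_threshold(scores, percentile):
    """Nearest-rank percentile: the smallest score such that at least
    `percentile` percent of the scores are at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def eligible(scores, percentile):
    """Scores at or below the cutoff for the given percentile."""
    cutoff = percentile_threshold(scores, percentile)
    return [score for score in scores if score <= cutoff]

# Hypothetical risk scores for eight diffs (higher means riskier).
risk_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]

strict = eligible(risk_scores, 25)   # cutoff at the 25th percentile
relaxed = eligible(risk_scores, 50)  # cutoff at the 50th percentile
print(len(strict), len(relaxed))
```

Raising the percentile admits diffs with higher risk scores, which is why a threshold change of this kind has to be checked against safety outcomes as well as throughput.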

RADAR has reviewed 535K+ diffs and landed 331K+ changes.

System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... number of diffs reviewed and diffs landed by RADAR

Agentic AI was responsible for over 80% of that growth in code volume.

Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... share of growth in code/diff volume attributable to agentic AI

Per-developer diff volume rose 51% (year over year) at Meta.

Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... per-developer diff volume (year-over-year change)

At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.

Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... lines of code per human-landed diff (year-over-year growth)

The concordance has many relevant applications in research and policy analyses of innovation.

Claim about the utility and applicability of the concordance stated by the authors; no enumeration of specific applications or empirical demonstrations included in excerpt.

high positive A concordance between patent and trademark classes to link t... potential applicability of the concordance for research and policy

The concordance can be used to track the diffusion of patented technologies at the technology, firm, region, or country level.

Stated intended applications of the concordance in the paper; excerpt does not present empirical case studies or performance metrics.

high positive A concordance between patent and trademark classes to link t... ability to track diffusion of patented technologies across multiple aggregation ...

We develop, validate and share a novel concordance between technology classes in patent records and market classes in trademark records.

Primary methodological contribution reported by the authors (development, validation, and sharing of a concordance); excerpt does not include validation method details or sample size.

high positive A concordance between patent and trademark classes to link t... existence and release of a concordance mapping patent technology classes to trad...

Patent and trademark data can be combined to link given technologies to specific markets.

Conceptual/methodological claim in paper proposing combination of patent and trademark records to map technologies to markets; excerpt does not include empirical validation details.

high positive A concordance between patent and trademark classes to link t... linkage between technologies (patents) and markets (trademarks)

Trademark filings that accompany the market introduction of new goods and services are a data source that can reveal the market introduction of technologies.

Descriptive claim in paper noting trademarks as a complementary data source to patents; no sample size or validation details in excerpt.

high positive A concordance between patent and trademark classes to link t... ability to detect market introduction of goods/services via trademark filings

Patent data is the preferred source of information for tracking technological change.

Statement in paper (introductory claim); no empirical sample or method reported in excerpt.

high positive A concordance between patent and trademark classes to link t... usefulness of patent data for tracking technological change

Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.

Policy/recommendation proposed by the authors based on their findings (argument that independent verification is necessary).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... requirements for restoring honest billing (types of verification needed)

Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold.

Experimental result reported in the paper showing over-reporting due solely to tokenizer ambiguity when reasoning string is visible (no sample size in excerpt).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... percent over-reporting of billed tokens due to tokenization ambiguity

At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query.

Numerical example/price calculation based on the reported inflation (uses current frontier reasoning prices; calculation given by the authors).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... billed dollar amount for same query

In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection.

Experimental/adversarial evaluation reported in the paper showing average inflation in a permissive audit setting (no sample size for queries provided in excerpt).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... percent over-reporting of hidden reasoning token usage
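
The dollar figure in the previous claim follows from the reported inflation percentage by simple arithmetic. The sketch below assumes the whole bill scales linearly with the reported token count, that is, a single per-token price and no fixed fees; the function name is hypothetical.

```python
def inflated_bill(honest_bill, inflation_percent):
    """Bill after over-reporting usage by inflation_percent of the honest amount."""
    return honest_bill * (1 + inflation_percent / 100)

# 1,469% average inflation in the most permissive setting.
hidden_reasoning = inflated_bill(100.0, 1469)
# 50.85% over-reporting from tokenization ambiguity alone.
visible_reasoning = inflated_bill(100.0, 50.85)

print(f"${hidden_reasoning:,.2f}")
print(f"${visible_reasoning:,.2f}")
```

Under these assumptions a $100 honest bill becomes about $1,569 in the hidden-reasoning case, matching the figure quoted above, and about $150.85 when only tokenization ambiguity is exploited.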

We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts.

Empirical/analytical evaluation of three token-auditing frameworks studied by the authors; adversarial provider simulation/experiment (paper states three frameworks were studied).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... ability to inflate billed token counts (systematic over-reporting)

Per-token billing is now the standard pricing model for commercial large language models (LLMs).

Author assertion about prevailing commercial pricing practices (no empirical sample or citation provided in excerpt).

high positive Token Inflation: How Dishonest Providers Can Overcharge for ... pricing model (per-token adoption)

We discuss implications for Information Systems (IS) design and propose future field evaluations.

Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... proposed implications and future research directions

The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.

Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... statistical rigour, traceability, and decision quality in experimentation (Perse...

Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.

Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... feature-level observability/traceability of experimentation activities

The Multi Agent System reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.

Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... time-to-validated-learning (and preservation of statistical rigour, traceability...

Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.

Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... comparison of agentic vs manual B-M-L cycles (experimentation performance metric...
