Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
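The matrix can also be read programmatically. The following is a minimal Python sketch, with four rows copied from the table above; the function name `summarize` and the `unlisted` key are assumptions introduced for illustration. It computes the share of positive findings per outcome and the gap between the Total column and the four listed directions. The table itself does not say what accounts for that gap.

```python
# Rows copied from the evidence matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (749, 195, 97, 889, 1979),
    "Governance & Regulation": (815, 391, 188, 121, 1539),
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def summarize(row):
    """Return the positive share of the total and the count not covered
    by the four listed directions (hypothetical helper, not from the source)."""
    positive, negative, mixed, null, total = row
    listed = positive + negative + mixed + null
    return {
        "positive_share": round(positive / total, 3),
        "unlisted": total - listed,
    }


for outcome, row in ROWS.items():
    print(outcome, summarize(row))
```

For some rows (for example "Other") the four directions sum to less than the Total, while for others (for example "Job Displacement") they match exactly, so shares are computed against the Total column here.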
We curated real evidence images together with their associated review and product metadata, and identified genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation.
Data curation pipeline combining multimodal large language model (MLLM) filtering and human annotation as described in the methods.
FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios.
Dataset construction procedure described in the paper specifying source domains (e-commerce, food delivery, travel services).
We introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence.
Methodological contribution described in the paper: design and release of a benchmark dataset (FraudBench).
A digital twin analytics platform validation shows that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
Validation/demonstration reported in the paper using a digital twin analytics platform; platform demonstration claimed to eliminate tool-call hallucination and enable cross-domain configurability via configuration only.
In the same controlled experiment, ontology-grounded parameters reduced domain-identifier hallucination to 0%.
Same controlled experiment (six industry configurations, 72 tool invocations with Qwen3-32B) reported in the paper; ontology-grounded parameter condition produced 0% hallucination.
The architecture is formalized as a three-operation interface contract — resolve, contextualize, annotate — with invariants enforced by an AIOps orchestration layer.
Design specification and formalization presented in the paper (architectural description).
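To make the three-operation contract concrete, here is a minimal Python sketch. Only the operation names (resolve, contextualize, annotate) come from the claim above. The `Entity` type, the `DictOntology` class, the example ontology entry, and the behavior assigned to each operation are assumptions for illustration, not the paper's specification. The idea it illustrates is that a tool parameter is looked up in the ontology and rejected if unknown, rather than accepted as free text.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Entity:
    identifier: str                               # canonical ontology identifier
    context: dict = field(default_factory=dict)   # filled in by contextualize
    notes: list = field(default_factory=list)     # filled in by annotate


class OntologyContract(Protocol):
    """Assumed shape of the three-operation interface."""
    def resolve(self, name: str) -> Entity: ...
    def contextualize(self, entity: Entity) -> Entity: ...
    def annotate(self, entity: Entity, note: str) -> Entity: ...


class DictOntology:
    """Toy in-memory ontology; the entries are invented for illustration."""

    def __init__(self, entries: dict):
        self._entries = entries

    def resolve(self, name: str) -> Entity:
        key = name.strip().lower()
        if key not in self._entries:
            # Reject unknown terms instead of letting a model guess an identifier.
            raise KeyError(f"unknown ontology term: {name!r}")
        return Entity(identifier=self._entries[key]["id"])

    def contextualize(self, entity: Entity) -> Entity:
        for record in self._entries.values():
            if record["id"] == entity.identifier:
                entity.context = {"unit": record["unit"]}
        return entity

    def annotate(self, entity: Entity, note: str) -> Entity:
        entity.notes.append(note)
        return entity


ontology = DictOntology({"spindle speed": {"id": "MFG:SpindleSpeed", "unit": "rpm"}})
entity = ontology.annotate(
    ontology.contextualize(ontology.resolve("Spindle Speed")), "checked"
)
print(entity)
```

Swapping the dictionary passed to `DictOntology` changes the domain without touching the calling code, which is the configuration-only property the claims describe, shown here only in toy form.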
Embedding manufacturing ontology directly into the AI tool layer as a typed relational configuration enforces semantic constraints at runtime and closes the semantic training gap.
Proposed system architecture described and argued in the paper; validated via demonstrations and experiments described later in the paper.
This budget-split approach is responsive to the needs of real-world, resource-constrained advertisers committed to equitable distribution of public service outreach via online advertising.
Authors' normative/qualitative conclusion based on the implemented intervention and its practical suitability for government advertisers; no empirical quantification provided in excerpt.
The budget split intervention is a valuable approach to addressing ad delivery skew without excluding unknown users.
Authors' empirical finding from the collaboration/intervention (paper reports results from implemented intervention; specific metrics, sample size, and quantitative results are not provided in the excerpt).
In the absence of platform-provided solutions to skewed ad delivery, advertisers can counteract skew by targeting demographic groups directly.
Descriptive claim about common advertiser strategies; motivated by platform capability gaps (no experimental/sample details in excerpt).
Sustainable progress requires collaborative integration of humans and machines, rather than replacement.
Normative conclusion/recommendation stated in the paper based on study findings (argument for augmented intelligence over replacement).
This research presents the innovative Marketing Intelligence Operations (MIO) Framework and a practical AI Adoption Readiness Scorecard, enabling leaders to manage the operational balance between transformative efficiency improvements and human capital vulnerability.
Paper states that it introduces a new framework and a practical scorecard as deliverables of the research (descriptive claim about the paper's contributions).
AI-integrated Marketing Intelligence Operations (MIO) quantitatively improves campaign Return on Investment (ROI) by 47%.
Reported as an empirical result from the paper's mixed-methods study (the paper states use of audits, surveys, and NLP analysis to evaluate MIO outcomes).
Deploying LegalCheck in the Municipality of Amsterdam demonstrated substantial efficiency gains, improved legal consistency, and positive user acceptance.
Summary claim based on the real-world deployment outcomes described in the paper (timing improvements, consistency/factual accuracy statements, and reported positive reception by professionals); specific quantitative metrics and sample sizes are not fully reported in the excerpt.
The system produced explainable outputs based on actual regulations and prior cases, providing citations/explainability that support legal reasoning.
Paper describes retrieval from curated legal knowledge bases and generation of outputs grounded in regulations and prior cases during the Amsterdam deployment; presented as a feature of the system and supported by expert review.
LegalCheck uses a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) with curated legal knowledge bases and controlled prompting to retrieve relevant laws and precedents and incorporate case-specific details into coherent drafts.
System architecture and methodology described in the paper (design/implementation claim).
Legal professionals found that the system ensured a consistent application of legal standards without replacing human judgment.
Reported qualitative feedback from professionals in the Municipality of Amsterdam deployment and the system design that includes an expert-in-the-loop review; no formal measurement of 'replacement' was reported.
Legal professionals found that the system reduced their workload.
Reported user feedback from legal professionals during the Municipality of Amsterdam deployment; qualitative statements that professionals experienced workload reduction (no numeric workload metrics or sample size reported).
The system's output captured the vast majority of required legal reasoning—often 80% to 100% of essential content.
Reported coverage statistic from the deployment/evaluation described in the paper (phrased as 'often 80% to 100% of essential content'); exact evaluation method, sample size, and measurement protocol are not provided in the excerpt.
LegalCheck maintained high legal consistency and factual accuracy when generating draft letters.
Evaluation during real-world deployment with expert-in-the-loop review and feedback from legal professionals in the Municipality of Amsterdam; claims of high consistency and factual accuracy are reported but no formal numeric accuracy metric or sample size is provided in the text.
LegalCheck produced near-final advice letters in minutes rather than hours.
Reported results from a real-world deployment within the Municipality of Amsterdam; system logs / timing comparisons between human drafting time (hours) and LegalCheck-assisted drafting time (minutes) are described in the paper (no explicit numeric sample size reported).
We outline a research program for the runtime systems that foundation-model software agents will require.
Paper claims to present a forward-looking research agenda or program (stated in abstract); this is a conceptual contribution rather than an empirical finding.
Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports.
Empirical application described in the abstract: framework applied to a controlled validation task showing systematic variation in episode-package evidence structure across harness levels. The abstract does not report sample size or statistical measures.
We propose a trace-based evaluation protocol that converts each agent run into an auditable episode package.
Methodological proposal described in the abstract proposing a trace-based protocol and an auditable episode package format; no quantitative evaluation details provided in the abstract.
We operationalize the harness through a four-level ladder (H0–H3) that progressively exposes runtime support to the agent.
Design contribution described in the paper (abstract) introducing a four-level ladder (H0–H3) as an operationalization of the harness concept.
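The ladder and the episode package can be pictured with a small Python sketch. The level names H0 to H3 and the kinds of evidence (final patch, reproduction logs, failure attributions, deterministic requirement checks, structured verification reports) come from the claims above. The field names, the types, and the choice of which fields a given level fills are assumptions, since the excerpt does not say which level produces which artifact.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class HarnessLevel(IntEnum):
    H0 = 0
    H1 = 1
    H2 = 2
    H3 = 3


@dataclass
class EpisodePackage:
    level: HarnessLevel
    final_patch: str
    reproduction_log: Optional[str] = None
    failure_attribution: Optional[str] = None
    requirement_checks: Optional[dict] = None
    verification_report: Optional[dict] = None

    def evidence_items(self) -> list:
        """Names of the evidence fields that are actually populated."""
        return [
            name
            for name, value in vars(self).items()
            if name != "level" and value is not None
        ]


# Placeholder contents; a real package would hold actual run artifacts.
low = EpisodePackage(HarnessLevel.H0, final_patch="<patch text>")
high = EpisodePackage(
    HarnessLevel.H3,
    final_patch="<patch text>",
    reproduction_log="<log text>",
    failure_attribution="<cause>",
    requirement_checks={"R1": True},
    verification_report={"status": "pass"},
)
print(low.evidence_items())
print(high.evidence_items())
```

Using an ordered enum makes comparisons such as `HarnessLevel.H0 < HarnessLevel.H3` meaningful, which matches the "progressively exposes" wording of the ladder.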
Foundation models have transformed automated code generation.
Statement in paper's abstract referring to broad impact of foundation models on automated code generation; likely supported by citations and literature overview within the paper (no sample size or quantitative study reported in the abstract).
Authorship preservation should be a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.
Authors' recommendation based on experimental results showing negative motivational and behavioral consequences of delegating authorship to LLMs despite improved objective goal quality.
Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not.
Mediation analyses reported in the preregistered experiment (authors tested psychological ownership and objective goal quality as mediators of authorship effects on multiple downstream outcomes); preregistered N = 470.
At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition.
Behavioral follow-up measure collected two weeks after the intervention in the preregistered experiment; percentages reported in the paper/abstract. (Follow-up completion N not specified in the abstract.)
LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness).
Preregistered randomized experiment comparing self-authored vs LLM-authored goals derived from a personal reflection; reported effect size d = 2.26; total preregistered N = 470.
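For readers unfamiliar with the effect size d, the following Python sketch computes Cohen's d with a pooled standard deviation, one common definition. The excerpt does not say which variant the authors used, and the scores below are invented solely to exercise the function; they are not data from the study.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Invented SMART-style scores, for illustration only.
llm_scores = [4, 5, 6]
self_scores = [2, 3, 4]
print(cohens_d(llm_scores, self_scores))
```

A value of d expresses the difference in group means in units of the pooled standard deviation, so the reported d = 2.26 would correspond to group means more than two standard deviations apart.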
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.
Intervention experiments applying PGLA to model decoding on IMAVB; reported consistent improvements in the models' tendency to reject misleading premises after logit adjustment guided by probes.
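The excerpt does not give the PGLA formula, so the following Python sketch shows only one plausible reading of "re-injects the encoded mismatch signal into decoding": a linear probe scores a hidden state, and that score is added to the logits of tokens that begin a rejection. The probe weights, the scaling factor `alpha`, the token indices, and all function names are assumptions, not the authors' method.

```python
import math


def probe_score(hidden, weights, bias):
    """Linear probe: sigmoid(w . h + b), read here as P(premise mismatch)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adjust_logits(logits, reject_ids, score, alpha=2.0):
    """Add alpha * score to the logits of tokens that begin a rejection."""
    adjusted = list(logits)
    for i in reject_ids:
        adjusted[i] += alpha * score
    return adjusted


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary: index 0 = "accept premise", index 1 = "reject premise".
hidden_state = [1.0, -0.5]
score = probe_score(hidden_state, weights=[2.0, -2.0], bias=0.0)
before = softmax([2.0, 1.0])
after = softmax(adjust_logits([2.0, 1.0], reject_ids=[1], score=score))
print(before[1], after[1])
```

In this toy setup a high probe score raises the probability of the rejection token, which is the qualitative behavior the claim reports.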
We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.
Description of new benchmark introduced in paper: 500 clips, 2x2 design (vision vs audio × standard vs misleading premises); used to measure conflict detection independently of standard multimodal QA.
The Agent-First paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Conceptual argument and mapping presented in the paper asserting interoperability/orthogonality with transport-layer standards (e.g., MCP).
Agent-First APIs improve autonomous error recovery by 5.8x (compared to optimized CRUD baselines).
Reported comparative experiments on 50 real operational tasks measuring autonomous error recovery capability.
Agent-First APIs reduce required human interventions by 72.7% (compared to optimized CRUD baselines).
Same set of comparative experiments on 50 real operational tasks reported in the paper.
Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve an 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 37.5% relative improvement, or 24 percentage points).
Empirical comparative experiments reported in the paper on 50 real operational tasks, comparing Agent-First APIs to optimized CRUD baselines.
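The headline figure is a relative change, not a difference in percentage points. The short Python sketch below reproduces the arithmetic. It assumes each of the 50 tasks was attempted once per condition, which the excerpt does not state, so the success counts are derived rather than reported.

```python
def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


TASKS = 50
# Derived counts, assuming one attempt per task per condition.
agent_first_successes = round(0.88 * TASKS)
crud_successes = round(0.64 * TASKS)

gain = relative_change(agent_first_successes, crud_successes)
point_gap = 88 - 64  # difference in percentage points

print(agent_first_successes, crud_successes, gain, point_gap)
```

Under that assumption the rates correspond to 44 versus 32 successful tasks, a gap of 24 percentage points and a relative gain of 0.375.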
The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains.
Reported production implementation and deployment statistics (platform with 85 registered tools spanning 6 business domains).
We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation.
Design and specification presented in the paper (proposed architecture and components).
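A rough Python sketch of how these three mechanisms might fit together follows. The six verb names and the three kinds of metadata (confidence scores, evidence chains, suggested next actions) are taken from the claim. The class names, field names, the 0.7 threshold, and the rule that only `execute` is escalated are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Verb(Enum):
    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


@dataclass
class ToolResponse:
    """Assumed shape of a Normalized Tool Contract response."""
    verb: Verb
    ok: bool
    confidence: float
    evidence: list = field(default_factory=list)        # evidence chain
    suggested_next: list = field(default_factory=list)  # suggested next verbs


def next_verb(response: ToolResponse) -> Optional[Verb]:
    """Follow the tool's first suggestion, if it made one."""
    return response.suggested_next[0] if response.suggested_next else None


def authorize(verb, allowed_verbs, risk_score, escalation_threshold=0.7):
    """Two-layer check: static capability policy, then dynamic risk escalation."""
    if verb not in allowed_verbs:
        return "deny"
    if verb is Verb.EXECUTE and risk_score >= escalation_threshold:
        return "escalate"
    return "allow"


failed = ToolResponse(
    verb=Verb.EXECUTE,
    ok=False,
    confidence=0.4,
    evidence=["preview diff did not match"],
    suggested_next=[Verb.RECOVER],
)
print(next_verb(failed))
print(authorize(Verb.EXECUTE, {Verb.SEARCH, Verb.EXECUTE}, risk_score=0.9))
```

Returning a suggested next verb with a failed response is one way an agent could recover without human help, which is the behavior the comparative claims measure.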
LLMs can help generate more correct and functional code than participant-generated solutions.
Comparative analysis of generated solutions reported in the paper (the number of solutions compared is not explicitly stated in the abstract). The paper states LLM-assisted solutions were more correct/functional.
Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies.
Qualitative analysis of interaction logs and retrospective interviews from the study participants (N=20) reported in the paper; identification of four collaboration modes described.
We conducted a within-subject study followed by retrospective interviews with programmers (N=20).
Stated methods in the paper: within-subject experimental design plus retrospective interviews; sample size explicitly given as N=20.
Organizations classified as 'Proactive Integrators' can reduce the risk of obsolescence by up to 53%.
Subgroup finding reported in the study (reduction estimate for organizations labeled 'Proactive Integrators'); specific subgroup sample not provided in abstract.
AI-assisted engineering teams can achieve a 24% increase in productivity.
Empirical finding reported by the study, derived from the mixed-methods analysis (survey of 320 orgs, Delphi with 40 experts, and case studies of 5 industries as described in abstract).
Entities that strategically implement AI can enhance their innovation cycles by up to 30%.
Statement in paper (presented as a forecast/estimate; no specific study or sample detailed in abstract).
Frontier directions include differentiable token budgets and dynamic markets to lay the theoretical foundation for scalable next-generation agent systems.
Paper's conclusion/recommendations based on surveyed literature and identified gaps; presented as proposed future research directions rather than empirically validated findings.
Security: Internalizing adversarial threats as endogenous economic constraints.
Authors argue for modeling adversarial threats within the economic/tokens framework as endogenous constraints; conceptual/theoretical claim from the survey.
Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design.
Paper posits mechanism-design approaches to tackle congestion externalities and pricing in agent ecosystems; conceptual proposal based on economic theory and literature synthesis.
Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories.
Authors propose applying transaction-cost and principal-agent frameworks to multi-agent token interactions; presented as a theoretical taxonomy/synthesis without reported empirical sample.
Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory.
The paper asserts a micro-level taxonomy using neoclassical firm theory to model single-agent token-budget optimization; presented as conceptual/theoretical mapping rather than empirical test.
We conceptualize tokens as production factors, exchange mediums, and units of account.
Paper provides a conceptual taxonomy framing tokens in three economic roles; based on theoretical argumentation and literature synthesis.