Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
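
When reading the matrix, it can help to sum the four direction columns and compare the result with Total, since the two do not always agree. The Python sketch below does this for three rows copied from the table. The name `unlisted` and the reading that the gap reflects claims whose direction is not one of the four shown columns are assumptions; the page does not explain the difference.

```python
# Three rows copied from the table: (Positive, Negative, Mixed, Null, Total).
rows = {
    "Other": (749, 195, 97, 889, 1979),
    "Governance & Regulation": (815, 391, 188, 121, 1539),
    "Firm Productivity": (435, 55, 88, 20, 604),
}


def summarize(table):
    """Return per-outcome column sums, the gap to Total, and the positive share."""
    result = {}
    for outcome, (pos, neg, mixed, null, total) in table.items():
        shown = pos + neg + mixed + null
        result[outcome] = {
            "shown": shown,
            # Assumption: the gap is claims with a direction not listed here.
            "unlisted": total - shown,
            "positive_share": round(pos / total, 3),
        }
    return result


summary = summarize(rows)
for outcome, stats in summary.items():
    print(outcome, stats)
```

For these rows the four columns sum to slightly less than Total, so shares computed against Total and against the visible columns will differ a little.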

We curated real evidence images together with their associated review and product metadata, and identified genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation.

Data curation pipeline combining multimodal large language model (MLLM) filtering and human annotation as described in the methods.

high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... label quality (genuine damaged vs undamaged) via MLLM-assisted filtering and hum...

FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios.

Dataset construction procedure described in the paper specifying source domains (e-commerce, food delivery, travel services).

high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... coverage of real-world domains in dataset

We introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence.

Methodological contribution described in the paper: design and release of a benchmark dataset (FraudBench).

high positive FraudBench: A Multimodal Benchmark for Detecting AI-Generate... availability of a benchmark dataset for claim-conditioned fraudulent evidence de...

A digital twin analytics platform validation shows that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.

Validation/demonstration reported in the paper using a digital twin analytics platform; platform demonstration claimed to eliminate tool-call hallucination and enable cross-domain configurability via configuration only.

high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... tool-call hallucination elimination and cross-domain configurability without app...

In the same controlled experiment, ontology-grounded parameters reduced domain-identifier hallucination to 0%.

Same controlled experiment (six industry configurations, 72 tool invocations with Qwen3-32B) reported in the paper; ontology-grounded parameter condition produced 0% hallucination.

high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... hallucination rate for domain identifiers (ontology-grounded condition)

The architecture is formalized as a three-operation interface contract — resolve, contextualize, annotate — with invariants enforced by an AIOps orchestration layer.

Design specification and formalization presented in the paper (architectural description).

high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... existence of a three-operation interface contract and invariant enforcement
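
One way to picture a three-operation contract of this kind is as a typed interface. The Python sketch below is illustrative only: the operation names resolve, contextualize, and annotate come from the claim, while the `Entity` type, the method signatures, the `DictOntologyTool` class, and the sample ontology are hypothetical. The single invariant shown (identifiers must already exist in the ontology) is an assumed example, not the paper's stated invariant set.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Entity:
    identifier: str
    context: Dict[str, str] = field(default_factory=dict)
    annotations: List[str] = field(default_factory=list)


class OntologyTool(Protocol):
    """Hypothetical signatures for the three named operations."""

    def resolve(self, name: str) -> Entity: ...
    def contextualize(self, entity: Entity) -> Entity: ...
    def annotate(self, entity: Entity, note: str) -> Entity: ...


class DictOntologyTool:
    """Toy implementation backed by a plain dict (an assumption)."""

    def __init__(self, ontology: Dict[str, Dict[str, str]]) -> None:
        self._ontology = ontology

    def resolve(self, name: str) -> Entity:
        # Assumed invariant: unknown identifiers are rejected, never invented.
        if name not in self._ontology:
            raise KeyError(f"unknown identifier: {name}")
        return Entity(identifier=name)

    def contextualize(self, entity: Entity) -> Entity:
        entity.context = dict(self._ontology[entity.identifier])
        return entity

    def annotate(self, entity: Entity, note: str) -> Entity:
        entity.annotations.append(note)
        return entity


tool: OntologyTool = DictOntologyTool({"press_01": {"line": "A", "type": "press"}})
machine = tool.annotate(tool.contextualize(tool.resolve("press_01")), "checked")
print(machine)
```

Rejecting unknown names at `resolve` is the kind of runtime check that would keep a model from passing an invented identifier to later steps.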

Embedding manufacturing ontology directly into the AI tool layer as a typed relational configuration enforces semantic constraints at runtime and closes the semantic training gap.

Proposed system architecture described and argued in the paper; validated via demonstrations and experiments described later in the paper.

high positive The Semantic Training Gap: Ontology-Grounded Tool Architectu... enforcement of semantic constraints at runtime / closure of semantic gap

This budget-split approach is responsive to the needs of real-world, resource-constrained advertisers committed to equitable distribution of public service outreach via online advertising.

Authors' normative/qualitative conclusion based on the implemented intervention and its practical suitability for government advertisers; no empirical quantification provided in excerpt.

high positive Into the Unknown: Accounting for Missing Demographic Data wh... practical suitability / responsiveness of the intervention for resource-constrai...

The budget split intervention is a valuable approach to addressing ad delivery skew without excluding unknown users.

Authors' empirical finding from the collaboration/intervention (paper reports results from implemented intervention; specific metrics, sample size, and quantitative results are not provided in the excerpt).

high positive Into the Unknown: Accounting for Missing Demographic Data wh... reduction of gender-based ad delivery skew while maintaining inclusion of unknow...

In the absence of platform-provided solutions to skewed ad delivery, advertisers can counteract skew by targeting demographic groups directly.

Descriptive claim about common advertiser strategies; motivated by platform capability gaps (no experimental/sample details in excerpt).

high positive Into the Unknown: Accounting for Missing Demographic Data wh... ability of advertisers to mitigate ad delivery skew via direct demographic targe...

Sustainable progress requires collaborative integration of humans and machines, rather than replacement.

Normative conclusion/recommendation stated in the paper based on study findings (argument for augmented intelligence over replacement).

high positive Augmented Intelligence: Resolving the AI integration-obsoles... approach to AI-human integration

This research presents the innovative Marketing Intelligence Operations (MIO) Framework and a practical AI Adoption Readiness Scorecard, enabling leaders to manage the operational balance between transformative efficiency improvements and human capital vulnerability.

Paper states that it introduces a new framework and a practical scorecard as deliverables of the research (descriptive claim about the paper's contributions).

high positive Augmented Intelligence: Resolving the AI integration-obsoles... AI adoption readiness / operational management capability

AI-integrated Marketing Intelligence Operations (MIO) quantitatively improves campaign Return on Investment (ROI) by 47%.

Reported as an empirical result from the paper's mixed-methods study (the paper states use of audits, surveys, and NLP analysis to evaluate MIO outcomes).

high positive Augmented Intelligence: Resolving the AI integration-obsoles... campaign Return on Investment (ROI)

Deploying LegalCheck in the Municipality of Amsterdam demonstrated substantial efficiency gains, improved legal consistency, and positive user acceptance.

Summary claim based on the real-world deployment outcomes described in the paper (timing improvements, consistency/factual accuracy statements, and reported positive reception by professionals); specific quantitative metrics and sample sizes are not fully reported in the excerpt.

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... efficiency (time), legal consistency, user acceptance

The system produced explainable outputs based on actual regulations and prior cases, providing citations/explainability that support legal reasoning.

Paper describes retrieval from curated legal knowledge bases and generation of outputs grounded in regulations and prior cases during the Amsterdam deployment; presented as a feature of the system and supported by expert review.

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... explainability / traceability of generated legal reasoning to source regulations...

LegalCheck uses a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) with curated legal knowledge bases and controlled prompting to retrieve relevant laws and precedents and incorporate case-specific details into coherent drafts.

System architecture and methodology described in the paper (design/implementation claim).

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... n/a (system design / method description)

Legal professionals found that the system ensured a consistent application of legal standards without replacing human judgment.

Reported qualitative feedback from professionals in the Municipality of Amsterdam deployment and the system design that includes an expert-in-the-loop review; no formal measurement of 'replacement' was reported.

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... consistency in application of legal standards and preservation of human oversigh...

Legal professionals found that the system reduced their workload.

Reported user feedback from legal professionals during the Municipality of Amsterdam deployment; qualitative statements that professionals experienced workload reduction (no numeric workload metrics or sample size reported).

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... perceived workload of legal professionals

The system's output captured the vast majority of required legal reasoning—often 80% to 100% of essential content.

Reported coverage statistic from the deployment/evaluation described in the paper (phrased as 'often 80% to 100% of essential content'); exact evaluation method, sample size, and measurement protocol are not provided in the excerpt.

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... proportion of essential legal reasoning/content captured in generated drafts

LegalCheck maintained high legal consistency and factual accuracy when generating draft letters.

Evaluation during real-world deployment with expert-in-the-loop review and feedback from legal professionals in the Municipality of Amsterdam; claims of high consistency and factual accuracy are reported but no formal numeric accuracy metric or sample size is provided in the text.

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... legal consistency and factual accuracy of generated letters

LegalCheck produced near-final advice letters in minutes rather than hours.

Reported results from a real-world deployment within the Municipality of Amsterdam; system logs / timing comparisons between human drafting time (hours) and LegalCheck-assisted drafting time (minutes) are described in the paper (no explicit numeric sample size reported).

high positive LegalCheck: Retrieval- and Context-Augmented Generation for ... time to produce advice/objection response letters

We outline a research program for the runtime systems that foundation-model software agents will require.

Paper claims to present a forward-looking research agenda or program (stated in abstract); this is a conceptual contribution rather than an empirical finding.

high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... research directions needed for runtime systems for foundation-model software age...

Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports.

Empirical application described in the abstract: framework applied to a controlled validation task showing systematic variation in episode-package evidence structure across harness levels. The abstract does not report sample size or statistical measures.

high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... evidence structure of episode packages produced (types of artifacts: final patch...

We propose a trace-based evaluation protocol that converts each agent run into an auditable episode package.

Methodological proposal described in the abstract proposing a trace-based protocol and an auditable episode package format; no quantitative evaluation details provided in the abstract.

high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... auditability of agent runs (availability of trace-based episode packages)

We operationalize the harness through a four-level ladder (H0–H3) that progressively exposes runtime support to the agent.

Design contribution described in the paper (abstract) introducing a four-level ladder (H0–H3) as an operationalization of the harness concept.

high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... degree of runtime support exposed to an agent across harness levels

Foundation models have transformed automated code generation.

Statement in paper's abstract referring to broad impact of foundation models on automated code generation; likely supported by citations and literature overview within the paper (no sample size or quantitative study reported in the abstract).

high positive AI Harness Engineering: A Runtime Substrate for Foundation-M... ability of foundation models to generate code (automation of coding tasks)

Authorship preservation should be a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.

Authors' recommendation based on experimental results showing negative motivational and behavioral consequences of delegating authorship to LLMs despite improved objective goal quality.

high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... design recommendation (no empirical outcome measured)

Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not.

Mediation analyses reported in the preregistered experiment (authors tested psychological ownership and objective goal quality as mediators of authorship effects on multiple downstream outcomes); preregistered N = 470.

high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... mediating effect of psychological ownership on authorship => motivational and be...

At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition.

Behavioral follow-up measure collected two weeks after the intervention in the preregistered experiment; percentages reported in the paper/abstract. (Follow-up completion N not specified in the abstract.)

high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... proportion of participants who acted on two or more goals within two weeks (beha...

LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness).

Preregistered randomized experiment comparing self-authored vs LLM-authored goals derived from a personal reflection; reported effect size d = 2.26; total preregistered N = 470.

high positive Optimized but Unowned: How AI-Authored Goals Undermine the M... SMART criteria score (objective goal quality)

As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.

Intervention experiments applying PGLA to model decoding on IMAVB; reported consistent improvements in the models' tendency to reject misleading premises after logit adjustment guided by probes.

high positive Senses Wide Shut: A Representation-Action Gap in Omnimodal L... decision_quality
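
The excerpt does not say how PGLA is computed. As a rough illustration of what a probe-guided logit adjustment could look like, the Python sketch below assumes a linear probe with a sigmoid output and an additive boost to the logits of rejection tokens. The functions `probe_score` and `adjust_logits`, the scale `alpha`, and all numbers are hypothetical and should not be read as the paper's method.

```python
import math
from typing import List, Sequence


def probe_score(hidden: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Assumed linear probe: sigmoid(w . h + b), read as a mismatch probability."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def adjust_logits(
    logits: Sequence[float],
    reject_ids: Sequence[int],
    score: float,
    alpha: float = 2.0,
) -> List[float]:
    """Add alpha * score to the logits of tokens that begin a rejection."""
    adjusted = list(logits)
    for token_id in reject_ids:
        adjusted[token_id] += alpha * score
    return adjusted


# Toy values: a zero-weight probe gives a score of exactly 0.5.
score = probe_score(hidden=[1.0, 0.0], weights=[0.0, 0.0], bias=0.0)
new_logits = adjust_logits([0.0, 0.0, 0.0], reject_ids=[2], score=score)
print(score, new_logits)
```

The point of the sketch is only that a signal read from internal representations is fed back into the decoding step, which is how the claim describes the intervention.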

We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.

Description of new benchmark introduced in paper: 500 clips, 2x2 design (vision vs audio × standard vs misleading premises); used to measure conflict detection independently of standard multimodal QA.

high positive Senses Wide Shut: A Representation-Action Gap in Omnimodal L... other
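
The 2x2 design can be enumerated directly, which makes clear why the two measurements separate: the misleading-premise cells probe conflict detection, and the standard-premise cells probe ordinary comprehension. The Python sketch below assumes a hypothetical record format of `(modality, premise, correct)`; the three sample records are made up for illustration and are not benchmark results.

```python
from collections import defaultdict
from itertools import product

MODALITIES = ("vision", "audio")
PREMISES = ("standard", "misleading")

# The four cells of the 2x2 design.
cells = list(product(MODALITIES, PREMISES))


def per_cell_rate(records):
    """Fraction of correct responses in each (modality, premise) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for modality, premise, correct in records:
        counts[(modality, premise)][0] += int(correct)
        counts[(modality, premise)][1] += 1
    return {cell: hits / total for cell, (hits, total) in counts.items()}


# Hypothetical records, for illustration only.
sample = [
    ("vision", "standard", True),
    ("vision", "misleading", False),
    ("vision", "misleading", True),
]
rates = per_cell_rate(sample)
print(cells)
print(rates)
```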

The Agent-First paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.

Conceptual argument and mapping presented in the paper asserting interoperability/orthogonality with transport-layer standards (e.g., MCP).

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... compatibility_with_transport_layer_standards

Agent-First APIs improve autonomous error recovery by 5.8x (compared to optimized CRUD baselines).

Reported comparative experiments on 50 real operational tasks measuring autonomous error recovery capability.

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... autonomous_error_recovery

Agent-First APIs reduce required human interventions by 72.7% (compared to optimized CRUD baselines).

Same set of comparative experiments on 50 real operational tasks reported in the paper.

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... required_human_interventions

Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve an 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 24-percentage-point gain, or +37.5% relative).

Empirical comparative experiments reported in the paper on 50 real operational tasks, comparing Agent-First APIs to optimized CRUD baselines.

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... end-to-end_task_success_rate
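
The +37.5% figure in this claim is a relative change, not a percentage-point difference. A short Python check of the arithmetic, using only the two success rates quoted above (the helper name `improvement` is illustrative):

```python
def improvement(baseline: float, treatment: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = treatment - baseline
    relative = absolute / baseline * 100
    return absolute, relative


abs_points, rel_percent = improvement(baseline=64.0, treatment=88.0)
print(f"{abs_points:.1f} percentage points, {rel_percent:.1f}% relative")
```

The two numbers answer different questions: the absolute gain says how many more tasks in a hundred succeed, and the relative gain says how large that change is compared with the baseline.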

The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains.

Reported production implementation and deployment statistics (platform with 85 registered tools spanning 6 business domains).

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... deployment_of_paradigm_on_production_SaaS_platform

We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation.

Design and specification presented in the paper (proposed architecture and components).

high positive Agent-First Tool API: A Semantic Interface Paradigm for Ente... proposed_API_paradigm_and_components
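
The six verbs and the contract fields named in this claim can be sketched as a small data model. In the Python sketch below, the verb names and the three metadata fields (confidence, evidence, suggested next actions) come from the claim. The class names, the field types, and the transition rule in `next_verb` (any failure routes to recover, and a failed recover stops) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Verb(Enum):
    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


@dataclass
class ToolResponse:
    """Hypothetical shape for Normalized Tool Contract metadata."""

    verb: Verb
    confidence: float
    evidence: List[str] = field(default_factory=list)
    suggested_next: List[Verb] = field(default_factory=list)


HAPPY_PATH = [Verb.SEARCH, Verb.RESOLVE, Verb.PREVIEW, Verb.EXECUTE, Verb.VERIFY]


def next_verb(current: Verb, ok: bool) -> Optional[Verb]:
    """Assumed transition rule; None means the interaction ends."""
    if not ok:
        # Assumption: a failed recover is not retried automatically.
        return None if current is Verb.RECOVER else Verb.RECOVER
    if current in (Verb.VERIFY, Verb.RECOVER):
        return None
    return HAPPY_PATH[HAPPY_PATH.index(current) + 1]


response = ToolResponse(
    verb=Verb.PREVIEW,
    confidence=0.9,
    evidence=["dry run matched 3 records"],
    suggested_next=[next_verb(Verb.PREVIEW, ok=True)],
)
print(response)
```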

LLMs can help generate more correct and functional code compared to participant-generated solutions.

Comparative analysis of generated solutions reported in the paper (the number of solutions is not explicitly stated in the abstract). The paper states that LLM-assisted solutions were more correct and functional.

high positive "Like Taking the Path of Least Resistance": Exploring the Im... correctness and functionality of generated code

Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies.

Qualitative analysis of interaction logs and retrospective interviews from the study participants (N=20) reported in the paper; identification of four collaboration modes described.

high positive "Like Taking the Path of Least Resistance": Exploring the Im... types of collaboration modes

We conducted a within-subject study followed by retrospective interviews with programmers (N=20).

Stated methods in the paper: within-subject experimental design plus retrospective interviews; sample size explicitly given as N=20.

high positive "Like Taking the Path of Least Resistance": Exploring the Im... study_design_and_sample

Organizations classified as 'Proactive Integrators' can reduce the risk of obsolescence by up to 53%.

Subgroup finding reported in the study (reduction estimate for organizations labeled 'Proactive Integrators'); specific subgroup sample not provided in abstract.

high positive The AI-engineering imperative - Navigating synergy and obsol... reduction in risk of skills obsolescence

AI-assisted engineering teams can achieve a 24% increase in productivity.

Empirical finding reported by the study, derived from the mixed-methods analysis (survey of 320 orgs, Delphi with 40 experts, and case studies of 5 industries as described in abstract).

high positive The AI-engineering imperative - Navigating synergy and obsol... increase in productivity of AI-assisted engineering teams

Entities that strategically implement AI can enhance their innovation cycles by up to 30%.

Statement in paper (presented as a forecast/estimate; no specific study or sample detailed in abstract).

high positive The AI-engineering imperative - Navigating synergy and obsol... improvement in innovation cycle speed/efficiency

Frontier directions include differentiable token budgets and dynamic markets to lay the theoretical foundation for scalable next-generation agent systems.

Paper's conclusion/recommendations based on surveyed literature and identified gaps; presented as proposed future research directions rather than empirically validated findings.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... proposal of differentiable token budgets and dynamic markets as key research fro...

Security: Internalizing adversarial threats as endogenous economic constraints.

Authors argue for modeling adversarial threats as endogenous constraints within the token-economics framework; conceptual/theoretical claim from the survey.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... treatment of adversarial threats as endogenous constraints in token economics mo...

Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design.

Paper posits mechanism-design approaches to tackle congestion externalities and pricing in agent ecosystems; conceptual proposal based on economic theory and literature synthesis.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... mitigation of congestion externalities and improved pricing in agent ecosystems

Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories.

Authors propose applying transaction-cost and principal-agent frameworks to multi-agent token interactions; presented as a theoretical taxonomy/synthesis without reported empirical sample.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... reduction of collaboration friction in multi-agent systems through economic-theo...

Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory.

The paper asserts a micro-level taxonomy using neoclassical firm theory to model single-agent token-budget optimization; presented as conceptual/theoretical mapping rather than empirical test.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... ability to optimize budget-constrained factor substitution at single-agent level

We conceptualize tokens as production factors, exchange mediums, and units of account.

Paper provides a conceptual taxonomy framing tokens in three economic roles; based on theoretical argumentation and literature synthesis.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... conceptual framing of tokens into three economic roles
