Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
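
The four direction columns do not always add up to the Total column. The following minimal Python sketch uses four rows copied from the table above to compute each outcome's positive share and the gap between Total and the listed directions. Treating a dash as zero, and reading the gap as claims that carry none of the four direction labels, are assumptions; the page does not explain the difference.

```python
# Rows copied from the evidence matrix: (outcome, positive, negative, mixed, null, total).
# A dash in the table is treated as zero here (an assumption).
ROWS = [
    ("Other", 749, 195, 97, 889, 1979),
    ("Firm Productivity", 435, 55, 88, 20, 604),
    ("Job Displacement", 12, 80, 20, 1, 113),
    ("Labor Share of Income", 17, 17, 17, 0, 51),
]

def summarize(row):
    """Return (outcome, positive share of total, total minus the four listed directions)."""
    outcome, positive, negative, mixed, null, total = row
    listed = positive + negative + mixed + null
    return outcome, positive / total, total - listed

SUMMARY = [summarize(r) for r in ROWS]

for outcome, share, residual in SUMMARY:
    print(f"{outcome}: {share:.1%} positive, {residual} outside the four listed columns")
```

For the rows shown, "Other" leaves 49 claims outside the four columns and "Firm Productivity" leaves 6, while "Job Displacement" and "Labor Share of Income" reconcile exactly.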

This survey presents the first comprehensive survey of Token Economics.

Author claim of novelty in the paper (self-declared 'first comprehensive survey'); based on the authors' scope and coverage comparison to prior literature as described in the manuscript.

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... comprehensiveness and novelty of the survey

Tokens have emerged as the core economic primitives of Agentic AI.

Author assertion in the paper's introduction/abstract; supported by conceptual synthesis of agentic AI literature (survey/mapping rather than original empirical data).

high positive Token Economics for LLM Agents: A Dual-View Study from Compu... recognition of tokens as core economic primitives in agentic AI

AwareLLM opens new avenues for Human-AI collaboration where technology adapts to users' needs rather than users adhering to technological constraints.

Authorial/conceptual claim based on the proposed framework and study results; presented as a broader implication rather than a direct empirical finding.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... human-AI collaboration potential

Participants described AwareLLM's personalized interventions as timely and relevant, helping them boost their confidence and deepen engagement with their work.

Qualitative user feedback reported in the study (participant descriptions); sample size 20. No coding details or counts provided in the abstract.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... confidence and engagement (subjective reports)

AwareLLM reduced mental demand for participants.

Reported results from the user study (comparison to a standard LLM assistant) with 20 participants; abstract reports reductions but gives no quantitative metrics.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... mental demand

AwareLLM led to reductions in cognitive fatigue.

Reported results from the user study comparing AwareLLM to a standard LLM assistant; sample size 20. No quantitative values provided in the abstract.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... cognitive fatigue

Compared to a standard LLM assistant, AwareLLM produced statistically significant improvements in task performance.

Results reported from the user study (comparison vs. standard LLM); sample size noted as 20 participants. No numerical effect size provided in the abstract.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... task performance

AwareLLM dynamically adapts to users' psychophysiological states while analyzing temporal patterns and behavioral tendencies to provide personalized and timely interventions.

Design and claimed operational behavior of the proposed framework as described by authors.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... personalization/adaptivity of interventions

We introduce AwareLLM, a multimodal framework that integrates egocentric vision, pupillometry, eye-gaze tracking, posture detection, heart activity, and large language models to create a proactive and context-aware ecosystem.

System/methods description in paper (architecture/design claim).

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... system capability to combine multimodal signals

Information workers' productivity is significantly influenced by their cognitive states and physiological responses.

Background statement in paper (literature-motivated claim); no study data provided within the abstract to support it.

high positive AwareLLM: A Proactive Multimodal Ecosystem for Personalized ... productivity

The research contributes a novel sociotechnical architecture class that integrates intent interpretation, schema formalization, and supervised agentic decision support, offering a scalable pathway for inclusive AI-driven enterprise transformation.

Concluding/novelty claim presented by the authors describing the paper's contribution; scalability and inclusiveness are asserted conceptually without empirical scaling or adoption evidence in the excerpt.

high positive From Configuration to Cognition: A Self-Configuring Agentic ... scalability and inclusiveness of AI-driven enterprise transformation via the pro...

The architecture incorporates adaptive agentic orchestration and Cognitive Infrastructure Elasticity, enabling dynamic policy adjustment under demand volatility while preserving human-supervisory governance.

Architectural/design claim in the paper describing system capabilities; no experimental or empirical validation provided in the excerpt.

high positive From Configuration to Cognition: A Self-Configuring Agentic ... capacity for dynamic policy adjustment under demand volatility and preservation ...

The framework operationalizes Intent-to-Schema automation, translating natural-language business intent into structured operational models and reducing configuration debt embedded in traditional metadata-driven systems.

Described as a functionality of the proposed framework in the paper (conceptual/technical claim); no quantitative evaluation or measured reduction of 'configuration debt' reported in the excerpt.

high positive From Configuration to Cognition: A Self-Configuring Agentic ... translation of natural-language intent to operational schemas and reduction of c...

This paper introduces a Self-Configuring Agentic CRM (SC-ACRM) architecture designed to eliminate configuration barriers in micro-retail contexts.

Architectural proposal and description presented in the paper (design-level contribution); no field deployment or empirical validation reported in the excerpt.

high positive From Configuration to Cognition: A Self-Configuring Agentic ... elimination/reduction of configuration barriers for micro-retail CRM

Artificial intelligence (AI) has significantly enhanced enterprise-scale customer relationship management (CRM) systems.

Stated as background/claim in the paper's introduction; no empirical data, sample size, or citations provided in the excerpt.

high positive From Configuration to Cognition: A Self-Configuring Agentic ... enhancement of enterprise-scale CRM systems

An AI Workflow Store of hardened and reusable workflows would allow agents to invoke workflows with far greater reliability and security than improvised tool chains.

Vision/proposal in the paper advocating an AI Workflow Store as a solution; presented conceptually without experimental or deployment evidence.

high positive Engineering Robustness into Personal Agents with the AI Work... reliability and security of agent-invoked workflows

Integrating rigorous software engineering processes into the agentic loop will produce production-grade, hardened, and deterministically-constrained agent workflows that substantially outperform brittle on-the-fly synthesis.

Prescriptive claim / proposed hypothesis in the paper advocating integration of SE practices into agent workflows; offered as a reasoned proposal without empirical results.

high positive Engineering Robustness into Personal Agents with the AI Work... workflow reliability/security and overall performance compared to on-the-fly syn...

Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation.

Methodological description: seed-driven architecture and simulated API failures; claimed as a distinguishing design feature versus prior datasets.

high positive ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdepend... determinism and diversity of environment states / simulated API failure scenario...
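
One way to read "deterministic yet diverse" is that every episode is derived from a seed: the same seed always replays the same tool calls and failures, while different seeds give different episodes. The sketch below illustrates that idea only. It is not ComplexMCP's implementation, and the function name, tool names, and failure rate are hypothetical.

```python
import random

def build_episode(seed, tools, steps=5, failure_rate=0.2):
    """Generate one evaluation episode that is fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global random state
    episode = []
    for _ in range(steps):
        tool = rng.choice(tools)
        fails = rng.random() < failure_rate  # simulated API failure for this call
        episode.append((tool, fails))
    return episode

# Hypothetical tool names, for illustration only.
TOOLS = ["calendar.create_event", "ledger.post_entry", "mail.send"]

first = build_episode(seed=7, tools=TOOLS)
again = build_episode(seed=7, tools=TOOLS)
print(first == again)  # True: the same seed replays the same episode
```

Keeping the generator local to the episode, rather than seeding the global `random` module, means unrelated code cannot disturb the replay.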

ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems.

Benchmark construction details reported in the paper: >300 tools, 7 stateful sandboxes (explicit counts provided).

high positive ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdepend... number of tools / sandboxes included in the benchmark

We introduce ComplexMCP, a benchmark designed to evaluate agents in rigorous conditions built on the Model Context Protocol (MCP).

Design and construction of the benchmark reported by authors; methodological description (benchmark/tooling claim).

high positive ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdepend... availability of a benchmark implementing MCP for complex, stateful tool evaluati...

There is a 15%–22% wage premium for workers demonstrating AI-augmentation capabilities.

Reported range across synthesized empirical studies documenting wage differences associated with demonstrated AI-augmentation capabilities.

high positive Creation, validation, obsolescence: observed evidence of AI-... wage premium for workers demonstrating AI-augmentation capabilities

The study draws policy implications for EU Cohesion programming and Sustainable Development Goals 4, 8, 9, 10, and 17.

Paper explicitly states policy implications and links to specific SDGs in its conclusions.

high positive Artificial Intelligence, Social Capital, and Sustainable Emp... policy_relevance_to_SDGs_and_cohesion_programming

External technology partnerships, targeted education, and economic incentives operate as enablers [of AI adoption], all mediated by social and human capital availability.

Thematic analysis of interview data identifying these factors as enabling AI adoption, with mediation by social/human capital.

high positive Artificial Intelligence, Social Capital, and Sustainable Emp... enablers_of_AI_adoption

The socially optimal adoption speed and retraining capacity are complements: stronger institutions (larger retraining capacity) raise the optimal adoption speed.

Comparative-static result from the social-planner optimization in the dynamic model showing positive cross-partial effect between retraining capacity and optimal adoption speed.

high positive Too Fast to Adjust: Adoption Speed and the Permanent Cost of... optimal adoption speed as a function of retraining capacity / institutional stre...

Faster adoption produces a larger discouraged stock.

Analytical comparative-static result from the dynamic model linking adoption speed to the size of the discouraged (permanently exited) worker stock.

high positive Too Fast to Adjust: Adoption Speed and the Permanent Cost of... discouraged stock (count of permanently exited workers)

Faster AI adoption compresses the displacement window without reducing total displacement.

Analytical result from a dynamic theoretical model in which displaced routine workers enter a retraining pipeline with finite capacity (model derivation and comparative statics). No empirical sample reported.

high positive Too Fast to Adjust: Adoption Speed and the Permanent Cost of... displacement window length / total displacement
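
The intuition behind the three results above can be illustrated with a toy queue. This is not the paper's model. It assumes a fixed number of displaced workers arriving evenly over an adoption window, a fixed per-period retraining capacity, and a constant share of waiting workers who exit permanently each period; all parameter values are hypothetical.

```python
def simulate(total_displaced, window, capacity, exit_rate, horizon=200):
    """Toy retraining pipeline: returns (retrained, discouraged) after `horizon` periods."""
    queue = retrained = discouraged = 0.0
    for t in range(horizon):
        if t < window:
            queue += total_displaced / window  # displacement spread evenly over the window
        served = min(queue, capacity)          # finite retraining capacity per period
        queue -= served
        retrained += served
        exits = exit_rate * queue              # a share of those still waiting exit for good
        queue -= exits
        discouraged += exits
    return retrained, discouraged

# Same total displacement, different adoption speeds (hypothetical numbers).
slow = simulate(total_displaced=100, window=10, capacity=10, exit_rate=0.2)
fast = simulate(total_displaced=100, window=2, capacity=10, exit_rate=0.2)
print("slow adoption (retrained, discouraged):", slow)
print("fast adoption (retrained, discouraged):", fast)
```

In this toy setup the slow path never exceeds capacity, so nobody is discouraged, while the fast path displaces the same 100 workers in a fifth of the time and pushes part of the queue into permanent exit.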

We evaluate five defences; read-only access control eliminates the direct mutation vector, while the remaining four are partial and model-dependent.

Defence evaluation experiments reported by authors across five mitigation strategies; read-only access control reported to eliminate direct mutation; other four provide partial, model-dependent protection.

high positive Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... effectiveness of mitigation strategies in preventing or limiting Oracle Poisonin...

Team-based ventures are increasingly dominant in the top tiers of platform rankings.

Ranking-tier analysis in the Product Hunt dataset showing an increasing share of team-founded launches among top-tier (highest-ranked) products over the study period.

high positive Generative AI Fuels Solo Entrepreneurship, but Teams Still L... share of team-founded ventures among top-tier (highest-ranked) launches

The increase in entrepreneurial entry was driven disproportionately by solo entrepreneurs.

Same Product Hunt dataset (>160,000 launches) with analysis of launch ownership structure showing a larger post-release increase in launches by solo founders relative to teams.

high positive Generative AI Fuels Solo Entrepreneurship, but Teams Still L... share or count of launches by solo entrepreneurs

Entrepreneurial entry increased sharply following the public release of ChatGPT-3.5.

Analysis of over 160,000 product launches on Product Hunt comparing entry rates before and after the public release of ChatGPT-3.5 (event-study / pre-post comparison across the platform).

high positive Generative AI Fuels Solo Entrepreneurship, but Teams Still L... entrepreneurial entry (count of product launches)

TourMart outputs a sentence a compliance report can quote: 'at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions.'

Reported output/example from TourMart tool summarizing the measured steering effect (rounded result based on experimental measurement).

high positive TourMart: A Parametric Audit Instrument for Commission Steer... commission-steered recommendations per 100 paired traveler sessions (tool-genera...

An extended-n supplement (n=270) confirms significance for Llama-3.1-8B (+2.96pp, p=0.008).

Larger-sample experimental replication/extension reported in the paper with n=270 and a p-value (p=0.008).

high positive TourMart: A Parametric Audit Instrument for Commission Steer... commission-steered recommendations (percentage-point difference between prompts)

A Llama-3.1-8B reader shows +3.50pp steering in the same direction at n=143 (initial test).

Empirical experiment using TourMart with Llama-3.1-8B at the same (deployed) settings; sample size explicitly reported as n=143 for this test.

high positive TourMart: A Parametric Audit Instrument for Commission Steer... commission-steered recommendations (percentage-point difference between prompts)

At the deployed settings (lambda=1, kappa=0.05), a Qwen-14B reader shows +7.69pp steering (exact McNemar p=0.003).

Empirical experiment using TourMart at specified governance settings (lambda=1, kappa=0.05) comparing commission-aware vs. minimum-disclosure prompts; statistical test reported (exact McNemar). Sample size not stated in excerpt.

high positive TourMart: A Parametric Audit Instrument for Commission Steer... commission-steered recommendations (percentage-point difference in acceptance be...
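
For readers unfamiliar with the test named here, an exact McNemar test looks only at the discordant pairs (sessions whose outcome differs between the two prompts) and asks how surprising their split would be under a fair coin. The sketch below is a generic implementation, not TourMart's code. The counts are hypothetical, chosen only so that the net difference is 11 of 143 pairs; the excerpt gives neither the Qwen-14B sample size nor the discordant counts.

```python
from math import comb

def exact_mcnemar(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts, not taken from the paper.
N_PAIRS = 143  # paired traveler sessions
B = 12         # steered only under the commission-aware prompt
C = 1          # steered only under the minimum-disclosure prompt

effect_pp = 100 * (B - C) / N_PAIRS
p_value = exact_mcnemar(B, C)
print(f"{effect_pp:+.2f}pp, exact McNemar p={p_value:.3f}")
```

Concordant pairs affect the percentage-point effect through the denominator but play no part in the p-value, which is why a small net shift can still be significant when nearly all discordant pairs point the same way.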

Each booking earns the OTA commission and different suppliers pay different rates: the agent has a structural incentive to favor higher-margin recommendations.

Theoretical/structural argument in paper based on commission heterogeneity and revenue incentives; not an experimental measurement in excerpt.

high positive TourMart: A Parametric Audit Instrument for Commission Steer... incentive to favor higher-margin supplier recommendations

BenchCAD positions itself as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

Authors' stated goal/purpose in the paper/abstract describing BenchCAD as a benchmark intended to measure and guide improvements towards industrial readiness.

high positive BenchCAD: A Comprehensive, Industry-Standard Benchmark for P... benchmark_intended_impact_on_industrial_readiness

Industrial CAD code generation requires models to produce executable parametric programs from visual or textual inputs and to understand 3D structure, infer engineering parameters, and choose CAD operations that reflect design and manufacture.

Problem definition and motivation provided by the authors in the paper/abstract describing the necessary capabilities for industrial CAD code generation.

high positive BenchCAD: A Comprehensive, Industry-Standard Benchmark for P... required_capabilities_for_task

BenchCAD enables fine-grained analysis across perception, parametric abstraction, and executable program synthesis.

Authors' description of benchmark scope and tasks designed to probe perception (visual understanding), parametric abstraction (inferring engineering parameters), and executable program synthesis (generating runnable CadQuery programs).

high positive BenchCAD: A Comprehensive, Industry-Standard Benchmark for P... analysis_capability_of_benchmark

BenchCAD evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing.

Benchmark design described in the paper/abstract listing four evaluation tasks (VQA, code QA, image-to-code, instruction-guided code editing).

high positive BenchCAD: A Comprehensive, Industry-Standard Benchmark for P... evaluation_task_coverage

BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families.

Dataset construction reported in the paper/abstract: explicit statement of 17,900 execution-verified CadQuery programs spanning 106 industrial part families (e.g., bevel gears, compression springs, twist drills).

high positive BenchCAD: A Comprehensive, Industry-Standard Benchmark for P... presence_and_scope_of_dataset

Alternatives to one-size-fits-all chatbots—such as pluralistic system design, task-specific tools, and institutional safeguards—would better mitigate social and economic harm.

Prescriptive recommendations based on the paper's analysis; not supported by empirical trials or quantified evaluations within the paper.

high positive What if AI systems weren't chatbots? Effectiveness of pluralistic design, task-specific tools, and institutional safe...

Verification Coverage, a six-component reportable standard with a minimum-composition rule, should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

Author-proposed metric/standard introduced in the paper as a policy/tool recommendation.

high positive The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... inclusion of 'Verification Coverage' standard alongside capability scores in rep...

The gate to deploy should be 'calibrated verification': authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable.

Normative proposal by the authors (prescriptive recommendation presented in the paper).

high positive The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... recommended features of deployment authorization regime

Model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general.

Author claim supported by the conceptual point that model capabilities vary across tasks; used as an argument for use-specific authorization.

high positive The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... appropriateness of use-scoped authorization vs model-wide authorization

AI is the most important predictive factor for Lae (based on artificial neural network analysis).

Artificial neural network (ANN) predictive modeling on composite indices for AI and Lae using panel data from 2012–2022 across 30 provincial regions; variable importance ranking from ANN indicates AI as top predictor.

high positive A study of the impact of artificial intelligence on the low-... predictive importance for Lae

An exogenous shock test using the Big Data Pilot Zone policy further confirms the robustness of the AI–Lae relationship findings.

Policy shock (Big Data Pilot Zone) robustness test performed on the same panel of 30 provincial regions (2012–2022); described as an exogenous shock test corroborating the main results.

high positive A study of the impact of artificial intelligence on the low-... robustness of AI's effect on Lae

Regression results show a positive relationship between firm performance and breadth of AI integration.

Multivariate regression analysis reported in the paper using BTOS AI supplement data (Nov 2025–Jan 2026); association between firm performance (dependent variable) and measures of AI integration breadth (independent variables); sample size and controls not included in excerpt.

high positive The Microstructure of AI Diffusion: Evidence from Firms, Bus... firm performance (as related to AI integration breadth)

Most firms (66%) use AI for task augmentation rather than replacement.

Survey responses about intent/role of AI within firms from the BTOS AI supplement (Nov 2025–Jan 2026); descriptive percent reporting augmentation vs replacement; sample size not provided.

high positive The Microstructure of AI Diffusion: Evidence from Firms, Bus... reported primary role of AI (task augmentation vs replacement)

Worker-level AI use appears in 23% of firms (41%, employment-weighted), primarily for writing, document analysis, and information search.

Firm-reported presence of worker-task AI use from the BTOS AI supplement (Nov 2025–Jan 2026); descriptive percentages given, employment-weighted alternative reported; sample size not provided in excerpt.

high positive The Microstructure of AI Diffusion: Evidence from Firms, Bus... presence of worker-level AI use within firms and primary tasks where used
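
The two percentages answer different questions: the first counts every firm once, the second counts each firm in proportion to its headcount. A weighted share above the unweighted one implies that firms reporting worker-level AI use are larger than average. The sketch below shows the arithmetic on made-up firms; the data are hypothetical and are not drawn from the BTOS supplement.

```python
# Hypothetical firms: (employees, reports worker-level AI use).
FIRMS = [
    (5, False),
    (12, False),
    (40, False),
    (300, True),
]

def firm_share(firms):
    """Unweighted share: every firm counts once."""
    return sum(uses for _, uses in firms) / len(firms)

def employment_weighted_share(firms):
    """Weighted share: every firm counts in proportion to its headcount."""
    total = sum(emp for emp, _ in firms)
    return sum(emp for emp, uses in firms if uses) / total

print(f"{firm_share(FIRMS):.0%} of firms, "
      f"{employment_weighted_share(FIRMS):.0%} employment-weighted")
```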

Among adopter firms, AI is most often used in Sales and Marketing (52%), Strategy (45%), and IT (41%).

Function-specific adoption rates reported from the BTOS AI supplement descriptive statistics (Nov 2025–Jan 2026); sample restricted to adopter firms; sample sizes not stated.

high positive The Microstructure of AI Diffusion: Evidence from Firms, Bus... functional deployment proportions (Sales & Marketing, Strategy, IT)
