Evidence (8570 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
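
The matrix can be cross-checked by comparing each row's four direction columns against its stated Total. The sketch below does this for four rows copied from the table. The helper names (`parse_matrix`, `summarize`) are illustrative, and treating a dash cell as zero is an assumption rather than something the page states.

```python
# Four rows copied verbatim from the matrix above (tab-separated).
ROWS = (
    "Other\t758\t199\t100\t900\t2007\n"
    "Firm Productivity\t435\t57\t88\t20\t606\n"
    "Job Displacement\t12\t80\t20\t1\t113\n"
    "Labor Share of Income\t17\t19\t17\t—\t53\n"
)
DIRECTIONS = ("positive", "negative", "mixed", "null")


def parse_matrix(text):
    """Map each outcome name to its direction counts and stated total."""
    table = {}
    for line in text.strip().splitlines():
        outcome, *cells = line.split("\t")
        # Assumption: a dash cell means zero claims in that direction.
        counts = [0 if cell == "—" else int(cell) for cell in cells]
        table[outcome] = dict(zip(DIRECTIONS + ("total",), counts))
    return table


def summarize(row):
    """Compare the listed direction counts with the stated total."""
    listed = sum(row[d] for d in DIRECTIONS)
    return {
        "listed": listed,
        "unlisted": row["total"] - listed,
        "positive_share": round(row["positive"] / row["total"], 3),
    }


table = parse_matrix(ROWS)
for outcome, row in table.items():
    print(outcome, summarize(row))
```

For some rows the four columns fall short of the Total (for "Other" they sum to 1957 against a stated 2007), while for others they match exactly. The page does not say what accounts for the difference; one possible reading is that some claims carry a direction not shown in these columns.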

Filter: Adoption

In multi-agent configurations the semantic training gap produces a compounding failure mode termed 'semantic drift'.

Analytical description and demonstration in the paper describing multi-agent interactions and observed/argued compounding failures (conceptual demonstration; no numeric sample stated).

high negative The Semantic Training Gap: Ontology-Grounded Tool Architectu... occurrence of semantic drift (compounding errors in multi-agent setups)

The semantic training gap causes operationally incorrect outputs even when model responses are linguistically precise.

Demonstrations and examples reported in the paper showing cases where model outputs are linguistically fluent but operationally incorrect; supported by the paper's analysis and experimental illustrations (no numeric sample provided for this general claim).

high negative The Semantic Training Gap: Ontology-Grounded Tool Architectu... operational correctness of outputs (vs. linguistic precision)

There exists a 'semantic training gap': a structural disconnect between how AI systems acquire domain vocabulary through training and how manufacturing operations define meaning through ontological relationships.

Paper provides a formalization and conceptual framing of the gap (theoretical description and argumentation within the manuscript).

high negative The Semantic Training Gap: Ontology-Grounded Tool Architectu... existence of semantic training gap (structural disconnect)

LLM-based AI agents deployed in manufacturing demonstrate statistical fluency with domain terminology but lack grounded understanding of operational semantics.

Stated assertion in the paper describing observed behavior of deployed LLM agents; supported by conceptual analysis and examples/demonstrations reported in the paper (no numeric sample size given).

high negative The Semantic Training Gap: Ontology-Grounded Tool Architectu... grounded understanding of operational semantics

The gap is prompt-resistant across seven variants.

Experiments applying seven different prompt variants to the evaluated models on IMAVB showing that the representation-action mismatch and failure modes persist despite prompt changes.

high negative Senses Wide Shut: A Representation-Action Gap in Omnimodal L... decision_quality

The gap is modality-asymmetric (audio grounding underperforms vision).

Within IMAVB's 2x2 design (vision vs audio), comparative performance metrics indicate worse grounding/rejection behavior for audio-targeted conditions versus vision-targeted conditions across evaluated models.

high negative Senses Wide Shut: A Representation-Action Gap in Omnimodal L... decision_quality

Behaviorally, models fall into two failure modes: under-rejection, in which they answer misleading questions as if the false premise were true; and over-rejection, in which they reject more often but also reject standard questions, sacrificing ordinary comprehension accuracy.

Behavioral results on IMAVB showing distinct response patterns across tested models: some rarely reject misleading premises (under-rejection) while others reject too often including correct/standard questions (over-rejection), measured across the 500-clip benchmark.

high negative Senses Wide Shut: A Representation-Action Gap in Omnimodal L... error_rate
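
The two failure modes above can be told apart by comparing rejection rates on misleading and standard questions. The following is a minimal sketch of that idea, not the paper's method: the `Response` record, the function names, and both thresholds are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    misleading: bool  # True if the question rests on a false premise
    rejected: bool    # True if the model rejected the premise


def rejection_rates(responses):
    """Return (rejection rate on misleading, rejection rate on standard)."""
    misleading = [r.rejected for r in responses if r.misleading]
    standard = [r.rejected for r in responses if not r.misleading]
    if not misleading or not standard:
        raise ValueError("need at least one misleading and one standard question")
    return sum(misleading) / len(misleading), sum(standard) / len(standard)


def failure_mode(responses, min_misleading_rejection=0.5, max_standard_rejection=0.2):
    """Label a model's behavior; both thresholds are illustrative assumptions."""
    misleading_rate, standard_rate = rejection_rates(responses)
    if misleading_rate < min_misleading_rejection:
        return "under-rejection"  # answers as if the false premise were true
    if standard_rate > max_standard_rejection:
        return "over-rejection"   # also rejects ordinary questions
    return "neither"


# Toy records, invented purely to exercise the function.
rarely_rejects = (
    [Response(True, False)] * 9 + [Response(True, True)]
    + [Response(False, False)] * 10
)
rejects_too_much = (
    [Response(True, True)] * 9 + [Response(True, False)]
    + [Response(False, True)] * 4 + [Response(False, False)] * 6
)
print(failure_mode(rarely_rejects), failure_mode(rejects_too_much))
```

Keeping the two rates separate matters here: a single overall rejection rate would hide the fact that a model can look cautious only because it also rejects standard questions.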

Across eight open-source omnimodal LLMs and Gemini 3.1 Pro, we document a Representation-Action Gap: hidden states reliably encode premise–perception mismatches even when the same models almost never reject the false claim in their outputs.

Empirical evaluation on IMAVB across 9 models (8 open-source + Gemini 3.1 Pro); internal probing of hidden states showing mismatch signal and behavioral output analysis showing low rejection rates for false premises.

high negative Senses Wide Shut: A Representation-Action Gap in Omnimodal L... decision_quality

The paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics.

Conceptual analysis and problem-framing presented in the paper (qualitative identification of five mismatch categories).

high negative Agent-First Tool API: A Semantic Interface Paradigm for Ente... architectural_mismatches_between_conventional_APIs_and_autonomous_agent_requirem...

Small-scale retail businesses remain structurally excluded from these advancements due to configuration complexity, technical overhead, and limited digital capabilities.

Asserted as a problem statement in the paper; no empirical evidence, sample size, or quantitative analysis provided in the excerpt.

high negative From Configuration to Cognition: A Self-Configuring Agentic ... exclusion from AI-enhanced CRM adoption

Producing hardened, production-grade agent workflows may require extra compute and time, and these costs must be amortized through reuse across a broad user community.

Argument in paper reasoning that added rigor entails higher compute/time costs and that reuse across users is needed to amortize these costs; no empirical cost estimates provided.

high negative Engineering Robustness into Personal Agents with the AI Work... resource_costs (compute/time) and implications for amortization/adoption

By focusing on rapid, real-time synthesis, AI agents are effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them.

Conceptual argument presented in the paper asserting a qualitative mismatch between on-the-fly agents and high-stakes production needs; no empirical validation reported.

high negative Engineering Robustness into Personal Agents with the AI Work... suitability for high-stakes use / risk to users

The on-the-fly paradigm short-circuits disciplined software engineering processes—iterative design, rigorous testing, adversarial evaluation, staged deployment, and more—that have delivered relatively reliable and secure systems.

Argumentative claim in paper linking the on-the-fly loop to reduced application of standard SE processes; no empirical study, sample, or quantitative evidence provided.

high negative Engineering Robustness into Personal Agents with the AI Work... reliability and security (degree to which SE processes are applied)

Credential erosion is evident in the aggregate pattern (credentials losing signaling value relative to AI-augmented skill demonstrations).

Synthesis statement from included studies noting credential erosion alongside skill signaling changes; not quantified in the excerpt.

high negative Creation, validation, obsolescence: observed evidence of AI-... credential value / credential signaling (erosion)

Developing economies reliant on cognitive services outsourcing face disproportionate disruption through both direct exposure and indirect demand-erosion channels.

Preliminary empirical evidence across included studies indicating larger negative impacts for economies dependent on cognitive-services exports; described as preliminary but material.

high negative Creation, validation, obsolescence: observed evidence of AI-... disruption to employment/demand in developing economies reliant on cognitive ser...

Observable labor market data already document patterns consistent with AI-driven displacement rather than mere transformation—concentrated among routine cognitive tasks and junior roles.

Synthesis of observed labor market indicators from retained empirical studies since 2020 showing concentration of declines in routine cognitive tasks and junior roles.

high negative Creation, validation, obsolescence: observed evidence of AI-... concentration of job losses/displacement among routine cognitive tasks and junio...

Evidence from online labor markets shows a 2%–21% reduction in posting volumes for automatable creative tasks following ChatGPT's release.

Empirical analyses of online labor market posting volumes reported in multiple studies included in the review; range reported across studies.

high negative Creation, validation, obsolescence: observed evidence of AI-... posting volumes for automatable creative tasks on online labor markets

Across synthesized studies, postings for entry- and mid-level software development and content-creation roles in high-income economies fell by 14–41% between 2022 and 2024 (median across individual studies: 23%).

Synthesis of empirical studies retained in the systematic review (numerical range and median reported across non-overlapping study designs and geographies); no pooled meta-analytic estimate provided.

high negative Creation, validation, obsolescence: observed evidence of AI-... job postings for entry- and mid-level software development and content-creation ...

Without parallel investment in digital literacy, organizational culture, and inter-firm networks, AI will reproduce rather than reduce employment inequalities.

Authors' conclusion drawn from thematic analysis of interviews and conceptual framing; predictive statement based on qualitative findings.

high negative Artificial Intelligence, Social Capital, and Sustainable Emp... employment_inequalities

AI adoption in peripheral economies is not a purely technological or financial challenge but a social and human capital challenge, embedded in a biocultural environment shaped by brain drain, institutional thinness, and weak civic intermediation.

Synthesis of interview findings using Bitsani's Biocultural City framework; qualitative evidence from 12 interviews supports this argument.

high negative Artificial Intelligence, Social Capital, and Sustainable Emp... nature_of_challenges_to_AI_adoption

Knowledge deficits and financial constraints emerge as primary barriers [to AI adoption].

Thematic analysis of the twelve semi-structured interviews reporting these themes as primary barriers.

high negative Artificial Intelligence, Social Capital, and Sustainable Emp... barriers_to_AI_adoption

A controlled delivery-mode comparison shows that inline evaluation produces false negatives: GPT-5.1 shows 0% trust inline but 100% under both simulated and real agentic tool-use, demonstrating that delivery mode is a first-order confound.

Controlled experiments comparing inline evaluation vs simulated and real agentic tool-use on GPT-5.1; reported 0% trust in inline mode vs 100% trust in agentic modes (authors' reported results).

high negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... GPT-5.1 trust rate depending on delivery mode (inline vs agentic tool-use)

Every tested model trusts poisoned data at 100% at moderate attacker sophistication (L2), with 269 valid trials (of 270) accepting fabricated security claims under directed queries.

Primary experimental results across 270 directed-query trials (9 models × 30 each); authors report 269 of 270 trials accepted fabricated security claims under attacker sophistication level L2.

high negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... rate at which models accept fabricated security claims when querying poisoned gr...
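
The trial counts in this entry can be checked with a few lines of arithmetic. The excerpt does not say whether the 270th trial was invalid or simply did not accept the claim, so the sketch below computes the acceptance rate under both readings; each reading is an assumption, and the variable names are illustrative.

```python
MODELS = 9
TRIALS_PER_MODEL = 30
ACCEPTING_TRIALS = 269

total_trials = MODELS * TRIALS_PER_MODEL  # 9 x 30 directed-query trials

# Reading A (assumption): one trial was invalid, so 269 valid trials all accepted.
valid_trials = 269
rate_valid_only = ACCEPTING_TRIALS / valid_trials

# Reading B (assumption): all trials count and one did not accept.
rate_all_trials = ACCEPTING_TRIALS / total_trials

print(f"valid trials only: {rate_valid_only:.1%}")
print(f"all trials: {rate_all_trials:.1%}")
```

Only the first reading is consistent with the stated 100% trust rate for every tested model.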

We demonstrate six attack scenarios against a production 42-million-node code knowledge graph, providing the first empirical demonstration of knowledge graph poisoning against a production-scale agentic system.

Empirical demonstrations described in paper: six distinct attack scenarios executed against a production knowledge graph containing 42 million nodes (authors' reported experimental setup).

high negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... successful execution of poisoning attacks on a production-scale knowledge graph

We define Oracle Poisoning, an attack class in which an adversary corrupts a structured knowledge graph that AI agents query at runtime via tool-use protocols, causing incorrect conclusions through correct reasoning.

Conceptual definition presented by the authors in the paper (theoretical framing and distinction from prompt injection).

high negative Oracle Poisoning: Corrupting Knowledge Graphs to Weaponise A... agent reasoning correctness when querying corrupted knowledge graphs

Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens.

Argument in paper that existing governance/audit tools designed for ranked-list or older UIs do not cover the new single-sentence prose-recommendation surface; no empirical test reported in excerpt.

high negative TourMart: A Parametric Audit Instrument for Commission Steer... coverage/effectiveness of existing governance tools for prose recommendations

Current AI development trajectory reflects value choices that prioritize conversational generality over domain specificity, accountability, and long-term social sustainability.

Normative/critical analysis in the paper highlighting design priorities and trade-offs; no empirical measurement provided.

high negative What if AI systems weren't chatbots? Relative prioritization of conversational generality versus domain specificity, ...

Sustained investment in large-scale chatbot infrastructures increases environmental costs.

Paper asserts environmental impacts from infrastructure investment (energy, resource use) as part of systemic critique; no quantified environmental measurements or sample size reported.

high negative What if AI systems weren't chatbots? Environmental costs associated with energy/resource use of chatbot infrastructur...

Chatbot-driven AI development contributes to concentration of economic power.

Argumentation about industry dynamics and infrastructure centralization in the paper; no empirical market-concentration metrics or sample provided.

high negative What if AI systems weren't chatbots? Concentration of economic power among firms/platforms producing and hosting chat...

The normalization of chatbots contributes to labor displacement.

Theoretical argument linking widespread chatbot adoption to changes in work and employment; no empirical displacement estimates provided.

high negative What if AI systems weren't chatbots? Labor displacement (job losses attributable to chatbot adoption)

Normalization of chatbot-mediated interaction alters patterns of work, learning, and decision-making, contributing to deskilling, homogenization of knowledge, and shifting expectations of expertise.

Analytical reasoning and literature-informed claims in the paper; no quantitative measurement or sample reported.

high negative What if AI systems weren't chatbots? Levels of skill retention/acquisition (deskilling), diversity of knowledge (hom...

Chatbot-based systems often fail to adequately meet user needs, particularly in complex or high-stakes contexts, while projecting confidence and authority.

Qualitative argumentation and illustrative examples in the paper; no reported controlled empirical study or sample size.

high negative What if AI systems weren't chatbots? Adequacy of chatbot responses to user needs in complex/high-stakes contexts and ...

The chatbot paradigm is not a neutral interface choice, but a dominant sociotechnical configuration whose widespread adoption reshapes social, economic, legal, and environmental systems.

Conceptual argument and synthesis in the paper (theoretical analysis); no empirical sample or quantitative data reported.

high negative What if AI systems weren't chatbots? Degree to which chatbot adoption reshapes social, economic, legal, and environme...

This frequently leads to excessive reliance on mechanistic interpretability to address a deployment challenge that lies beyond its intended scope.

Author argument drawing on conceptual critique and cited empirical distinctions (paper's argumentative content).

high negative The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... appropriateness of mechanistic interpretability as a gate for deployment

AI deployment in sensitive domains (health care, credit, employment, criminal justice) is often treated as unsafe to authorize until model internals can be explained.

Author assertion based on observed regulatory and institutional tendencies described in the paper (argumentative / contextual evidence within the paper).

high negative The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... authorization policy stance toward AI in sensitive domains (requirement for inte...

A scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study.

Paper references a scoping review that examined FDA-approved AI/ML device documents and reported the 9.0% figure.

high negative The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... presence of prospective post-market surveillance study in FDA AI/ML device docum...

A 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action.

Paper cites a recent empirical finding reporting a 53 percentage-point gap between models' internal representations and their ability to correct outputs (described as 'recent evidence').

high negative The Open-Box Fallacy: Why AI Deployment Needs a Calibrated V... gap between internal model representations and ability to correct outputs

Functional deployment and operational investment in AI are associated with employment declines.

Regression analyses from the BTOS AI supplement linking measures of functional AI deployment and operational AI investment to firm-reported employment changes; observational associations (sample size and exact model specification not shown in excerpt).

high negative The Microstructure of AI Diffusion: Evidence from Firms, Bus... employment change associated with functional deployment and operational investme...

Employment reductions attributable to AI are rare: only 2% of firms report employment reductions.

Firm self-reports on employment outcomes related to AI from the BTOS AI supplement (Nov 2025–Jan 2026); descriptive statistic reported; sample size not excerpted.

high negative The Microstructure of AI Diffusion: Evidence from Firms, Bus... reported employment reductions due to AI

Among firms with worker-level AI use, 65% restrict use to three or fewer tasks.

Descriptive statistic from BTOS AI supplement giving distribution of number of worker tasks using AI among firms that report worker-level use; sample size not shown.

high negative The Microstructure of AI Diffusion: Evidence from Firms, Bus... breadth of worker-task AI use per firm (number of tasks)

Among adopter firms, scope remains limited: 57% use AI in three or fewer functions.

Descriptive distribution of number of business functions using AI among adopter firms in the BTOS AI supplement (Nov 2025–Jan 2026); sample restricted to adopter firms (sample size not provided).

high negative The Microstructure of AI Diffusion: Evidence from Firms, Bus... number of business functions using AI per adopting firm (breadth of functional d...

In labor-intensive sub-sectors, industrial robots shorten the backward linkage length.

Heterogeneity analysis in the paper comparing effects across labor-intensive sub-sectors within the panel of 14 manufacturing sub-sectors; reported finding of a negative effect on backward linkage length in labor-intensive industries.

high negative Research on the impact of industrial robot application on th... backward linkage length (a component of global value chain length) in labor-inte...

Institutional inertia in property valuation poses risks to asset pricing, collateral risk modelling and investor confidence.

Analytical inference from interview findings and theoretical synthesis highlighting implications for property investment and financial market stability.

high negative Exploring barriers to valuation technology adoption in prope... risks to asset pricing, collateral risk modelling and investor confidence

Despite advances in automation, data analytics and AI, the sector has been slow to digitise.

Background statement supported by interview data and sector observation reported in the study.

high negative Exploring barriers to valuation technology adoption in prope... pace of digitisation in the property valuation sector

The IDOI framework provides a transferable model for understanding digital transformation in regulated, high-trust professions and highlights the market-level risks of institutional inertia in property valuation.

Development of the IDOI conceptual framework from qualitative data and theoretical integration; authors' claim about transferability and implications.

high negative Exploring barriers to valuation technology adoption in prope... transferability of the framework and market-level risks from institutional inert...

Generational divides, protectionist attitudes and fears of automation reinforce digital resistance.

Qualitative interview evidence reporting attitudes across cohorts of valuers and firm personnel; thematic analysis identifying cultural and attitudinal themes.

high negative Exploring barriers to valuation technology adoption in prope... cultural/attitudinal resistance to VTech

The Valuers Act (1948), fragmented infrastructure and sovereignty concerns limit innovation.

Interview data from practitioners, firm leaders and regulators in New Zealand citing specific regulatory and infrastructure constraints; thematic analysis.

high negative Exploring barriers to valuation technology adoption in prope... regulatory and infrastructure constraints on innovation

Barriers to adoption arise primarily from institutional conservatism, outdated regulation and weak data governance rather than technical shortcomings.

Qualitative semi-structured interviews with valuers, firm leaders and regulators in New Zealand; thematic analysis guided by Rogers' diffusion of innovations and institutional theory synthesised into the IDOI framework.

high negative Exploring barriers to valuation technology adoption in prope... barriers to VTech adoption

Even access to the true conditional vulnerability probability cannot eliminate misallocation: aleatoric uncertainty over individual vulnerability status is irreducible, and probabilistic targeting inevitably misallocates some resources.

Theoretical argument in the paper (conceptual/theoretical result about irreducible aleatoric uncertainty and its implications for probabilistic targeting).

high negative The Limits of AI-Driven Allocation: Optimal Screening under ... misallocation of resources (allocation error due to aleatoric uncertainty)

Opaque agent objectives, synthetic traffic loops, and the indistinguishability between human-originated and agent-mediated signals are critical measurement problems examined in the paper.

Conceptual examination and literature synthesis; the paper discusses these as open problems rather than providing primary empirical solutions.

high negative The Vanishing User: Web Analytics in an Agent-Dominated Inte... degree of opacity and indistinguishability of agent-mediated versus human-origin...
