Evidence (11677 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	609	159	77	738	1617
Governance & Regulation	671	334	160	99	1285
Organizational Efficiency	626	147	105	70	955
Technology Adoption Rate	502	176	98	78	861
Research Productivity	349	109	48	322	838
Output Quality	391	121	45	40	597
Firm Productivity	385	46	85	17	539
Decision Quality	277	145	63	34	526
AI Safety & Ethics	189	244	59	30	526
Market Structure	152	154	109	20	440
Task Allocation	158	50	56	26	295
Innovation Output	178	23	38	17	257
Skill Acquisition	137	52	50	13	252
Fiscal & Macroeconomic	120	64	38	23	252
Employment Level	93	46	96	12	249
Firm Revenue	130	43	26	3	202
Consumer Welfare	99	51	40	11	201
Inequality Measures	36	106	40	6	188
Task Completion Time	134	18	6	5	163
Worker Satisfaction	79	54	16	11	160
Error Rate	64	79	8	1	152
Regulatory Compliance	69	66	14	3	152
Training Effectiveness	82	16	13	18	131
Wages & Compensation	70	25	22	6	123
Team Performance	74	16	21	9	121
Automation Exposure	41	48	19	9	120
Job Displacement	11	71	16	1	99
Developer Productivity	71	14	9	3	98
Hiring & Recruitment	49	7	8	3	67
Social Protection	26	14	8	2	50
Creative Output	26	14	6	2	49
Skill Obsolescence	5	37	5	1	48
Labor Share of Income	12	13	12	—	37
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.

Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).

high negative Auditing and Controlling AI Agent Actions in Spreadsheets user oversight ability

A threat model taxonomy mapping misuse vectors to hardware, software, institutional, and liability layers illustrates why no single governance mechanism suffices.

Threat model taxonomy developed in the paper (conceptual taxonomy; illustrative mapping rather than empirical testing).

high negative The Open-Weight Paradox: Why Restricting Access to AI Models... completeness/adequacy of single governance mechanisms

Restricting access to open-weight models deepens asymmetries while driving proliferation into unsupervised settings.

Argumentation and threat-model reasoning in the paper describing likely consequences of restrictions (theoretical analysis; no empirical sample cited).

high negative The Open-Weight Paradox: Why Restricting Access to AI Models... geopolitical asymmetries and proliferation into unsupervised settings

Access restrictions, without governed alternatives, may displace risks rather than reduce them.

Theoretical argument and threat-model analysis in the paper showing possible risk displacement (conceptual reasoning; no empirical sample reported).

high negative The Open-Weight Paradox: Why Restricting Access to AI Models... risk displacement vs risk reduction from access restrictions

Selective forgetting remains underexplored compared to retention in LLM agent memory research.

Authors' literature survey / position statement in paper (assertion made in abstract).

high negative FSFM: A Biologically-Inspired Framework for Selective Forget... extent of research coverage on forgetting vs retention

Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.

Interview data (over 30 interviews) reporting organizational challenges, including limited AI literacy, diverse cultural attitudes across organizations, and governance that lags behind agentic AI capabilities.

high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... organizational readiness factors (AI literacy, culture, governance alignment)

Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.

Stakeholder interviews (over 30) reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.

high negative Agentic AI in Engineering and Manufacturing: Industry Perspe... barriers to AI adoption in engineering/manufacturing

Providing agents feedback about past performance makes them worse at information aggregation and reduces their profits.

Experimental condition where agents received feedback about past performance; compared aggregation (log error of last price) and profits with and without feedback and found worse aggregation and lower profits when feedback was given.

high negative Information Aggregation with AI Agents information aggregation (log error of the last price) and profits

Increasing the complexity of the information structure has a significant and negative impact on information aggregation, suggesting AI agents may suffer from the same limitations as humans when reasoning about others.

Experimental manipulation of information-structure complexity in the controlled trading experiment; measured change in aggregation performance (log error of last price) as complexity increases.

high negative Information Aggregation with AI Agents information aggregation (log error of the last price)

The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems.

Author's literature-based observation and critique in the paper's introduction (conceptual argument; no empirical sample reported).

high negative Relative Principals, Pluralistic Alignment, and the Structur... framing of the problem in the literature

Users push back against agent outputs -- through corrections, failure reports, and interruptions -- in 44% of all turns.

Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... rate of user pushback per interaction turn

Agent-written code introduces more security vulnerabilities than code authored by humans.

Comparative analysis of security vulnerabilities attributed to agent-authored code versus human-authored code within the SWE-chat dataset (method details not specified in excerpt).

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... security vulnerabilities introduced by agent-written code versus human-written c...

Just 44% of all agent-produced code survives into user commits.

Empirical measurement of code provenance and survival within the SWE-chat dataset: proportion of agent-produced code that becomes part of subsequent user commits across sessions.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... survival/usefulness of agent-produced code (proportion incorporated into commits...

Despite rapidly improving capabilities, coding agents remain inefficient in natural settings.

Authors' summary claim supported by dataset-derived metrics such as agent code survival rate (44%) and user pushback (44% of turns); observational analysis of SWE-chat.

high negative SWE-chat: Coding Agent Interactions From Real Users in the W... overall agent efficiency in natural developer workflows (qualitative synthesis)

Regulated deployment imposes four load-bearing systems properties — deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale — and stateful architectures violate them by construction.

Conceptual/architectural argument presented in the paper (theoretical analysis), not an empirical measurement in the abstract.

high negative Stateless Decision Memory for Enterprise AI Agents compatibility of stateful architectures with regulatory/system properties

Evaluation of four leading AI platforms shows that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient.

Empirical evaluation described in paper: four AI platforms tested on benchmark; reported average accuracy of 15% for RAG-based approaches on cases with insufficient information.

high negative Learning When Not to Decide: A Framework for Overcoming Fact... accuracy on cases where information is insufficient (inconclusive cases)

Unemployment insurance adjudication has seen rapid integration of AI systems and the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually.

Contextual/introductory claim in paper; references to domain-scale impact and bottleneck; no specific numeric study sample provided in excerpt.

high negative Learning When Not to Decide: A Framework for Overcoming Fact... scale of impact (number of applicants affected) and fact-finding bottleneck in a...

A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking.

Statement in paper framing the problem; general literature/contextual claim (no specific experiment cited in the excerpt).

high negative Learning When Not to Decide: A Framework for Overcoming Fact... tendency to provide confident answers when information is lacking (presumptuousn...

Brevity, semantic isolation and rhetorical register independently predict representational outcome (i.e., which submissions are included/excluded in summaries).

Statistical/semantic analysis (presumably regression or causal inference) reported in the paper linking textual features—brevity, semantic isolation, rhetorical register—to representational outcomes.

high negative Participatory provenance as representational auditing for AI... predictive relationship between textual features and representational outcome (c...

Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI, with exclusion rates of 33%–88% in such clusters.

Cluster/semantic analysis reported in the paper showing higher exclusion rates for clusters labeled as dissent/scepticism/critique.

high negative Participatory provenance as representational auditing for AI... cluster-level exclusion rate for dissenting/sceptical/critical clusters

In topic B, 15.3% of participants are effectively excluded by the official summary.

Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.

high negative Participatory provenance as representational auditing for AI... participant exclusion rate

In topic A, 16.9% of participants are effectively excluded by the official summary.

Empirical measurement reported in the paper quantifying participants 'effectively excluded' when comparing source submissions to official summary coverage.

high negative Participatory provenance as representational auditing for AI... participant exclusion rate

Both official government summaries underperform a random-participant baseline for topic B (coverage degradation of -8.0%).

Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.

high negative Participatory provenance as representational auditing for AI... coverage (coverage degradation relative to random baseline)

Both official government summaries underperform a random-participant baseline for topic A (coverage degradation of -9.1%).

Empirical comparison in the paper between official government summary and a random-participant baseline using the n=5,253 consultation responses.

high negative Participatory provenance as representational auditing for AI... coverage (coverage degradation relative to random baseline)

No single policy instrument is sufficient to produce high regional science and technology industrial competitiveness.

Result of fuzzy-set qualitative comparative analysis (fsQCA) on AI policy instruments issued by provincial-level governments in China, reported in the study; fsQCA finds no individual condition is sufficient.

high negative How Can Artificial Intelligence Policies Promote the Sustain... regional science and technology industrial competitiveness

None of the LLMs tested endorsed fraudulent investments (0% endorsement rate across all models).

Preregistered experiment across seven leading LLMs producing 3,360 AI advisory conversations; reported 0% endorsement of objectively fraudulent opportunities.

high negative Large Language Models Outperform Humans in Fraud Detection a... endorsement rate of fraudulent investments by LLMs

Endorsement reversal occurred in fewer than 3 in 1,000 observations.

Observed incidence reported from the preregistered experiment (3,360 AI advisory conversations); statement in paper reporting incidence <3/1,000.

high negative Large Language Models Outperform Humans in Fraud Detection a... rate of endorsement reversal (AI shifting from warning to endorsing fraudulent o...

Critical gaps persist in explainability, regulatory alignment, ethical governance, and context-specific validation.

Authors' synthesis and Conclusion listing persistent shortcomings identified across the reviewed literature.

high negative AI-Driven Financial Risk Management and Decision Intelligenc... presence of gaps in explainability, regulation, ethics, and validation

Integration of decision intelligence principles into AI applications for financial risk management in emerging markets is nascent.

Authors' synthesis noting limited presence of decision intelligence frameworks or hybrid human-AI decision processes across the reviewed literature.

high negative AI-Driven Financial Risk Management and Decision Intelligenc... degree of decision intelligence integration

There is limited empirical validation of AI approaches in emerging market settings.

Review finding described in Results and Conclusion: comparatively few studies provide robust, context-specific empirical validation for emerging markets despite general claims of effectiveness.

high negative AI-Driven Financial Risk Management and Decision Intelligenc... extent of empirical validation in emerging markets

Disparities emerge and compound across stages of the ML pipeline (training data, model predictions, and post-processing).

Pipeline-level analysis reported in paper showing sources of disparity at multiple stages and how effects accumulate from training data through prediction to post-processing.

high negative Fairness Audits of Institutional Risk Models in Deployed ML ... cumulative disparity across pipeline stages

Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers.

Analysis of the pipeline showing that converting model probabilities into percentile-based risk tiers (post-processing step) increases observed disparities across demographic groups.

high negative Fairness Audits of Institutional Risk Models in Deployed ML ... change in disparity magnitude after post-processing (probability → percentile ri...
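The collapsing step described in this claim can be illustrated with a minimal sketch. It is not the audited system's code: the function name `percentile_tiers`, the three-tier split, and the probabilities are all hypothetical, chosen only to show how rank-based tiers can group students with very different predicted risk while separating students with nearly identical risk.

```python
# Hypothetical illustration: the scores and the number of tiers are invented,
# not taken from the paper or from any deployed early-warning system.

def percentile_tiers(probs, n_tiers=3):
    """Assign each probability a tier (0 = lowest risk) by its rank in the cohort."""
    # Indices of students ordered from lowest to highest predicted risk.
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    tiers = [0] * len(probs)
    for rank, idx in enumerate(order):
        # Equal-sized rank bands; the probability's magnitude is discarded here.
        tiers[idx] = rank * n_tiers // len(probs)
    return tiers

probs = [0.05, 0.10, 0.12, 0.40, 0.45, 0.90]
tiers = percentile_tiers(probs, n_tiers=3)

for p, t in zip(probs, tiers):
    print(f"predicted risk {p:.2f} -> tier {t}")
```

In this toy cohort, the students at 0.45 and 0.90 land in the same top tier, while the students at 0.10 and 0.12 are split across tiers. Any group whose scores cluster near a tier boundary would be affected by this kind of cut, which is one plausible route by which tiering could widen disparities.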

Older and female students with comparable dropout risk are under-identified by the EWS.

Audit comparison showing lower identification/flagging rates for older and female students who have comparable modeled or observed dropout risk to other groups; reported as part of the pipeline disparities analysis.

high negative Fairness Audits of Institutional Risk Models in Deployed ML ... identification/flagging rate for support relative to comparable dropout risk

Younger, male, and international students are disproportionately flagged for support by the EWS, even when many ultimately succeed.

Empirical results from the replica-based audit comparing model predictions and post-processing flags against eventual student outcomes; disparities reported by demographic groups (age, gender, residency). Exact sample size and numerical metrics not provided in the abstract.

high negative Fairness Audits of Institutional Risk Models in Deployed ML ... rate of being flagged for support (EWS risk flag) versus eventual success/dropou...

Recent policy and academic discourse has increasingly acknowledged the infeasibility of full-stack AI sovereignty, but has not yet provided an integrating theoretical architecture for governing dependence under these conditions.

Literature/policy-discourse claim made in the paper (review/interpretation). No empirical sampling or quantitative evidence reported in the provided text.

high negative Digital Sovereignty in the Global Cognitive-Informational Or... feasibility of full technological autonomy (full-stack AI sovereignty) and the pr...

The concentration of AI-related infrastructures is coalescing into distinct geocognitive power poles whose competing infrastructural ecosystems generate structural asymmetries that position small and medium-sized states within regimes of cognitive-informational dependence.

Theoretical/geopolitical argument introduced in the paper (conceptual framing). No empirical sample size or quantitative measurement provided in the excerpt.

high negative Digital Sovereignty in the Global Cognitive-Informational Or... structural asymmetries and dependence of small and medium-sized states on domina...

There is a growing concentration of computational capacity, data ecosystems, and advanced model architectures within a limited number of technological actors, signaling the emergence of a cognitive-informational order in which influence is exercised through the architectures that shape how knowledge is generated, interpreted, and operationalized.

Theoretical/observational assertion in the paper (conceptual synthesis). No empirical details, sample sizes, or quantitative analyses provided in the supplied text.

high negative Digital Sovereignty in the Global Cognitive-Informational Or... concentration of technological capabilities and resulting influence over knowled...

The policy and research challenge posed by platform-mediated automation is not merely job quantity (technological unemployment) but institutional continuity — how societies reproduce practical competence when platforms optimize for efficiency rather than formation.

Normative and conceptual claim developed through literature synthesis (institutional economics, platform governance, workforce development); presented as an analytical reframing rather than an empirically tested hypothesis.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... institutional continuity and human capital reproduction (quality of workforce fo...

Entry-level roles have historically functioned as apprenticeships in which workers acquire tacit knowledge and critical judgment; if platforms curtail these formative occupational layers, organizations may lack future workers capable of exercising contextual reasoning required to manage complex systems.

Institutional economics and workforce development literature cited in the paper; conceptual synthesis without original empirical measurement reported.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... human capital formation (tacit knowledge acquisition and contextual reasoning ca...

Platform-mediated automation risks hollowing out labor structures from both directions: eroding repetitive, junior roles from below and automating supervisory coordination functions from above.

Theoretical argument synthesizing institutional economics and platform literature; articulated as a conceptual risk rather than demonstrated with original empirical data.

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... structural change in occupational layers (hollowing out of junior and supervisor...

Algorithmic systems are displacing routine tasks across both low-wage entry-level work and middle-management functions.

Stated in paper's argumentation; supported by a literature-based review drawing on platform governance literature and recent research on AI-enhanced automation (no original empirical sample or quantitative study reported).

high negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... displacement of routine tasks (across entry-level and middle-management roles)

The observed negative OPM effect is consistent with short-term 'J-curve' transition costs (process redesign and capability buildup) during early AI adoption.

Interpretation of empirical patterns (short-term decline in OPM concurrent with no ROA change) offered by the authors as an explanatory mechanism; not presented as separately estimated or experimentally tested.

high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin dynamics / transition costs interpretation

AI adoption had a significantly negative impact on the operating profit margin (OPM).

Causal analysis of KOSDAQ-listed companies (2018–2025) with AI-adoption timing identified via multi-step, contextually validated text analysis of DART business reports; endogeneity addressed using two-way fixed effects (TWFE) and Propensity Score Matching (PSM).

high negative The Dynamic Causal Effects of Corporate AI Adoption on Profi... operating profit margin (OPM)

For agentic systems, there are three structural breaks: decision diffusion, evidence fragmentation, and responsibility ambiguity.

Analytical identification and labeling of three specific structural problems for agentic AI within the paper's argumentation.

high negative Governed Auditable Decisioning Under Uncertainty: Synthesis ... types of structural governance failures in agentic AI

The paper introduces the 'cascade of uncertainty', showing how governance failures propagate through serial dependencies between framework layers.

Conceptual/theoretical model introduced and analyzed in the paper (cascade model linking framework layers and failure propagation).

high negative Governed Auditable Decisioning Under Uncertainty: Synthesis ... propagation of governance failure/uncertainty across framework layers

Agentic AI systems encounter structural breaks that prevent normal framework fillability.

Paper's analytic assessment reports that agentic AI systems cause structural breaks undermining the framework's ability to fill DES-properties.

high negative Governed Auditable Decisioning Under Uncertainty: Synthesis ... framework fillability / governance evidence coverage in agentic systems

Classical ML systems achieve only minimal DES-property fillability.

Analytic comparison in the paper classifies classical ML systems as providing minimal governance evidence fillability.

high negative Governed Auditable Decisioning Under Uncertainty: Synthesis ... DES-property fillability

When automated decision systems fail, organizations frequently discover that formally compliant governance infrastructure cannot reconstruct what happened or why.

Asserted by the paper as an observed problem motivating the study; presented as a general empirical/experiential claim (literature/examples synthesis) rather than a controlled empirical estimate.

high negative Governed Auditable Decisioning Under Uncertainty: Synthesis ... ability of governance infrastructure to reconstruct decisions (post-hoc explaina...

Artificial intelligence introduces systemic risks through unprovenanced AI-derived metadata.

Cautionary claim made by the authors; stated as a systemic risk linked to provenance issues of AI-generated metadata, without empirical incident data in the excerpt.

high negative Market Dynamics, Governance and Open Research Metadata in th... systemic risk from unprovenanced AI-derived metadata (e.g., reduced trust, relia...

The debate about scholarly knowledge infrastructure has long been framed as a contest between openness and commercial enclosure, and this framing distorts both policy and practice.

Conceptual/persuasive claim made in the paper's opening paragraph; no empirical data or sample reported in the excerpt.

high negative Market Dynamics, Governance and Open Research Metadata in th... policy and practice framing (openness vs commercial enclosure)
