Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Human-AI collaboration (HAI-C) task complexity increases employees' HAI-C tech-learning anxiety.

Longitudinal survey data (n=497) analyzed with hierarchical regression; the Results section reports that task complexity amplifies tech-learning anxiety.

high positive How does human-AI collaboration task complexity affect emplo... HAI-C tech-learning anxiety

When models err, their incorrect predictions disproportionately lean intervention-oriented.

Error analysis of model predictions showing that among incorrect predictions, a larger share favor intervention-oriented causal signs than market-oriented ones (directional skew in errors).

high positive Ideological Bias in LLMs' Economic Causal Reasoning directional bias in errors (proportion of errors that are intervention-oriented)

Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones.

Model-by-model accuracy comparison broken down by whether the empirically verified causal sign aligns with intervention-oriented vs market-oriented expectations; observed higher accuracy for intervention-aligned cases in 18/20 models.

high positive Ideological Bias in LLMs' Economic Causal Reasoning accuracy conditional on ideological alignment (intervention-oriented vs market-o...

GenAI-related benefits are likely to materialize only when AI capabilities are embedded in standardized routines, integrated data infrastructures, and cross-functional governance arrangements (organizational embedding).

Paper's synthesized process model and interpretive case evidence from the three firms indicating organizational conditions required for observed/documented AI effects.

high positive Research on the Impact of Generative AI on the Quality of Ma... realization of GenAI benefits for management accounting decision quality

GenAI-related capabilities enhance analysis by translating complex data into more interpretable, scenario-sensitive, and action-oriented outputs (analytical augmentation).

Interpretive finding from analysis of disclosures and literature; presented as a second linked mechanism through which GenAI may influence management accounting.

high positive Research on the Impact of Generative AI on the Quality of Ma... management accounting decision quality (via improved analysis/interpretability)

GenAI-related capabilities broaden the informational basis of management accounting by making operational, service, quality, and ecosystem data more usable in planning and control (information enrichment).

Interpretive inference from corporate disclosures of the three firms and review of AI-and-accounting literature; described as a primary mechanism in the paper.

high positive Research on the Impact of Generative AI on the Quality of Ma... management accounting decision quality (via information breadth/usability)

Meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.

Authors' conclusion drawn from the formative (N=8) and summative (N=16) studies and associated observations.

high positive Auditing and Controlling AI Agent Actions in Spreadsheets oversight effectiveness (design implication favoring in-line/active participatio...

Users reported a sense of co-ownership over the resulting output.

Participant self-reports from the formative and/or summative studies (authors report users expressed co-ownership of outputs when participating in execution).

high positive Auditing and Controlling AI Agent Actions in Spreadsheets sense of ownership / co-ownership

Users detected errors that post-hoc review would have failed to surface.

Empirical observation reported from the studies (authors report that active participation allowed users to detect errors that would be missed by post-hoc review).

high positive Auditing and Controlling AI Agent Actions in Spreadsheets error detection (compared to post-hoc review)

Users identified their own intent reflected in the agent's actions.

Reported participant observations/self-reports from the formative (N=8) and/or summative (N=16) studies; claim presented as a finding of the evaluations.

high positive Auditing and Controlling AI Agent Actions in Spreadsheets alignment between user intent and agent actions

A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow.

Empirical evaluation consisting of a formative study with N=8 and a within-subjects summative evaluation with N=16 comparing Pista to a baseline agent (authors report influence on task outcomes, comprehension, perception, and role).

high positive Auditing and Controlling AI Agent Actions in Spreadsheets task outcomes (primary claim), plus user comprehension, perception, and role sen...

We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step.

System description / design contribution presented by the authors (implementation description rather than empirical evidence).

high positive Auditing and Controlling AI Agent Actions in Spreadsheets availability of auditable, controllable actions and ability to intervene

Effective governance of AI as a dual-use technology will likely require a multilateral institutional architecture functionally analogous (though not identical) to the role performed by the IAEA in the nuclear domain, with explicit safeguards against co-option of hardware controls for domestic repression.

Normative institutional design argument and analogy to the IAEA presented in the paper (policy proposal; comparative institutional analysis).

high positive The Open-Weight Paradox: Why Restricting Access to AI Models... need for multilateral institutional governance to manage dual-use AI

Hardware-layer governance, including chip-level attestation mechanisms such as FlexHEG, trusted execution environments, confidential computing, and complementary software-layer safeguards, offers a defense-in-depth alternative to the current binary framing of openness vs restriction.

Proposed governance architecture and technical discussion in the paper citing concrete mechanisms (technical-proposal and conceptual analysis; no experimental or deployment data reported in the summary).

high positive The Open-Weight Paradox: Why Restricting Access to AI Models... effectiveness of hardware-plus-software safeguards as an alternative governance ...

The global concentration of compute infrastructure makes open-weight models one of the most viable pathways to sovereign AI capacity in the Global South.

Analysis of global compute infrastructure concentration and pathway mapping in the paper (conceptual/structural analysis; no numerical sample provided in the summary).

high positive The Open-Weight Paradox: Why Restricting Access to AI Models... pathways to sovereign AI capacity (access/adoption of open-weight models)

Selective forgetting should be considered a fundamental capability for next-generation LLM agents operating in real-world, resource-constrained scenarios.

Conclusion/argument in paper based on conceptual analysis and reported empirical benefits.

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... necessity of selective forgetting for future LLM agents

The work bridges cognitive neuroscience (hippocampal indexing/consolidation theory and Ebbinghaus forgetting curve) and AI systems to inform forgetting mechanisms.

Claimed theoretical grounding and cross-disciplinary framing in paper (stated in abstract).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... theoretical alignment between neuroscience and AI forgetting mechanisms

Empirical results show 100% elimination of security risks.

Reported experimental result in abstract claiming full elimination of security risks.

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... security risk elimination

Empirical results show improved content quality: a 29.2% increase in signal-to-noise ratio.

Reported experimental result in abstract (signal-to-noise ratio improvement).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... content quality (signal-to-noise ratio)

Empirical results show access efficiency improved by 8.49%.

Reported experimental result in abstract.

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... access efficiency

Building on advances in LLM agent architectures and vector databases, the paper presents detailed specifications, implementation strategies, and empirical validation from controlled experiments.

Methodological claim in abstract indicating implementation and controlled experiments (no experimental details in abstract).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... presence of implementation details and experimental validation

Selective forgetting improves security through active forgetting of malicious inputs, sensitive data, and privacy-compromising content.

Authors' taxonomy and safety-triggered forgetting mechanism; abstract reports empirical security performance (100% elimination of security risks).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... security performance (elimination of security risks)

Selective forgetting improves content quality by dynamically updating outdated preferences and context.

Conceptual claim supported by authors' implementation and empirical validation; abstract reports content quality improvement (signal-to-noise ratio).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... content quality (signal-to-noise ratio)

A well-designed forgetting mechanism improves efficiency via intelligent memory pruning.

Claim supported by authors' framework and controlled experiments reported in the paper (abstract references empirical results for access efficiency).

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... access efficiency
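
As a rough illustration of memory pruning driven by a forgetting curve, the sketch below scores each memory with an Ebbinghaus-style exponential retention, R = exp(-t / S), and drops entries that fall below a threshold. The `MemoryItem` fields, the `strength` parameter, and the 0.3 threshold are hypothetical choices made for illustration. This is not the FSFM implementation, whose details are not given in the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    age_hours: float   # time since the item was last accessed (hypothetical field)
    strength: float    # larger values decay more slowly (hypothetical field)


def retention(item: MemoryItem) -> float:
    """Ebbinghaus-style exponential retention R = exp(-t / S)."""
    return math.exp(-item.age_hours / item.strength)


def prune(items: list[MemoryItem], threshold: float = 0.3) -> list[MemoryItem]:
    """Keep only items whose retention score is at or above the threshold."""
    return [item for item in items if retention(item) >= threshold]


# Hypothetical memories: a recent, durable preference and a stale, weak note.
memories = [
    MemoryItem("user prefers metric units", age_hours=2.0, strength=48.0),
    MemoryItem("one-off debugging note", age_hours=72.0, strength=12.0),
]
kept = prune(memories)
```

In this toy setup the stale, low-strength note is pruned while the recent preference survives. A real system would also need the safety-triggered forgetting the claims describe, which this sketch does not attempt.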

In resource-constrained environments, a well-designed forgetting mechanism is as crucial as remembering.

Argument and conceptual analysis in paper; motivated by theoretical considerations and (claimed) empirical validation.

high positive FSFM: A Biologically-Inspired Framework for Selective Forget... relative importance of forgetting vs remembering for system performance

The findings point to a staged progression of AI utility from low-consequence assistance toward higher-order automation, as trust, infrastructure, and verification mature.

Synthesis of interview responses (over 30) indicating current use cases are lower-risk assistance and that stakeholders expect (or prefer) gradual progression toward automation contingent on trust/infrastructure/verification improvements.

high positive Agentic AI in Engineering and Manufacturing: Industry Perspe... trajectory of AI deployment (from assistance to automation) conditional on matur...

Reliability, verification, and auditability are central requirements for adoption, driving human-in-the-loop frameworks and governance aligned with existing engineering reviews.

Consistent themes from interviews (over 30) indicating stakeholders prioritize reliability, verifiability, and audit trails, leading to preference for human-in-the-loop designs integrated with current review processes.

high positive Agentic AI in Engineering and Manufacturing: Industry Perspe... requirements driving adoption decisions (reliability, verification, auditability...

Higher-value agentic gains come from orchestrating multi-step workflows across tools.

Observed and reported in interviews (over 30) with stakeholders in engineering and manufacturing workflows describing value from agentic orchestration across tools.

high positive Agentic AI in Engineering and Manufacturing: Industry Perspe... value generated by agentic AI when coordinating multi-step toolchains

Near-term AI gains cluster around structured, repetitive work and data-intensive synthesis.

Qualitative findings from an exploratory state-of-practice study based on over 30 semi-structured interviews across four stakeholder groups (large enterprises, small/medium firms, AI developers, and CAD/CAM/CAE vendors).

high positive Agentic AI in Engineering and Manufacturing: Industry Perspe... locations/types of tasks where AI provides near-term value (structured/repetitiv...

‘Smarter’ AI agents are more profitable.

Measured profits earned by agents of different capability levels in the trading experiment and observed higher profits for higher-capability ('smarter') agents.

high positive Information Aggregation with AI Agents profits (agent-level earnings)

‘Smarter’ AI agents perform better at information aggregation.

Experimental comparison of AI agents with different capability levels ('smarter' vs. less smart) in the trading experiment; measured aggregation via log error of last price and found better performance for higher-capability agents.

high positive Information Aggregation with AI Agents information aggregation (log error of the last price)

Prediction markets are robust to cheap talk, market duration, initial price, and strategic prompting.

Synthesis of experimental results showing no change in aggregation performance across manipulations (cheap talk, duration, initial price, strategic prompting).

high positive Information Aggregation with AI Agents information aggregation (log error of the last price)

The median market is effective at aggregating information in the easy information structures.

Controlled laboratory experiment in which AI agents traded in prediction markets after receiving private signals; information aggregation measured by the log error of the last price; comparison across 'easy' information structures using median-market outcomes.

high positive Information Aggregation with AI Agents information aggregation (log error of the last price)
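
The aggregation metric named in these entries can be sketched as follows. The listing does not define the formula, so this assumes the log error is the absolute difference between the log of the last traded price and the log of a benchmark price. The function name, the benchmark value, and the example prices are all hypothetical.

```python
import math
from statistics import median


def log_error(last_price: float, benchmark_price: float) -> float:
    """Assumed definition: |ln(last price) - ln(benchmark price)|."""
    if last_price <= 0 or benchmark_price <= 0:
        raise ValueError("prices must be positive")
    return abs(math.log(last_price) - math.log(benchmark_price))


# Hypothetical last prices from five markets that share a benchmark of 0.60.
last_prices = [0.58, 0.61, 0.40, 0.63, 0.59]
errors = [log_error(p, 0.60) for p in last_prices]

# The median across markets is less sensitive to a single outlying market.
median_error = median(errors)
```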

Because misalignment can occur along each axis -- and affect stakeholders differently -- alignment cannot be 'solved' through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

Normative conclusion drawn from the three-axis framework and discussion of stakeholder impacts (conceptual policy prescription; no empirical testing reported).

high positive Relative Principals, Pluralistic Alignment, and the Structur... feasibility_of_technical_only_solutions

Alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values.

Theoretical and normative argument in the paper about pluralism and context-dependence of values (conceptual discussion; no empirical quantification).

high positive Relative Principals, Pluralistic Alignment, and the Structur... nature_of_alignment_solutions

The three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone.

Logical inference from the proposed decomposition and normative argument in the paper (theoretical reasoning; no empirical evidence).

high positive Relative Principals, Pluralistic Alignment, and the Structur... primary_domain_responsible_for_alignment

The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models, but is instead an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice.

Conceptual argument and analytic claim about the explanatory utility of the proposed framework (theoretical demonstration; no empirical tests reported).

high positive Relative Principals, Pluralistic Alignment, and the Structur... diagnostic_power_of_framework

Misalignment can be reconceptualised as arising along three interacting axes: objectives, information, and principals (drawing on the principal–agent framework).

Theoretical framing using the principal–agent framework; conceptual decomposition proposed in the paper (no empirical validation reported).

high positive Relative Principals, Pluralistic Alignment, and the Structur... sources_of_misalignment

The alignment problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost.

Normative and conceptual argument presented by the author proposing a governance-focused reconceptualization (theoretical analysis; no empirical data).

high positive Relative Principals, Pluralistic Alignment, and the Structur... interpretation_of_alignment_problem

SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories.

Description of the dataset collection infrastructure and pipeline provided in the paper; operational behavior asserted by authors.

high positive SWE-chat: Coding Agent Interactions From Real Users in the W... dataset collection process (automated, continual discovery from public repositor...

The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls.

Descriptive statistics reported by the authors based on their dataset collection pipeline (dataset metadata).

high positive SWE-chat: Coding Agent Interactions From Real Users in the W... dataset size (sessions, prompts, agent tool calls)

We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild.

Paper authorship / dataset description; dataset curated and presented by the paper as a contribution. No external validation provided in excerpt.

high positive SWE-chat: Coding Agent Interactions From Real Users in the W... existence and scale of the SWE-chat dataset (novel dataset release)

Statelessness is the load-bearing property explaining enterprises' preference for weaker but replayable retrieval pipelines, and DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

Synthesis/conclusion based on theoretical argument and empirical results presented (architectural analysis + experiments showing DPM performance and auditability).

high positive Stateless Decision Memory for Enterprise AI Agents trade-off between stateless architectures and decisioning performance / auditabi...

The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench.

Empirical measurement on LongHorizon-Bench reported in the paper: logged LLM calls per decision are 2 for DPM vs 83-97 for summarization.

high positive Stateless Decision Memory for Enterprise AI Agents number of LLM calls logged per decision (audit surface)

DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N.

Empirical runtime/efficiency measurement reported in the paper (range 7-15x speedup) comparing number of LLM calls and latency under tight memory budgets.

high positive Stateless Decision Memory for Enterprise AI Agents decision-time latency / number of LLM calls

At a 20x compression ratio, DPM improves reasoning coherence by +0.53 (Cohen's h=1.13, p=0.0034) compared to summarization-based memory (paired permutation, n=10).

Paired permutation test over 10 cases at a 20x compression ratio; reported effect +0.53 with Cohen's h=1.13 and p=0.0034.

high positive Stateless Decision Memory for Enterprise AI Agents reasoning coherence

At a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) compared to summarization-based memory (paired permutation, n=10).

Paired permutation test over 10 cases at a 20x compression ratio; reported effect +0.52 with Cohen's h=1.17 and p=0.0014.

high positive Stateless Decision Memory for Enterprise AI Agents factual precision
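
For readers unfamiliar with the statistics quoted in these entries, here is a minimal sketch of an exact paired permutation (sign-flip) test and of Cohen's h for two proportions. The paper's exact procedure is not described in this listing, so the two-sided sign-flip variant and the sample differences below are assumptions for illustration only, not the study's data.

```python
import math
from itertools import product


def paired_permutation_p(diffs: list[float]) -> float:
    """Exact two-sided sign-flip test on paired differences."""
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    extreme = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / 2 ** n


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical per-case score differences (method A minus method B), n = 10.
diffs = [0.6, 0.4, 0.7, 0.5, 0.3, 0.6, 0.5, 0.8, 0.4, 0.5]
p_value = paired_permutation_p(diffs)
```

With ten pairs the test enumerates 2^10 = 1024 sign assignments, so exact enumeration is cheap and no random resampling is needed.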

On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds.

Empirical evaluation on 10 decisioning cases across three memory budgets; comparison between DPM and summarization-based memory as reported in the paper (n=10).

high positive Stateless Decision Memory for Enterprise AI Agents relative performance (match/outperform) of DPM vs summarization-based memory acr...

We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time.

Method/architectural proposal described in the paper.

high positive Stateless Decision Memory for Enterprise AI Agents architecture design (DPM specification)
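
One way to picture the append-only log plus a single projection is the sketch below. It is an illustration under stated assumptions, not the paper's implementation: the `Event` fields, the `EventLog` class, and the tag-matching `project` function are hypothetical, and the real projection may be computed quite differently (for example, by an LLM call).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    seq: int
    tags: frozenset[str]
    text: str


class EventLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, text: str, tags: set[str]) -> Event:
        event = Event(seq=len(self._events), tags=frozenset(tags), text=text)
        self._events.append(event)
        return event

    def snapshot(self) -> tuple[Event, ...]:
        # Return an immutable copy so callers cannot mutate the log.
        return tuple(self._events)


def project(log: EventLog, task_tags: set[str], budget: int) -> list[Event]:
    """Deterministic, task-conditioned view: the most recent `budget` events
    that share at least one tag with the task, returned in log order."""
    relevant = [e for e in log.snapshot() if e.tags & task_tags]
    return relevant[-budget:] if budget > 0 else []


# Hypothetical events for a lending decision.
log = EventLog()
log.append("Applicant income verified at 82k", {"income"})
log.append("Customer asked about branch hours", {"support"})
log.append("Credit report pulled, score 710", {"credit"})
log.append("Income updated to 85k after bonus", {"income"})

view = project(log, task_tags={"income", "credit"}, budget=2)
```

Because the log is never mutated and this toy projection is a pure function of the log, the task, and the budget, replaying the same inputs yields the same view, which is the replayability property the surrounding claims emphasize.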

Presumptuousness in legal AI is systematic but addressable, and addressing it is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.

Synthesis conclusion in paper based on the benchmark experiments, comparisons across prompting methods, and SPEC results.

high positive Learning When Not to Decide: A Framework for Overcoming Fact... reliability of AI systems to support human judgment under insufficient evidence ...
