Evidence (5157 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	609	159	77	736	1615
Governance & Regulation	664	329	160	99	1273
Organizational Efficiency	624	143	105	70	949
Technology Adoption Rate	502	176	98	78	861
Research Productivity	348	109	48	322	836
Output Quality	391	120	44	40	595
Firm Productivity	385	46	85	17	539
Decision Quality	275	143	62	34	521
AI Safety & Ethics	183	241	59	30	517
Market Structure	152	154	109	20	440
Task Allocation	158	50	56	26	295
Innovation Output	178	23	38	17	257
Skill Acquisition	137	52	50	13	252
Fiscal & Macroeconomic	120	64	38	23	252
Employment Level	93	46	96	12	249
Firm Revenue	130	43	26	3	202
Consumer Welfare	99	51	40	11	201
Inequality Measures	36	105	40	6	187
Task Completion Time	134	18	6	5	163
Worker Satisfaction	79	54	16	11	160
Error Rate	64	78	8	1	151
Regulatory Compliance	69	64	14	3	150
Training Effectiveness	81	15	13	18	129
Wages & Compensation	70	25	22	6	123
Team Performance	74	16	21	9	121
Automation Exposure	41	48	19	9	120
Job Displacement	11	71	16	1	99
Developer Productivity	71	14	9	3	98
Hiring & Recruitment	49	7	8	3	67
Social Protection	26	14	8	2	50
Creative Output	26	14	6	2	49
Skill Obsolescence	5	37	5	1	48
Labor Share of Income	12	13	12	—	37
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Human Ai Collab Remove filter

Uncertainty-aware exploration (in algorithms) alters fairness metrics compared to policies that ignore uncertainty.

Results from simulation experiments compare uncertainty-aware exploration policies to baseline policies and report changes in fairness metrics (as described in the abstract and results).

high mixed Fairness under uncertainty in sequential decisions fairness metrics

For LLM agents, memory management critically impacts efficiency, quality, and security.

Statement in paper framing and motivation; supported conceptually by literature linking memory design to system properties (no specific experimental details provided in abstract).

high mixed FSFM: A Biologically-Inspired Framework for Selective Forget... efficiency, content quality, and security of LLM agents

The experimental findings are consistent with the paper's theoretical predictions.

Comparison reported in the paper between theoretical model predictions and observed outcomes from the controlled AI-agent trading experiments.

high mixed Information Aggregation with AI Agents consistency between theoretical predictions and experimental measures (e.g., agg...

Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.

Empirical analysis of authorship attribution across the 6,000 sessions in the SWE-chat dataset; percentages derived from session-level classification.

high mixed SWE-chat: Coding Agent Interactions From Real Users in the W... distribution of code authorship across sessions (agent-dominant vs human-only se...

A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.

Determinism experiment with 10 replays per case at temperature zero; qualitative/quantitative observation about number of nondeterministic LLM calls exposed by each architecture.

high mixed Stateless Decision Memory for Enterprise AI Agents system nondeterminism / number of nondeterministic LLM calls exposed per decisio...

Advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases.

Empirical comparison of prompting methods reported in paper: advanced prompts increased accuracy on inconclusive (insufficient-information) cases but led to excessive deferral/withholding on clear cases.

high mixed Learning When Not to Decide: A Framework for Overcoming Fact... accuracy on inconclusive cases and rate of withholding/deferral on clear cases

Multi-agent workflows and benchmark evaluation reveal current capabilities, limitations, and research frontiers in agentic AI for physical design.

The paper states it analyzes recent experience with multi-agent workflows and benchmark evaluation; the abstract does not provide specific benchmark names, metrics, or sample sizes.

high mixed Invited: Agentic AI for Physical Design R&D: Status and Pros... capabilities and limitations as identified via multi-agent workflows and benchma...

The study was a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities.

Methodological description in the paper stating preregistration, 7 LLMs, 12 scenarios; combined dataset included 3,360 AI advisory conversations and a 1,201-participant human benchmark.

high mixed Large Language Models Outperform Humans in Fraud Detection a... study design characteristics (models tested and scenario types)

Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.

Authors' recommendation in the paper's conclusion based on experimental findings (performance, workload, emotion, retention outcomes).

high mixed Fast and Forgettable: A Controlled Study of Novices' Perform... educational practice recommendation (pair programming vs AI-assisted instruction...

Formal network verification has made substantial progress in proving correctness properties but is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior.

Authors' summary of the state of the art in network verification (assertion in paper; no empirical data in abstract).

high mixed Aether: Network Validation Using Agentic AI and Digital Twin applicability of formal verification to live/continuous change

Results also reveal divergences between the two interaction scenario types.

Abstract statement that divergences vary across different interaction contexts / scenario types.

high mixed Imperfectly Cooperative Human-AI Interactions: Comparing the... scenario_specific_outcomes

Results reveal divergences between purely simulated and human study datasets.

Abstract reports that findings diverge between simulation experiments and the human-subjects dataset; comparisons drawn across the two datasets (simulation N=2000, human N=290).

high mixed Imperfectly Cooperative Human-AI Interactions: Comparing the... comparative_outcomes_between_datasets

Experienced developers maintain control through detailed delegation while novices struggle between over-reliance and cautious avoidance.

Observed behaviors and accounts from the AI-assisted debugging task (10 juniors) and senior participants in ACTA/Delphi and blind review phases (5 + 5 seniors).

high mixed From Junior to Senior: Allocating Agency and Navigating Prof... Control over AI tools (detailed delegation) vs patterns of novice behavior (over...

AI is not just changing how engineers code—it is reshaping who holds agency across work and professional growth.

Qualitative synthesis of findings across the three-phase study (Delphi with 5 seniors; debugging task with 10 juniors; blind reviews by 5 seniors).

high mixed From Junior to Senior: Allocating Agency and Navigating Prof... Distribution of agency (decision-making control) across roles and career develop...

How software developers interact with AI-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them.

Based on qualitative analysis of twenty-two interviews with software developers about using LLMs for software development; asserted as a central finding in the paper's analysis.

high mixed Towards an Appropriate Level of Reliance on AI: A Preliminar... impact of AI tools on developers (broadly: productivity, skills, quality)

No aggregation mechanism can simultaneously satisfy all desiderata of collective rationality (connection to Arrow's Impossibility Theorem); multi-agent deliberation navigates rather than resolves this constraint.

Theoretical argument connecting empirical multi-agent deliberation results to Arrow's Impossibility Theorem and observations that deliberation trades off competing desiderata rather than achieving all simultaneously.

high mixed Beyond Arrow's Impossibility: Fairness as an Emergent Proper... satisfiability of collective rationality desiderata under aggregation mechanisms

Alignment systematically shapes negotiation strategies and allocation patterns between agents.

Experimentally comparing negotiation behavior and allocation outcomes across agent pairs where one agent is aligned (via RAG) and the partner is either unaligned or adversarially prompted; patterns of strategy and allocation differences reported.

high mixed Beyond Arrow's Impossibility: Fairness as an Emergent Proper... negotiation strategies and resource allocation patterns

The design space articulates four configurations—No AI, Hidden AI, Translucent AI, and Visible AI—each trading off among accountability, autonomy, and coordination cost.

Conceptual taxonomy introduced in the paper (design artifact). No empirical evaluation or sample reported in the abstract; tradeoffs are argued theoretically.

high mixed Who Gets Credit? Operationalizing AI Disclosure as Epistemic... tradeoffs among accountability, autonomy, coordination cost under different disc...

CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.

Empirical evaluation comparing CLARITI and GPT-5 on a task set of underspecified software engineering issues; the result reported in the abstract indicates parity in resolution rate and a quantified reduction in questions (41%) but the abstract does not report sample size, test set composition, or statistical significance.

high mixed Asking What Matters: Reward-Driven Clarification for Softwar... resolution rate (task success) and number of clarifying questions generated

They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction.

Descriptive claim made in the text contrasting surface-level fluency with missing properties; no empirical data or experiments provided.

high mixed Governing Reflective Human-AI Collaboration: A Framework for... fluency vs. temporal_continuity, causal_feedback, real-world_anchoring

A within-subject human study with 20 players and 600 games shows that our interventions significantly improve performance for low- and mid-skill players while matching expert-engine interventions for high-skill players.

Within-subject human experiment reported in the paper: N = 20 players, 600 games total; comparisons of performance under the proposed interventions versus expert-engine interventions.

high mixed Improving Human Performance with Value-Aware Interventions: ... human player performance in chess games (game outcomes / performance metrics) by...

This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.

Paper's stated contribution presenting theory and conceptual groundwork; no empirical validation provided in the abstract.

high mixed The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... interaction between augmented cognitive performance and changes in self-percepti...

The LLM fallacy has implications for education, hiring, and AI literacy.

Implications and argumentation presented in the paper; these are prospective and conceptual rather than supported by empirical data in the abstract.

high mixed The LLM Fallacy: Misattribution in AI-Assisted Cognitive Wor... impacts on education practices, hiring decisions, and AI literacy needs

Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success.

Qualitative and quantitative comparisons from the deployed evaluation across the three conditions (observations about turn counts, validation-feedback loops, and model hallucinations in unconstrained condition over the 25 scenario trials).

high mixed Bounded Autonomy for Enterprise AI: Typed Action Contracts a... number of interaction turns to correct outcome; presence of hallucinated success

AI plays a dual role as enhancer and eroder, simultaneously strengthening performance while eroding underlying expertise (the 'AI-as-Amplifier Paradox').

Framing claim presented in the paper's conceptual argument and grounded by the paper's stated year-long empirical study among cancer specialists (no numerical sample size reported in abstract).

high mixed From Future of Work to Future of Workers: Addressing Asympto... preservation of underlying expertise vs. short-term performance

Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance.

Paper reports instances where top-performing (frontier) models outperform aggregate human expert accuracy on SciPredict, but concludes overall accuracies are insufficient for reliable experimental guidance.

high mixed SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... prediction_accuracy / usability_for_guidance

The local labor market will follow a dual trajectory: low-skill, routine jobs face high automation risk while demand will rise for AI-collaborative, higher-skill roles.

Paper's analytical prediction based on distinguishing current job roles into routine/repetitive vs cognitive/non-routine and projecting likely impacts; no numeric forecasts or sample sizes provided in the excerpt.

high mixed PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... combined job displacement for routine roles and increased demand for AI-collabor...

Subjectivity persisted in AI-powered recruitment decisions; human judgment remained an important factor.

Theme 2 (subjectivity in AI-powered recruitment) from interviews indicating retained human subjectivity and judgement in recruitment processes (n = 22).

high mixed The augmented recruiter: examining AI integration and decisi... degree_of_subjectivity_in_decision_making

Sensitivity analyses indicate the observed positive belief changes likely reflect recovery from carry-over effects rather than genuine training-induced shifts.

Authors' sensitivity analyses discussed in the paper that examined alternative explanations (e.g., carry-over effects) and concluded the belief-change result is likely due to recovery from such effects.

high mixed Scaffolding Human-AI Collaboration: A Field Experiment on Be... validity of belief-change effect (source attribution: training vs. carry-over re...

We ran two large preregistered experiments (N=17,950 responses from 14,779 people) using conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity.

Statement in paper reporting two preregistered experiments, sample sizes (17,950 responses; 14,779 people), use of conversational AI models, and target outcomes including petition signing and charitable donations.

high mixed Artificial intelligence can persuade people to take politica... use of conversational AI to persuade participants on attitudinal and behavioral ...

Bounded agents act as an amplifying but not necessary extension to the foundation-model stack for changing work coordination.

Conceptual argument within the paper distinguishing bounded agents from the core stack; no empirical comparison or measurement reported.

high mixed Remote-Capable Knowledge Work Should Default to AI-Enabled F... role of bounded agents in amplifying coordination impacts

The effects of generative AI on work and organisations are heterogeneous and context-dependent, shaped by job roles, skill levels, and institutional environments.

Synthesis across the included studies noting variation in outcomes conditional on role, skill, and institutional context.

high mixed Generative AI in the Workplace: A Systematic Review of Produ... heterogeneity of AI effects across roles/skills/institutions

Although the concurrent paradigm performs worse than the sequential paradigm in terms of immediate task performance, it is more effective in promoting users' emotional trust.

Comparison between concurrent and sequential AI-assisted decision-making paradigms in the RCT (N=120); authors report concurrent < sequential for immediate task performance, but concurrent > sequential for emotional trust.

high mixed How AI-Assisted Decision-Making Paradigms and Explainability... immediate task performance (negative) and emotional trust (positive)

AI adoption outcomes depend on organizational routines, data arrangements, accountability structures, and public values.

Empirical and theoretical literature review and argument in the article drawing on scholarship in digital government and public-sector technology adoption.

high mixed Governing frontier general-purpose AI in the public sector: ... determinants of AI adoption in government (organizational, data, accountability,...

Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate.

Qualitative analysis of participant feedback from the study (n=36) reporting themes of improved comprehension and occasional problems when the assistant misinterpreted gaze.

high mixed From Gaze to Guidance: Interpreting and Adapting to Users' C... participant-reported benefits and challenges (qualitative themes)

The productivity decomposition classifies deployments into five regimes that separate beneficial adoption from harmful adoption and identifies which deployments are vulnerable to the augmentation trap.

Model-based taxonomy produced from the analytical decomposition (classification into five regimes described in the paper).

high mixed The Augmentation Trap: AI Productivity and the Cost of Cogni... classification of AI deployment regimes (beneficial vs harmful, vulnerability to...

Small differences in managerial incentives can determine which skill path a worker takes (whether they realize full potential or deskill).

Comparative statics / theoretical sensitivity analysis in the dynamic model indicating tipping behavior based on managerial incentives.

high mixed The Augmentation Trap: AI Productivity and the Cost of Cogni... worker skill trajectory contingent on managerial incentives

Result 3: When AI productivity depends less on worker expertise, workers can permanently diverge in skill: experienced workers realize their full potential while less experienced workers deskill to zero.

Analytical result from the dynamic model showing path-dependent divergence in skill levels under particular parameterizations (lower dependence of AI on worker expertise).

high mixed The Augmentation Trap: AI Productivity and the Cost of Cogni... long-run worker skill distribution (experienced vs less experienced)

Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest.

SAFI benchmark results reported for specific O*NET skills (numerical SAFI scores provided in the paper).

high mixed The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... SAFI score by skill (automation feasibility)

The rise of agentic AI development, where LLM-based agents autonomously read, write, navigate, and debug codebases, introduces a new primary consumer with fundamentally different constraints.

Conceptual claim argued in the paper; refers to the emergence of agentic LLM-based tools as new consumers of software artifacts rather than an empirical measurement; no sample size reported.

high mixed Beyond Human-Readable: Rethinking Software Engineering Conve... who/what is the primary consumer of software engineering artifacts (human develo...

Analysis uncovers dramatic asymmetries: inhibition 17.6% vs. preference 75.0%.

Paper reports specific aggregated percentages for two types of implicit effects (inhibition and preference) observed in their analysis; methodology context implies these are results from the benchmark evaluation (300 items / 17 models).

high mixed ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... rates of inhibition vs. preference effects (implicit memory outcomes)

The effects of generative AI depend not only on the technology itself, but also the behavioral strategies and incentive structures surrounding its use.

Synthesis and interpretation of RCT results showing interactions between incentive structure and AI-use patterns (no formal interaction coefficients or sample details provided in excerpt).

high mixed Incentives shape how humans co-create with generative AI impact of incentives and strategies on AI outcomes

Through a pre-registered randomized control trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively.

Pre-registered randomized controlled trial (experimental design) conducted on a creative writing task with interactive AI use (details such as sample size not provided in excerpt).

high mixed Incentives shape how humans co-create with generative AI extent to which incentives alter AI's homogenizing effect (mediating effect)

By conceptualizing the emergence of a posthuman economy, this study contributes to interdisciplinary debates on artificial intelligence, digital capitalism, and the transformation of economic organization.

Author-stated contribution of the paper based on conceptual/theoretical work; no empirical validation reported.

high mixed Algorithmic Agency and the Posthuman Economy: Artificial Int... conceptual contribution to interdisciplinary academic debates on AI and economic...

Contemporary organizations operate within hybrid intelligence environments where human expertise and algorithmic systems collaboratively produce economic knowledge, prediction, and action.

Theoretical synthesis using posthumanist and socio-technical perspectives within the paper; no empirical measurement or sample provided.

high mixed Algorithmic Agency and the Posthuman Economy: Artificial Int... presence of hybrid intelligence environments and collaborative human-algorithmic...

This article develops the concept of algorithmic agency to explain how artificial intelligence participates in economic decision-making within modern business systems.

Author's conceptual contribution described in the paper (theoretical development), no empirical testing reported.

high mixed Algorithmic Agency and the Posthuman Economy: Artificial Int... conceptual account of AI participation in economic decision-making (algorithmic ...

Emerging posthumanist scholarship suggests a deeper transformation in which economic agency itself becomes distributed across human and algorithmic actors.

Synthesis of posthumanist scholarship and theoretical literature cited in the paper; conceptual rather than empirical evidence.

high mixed Algorithmic Agency and the Posthuman Economy: Artificial Int... distribution of economic agency across human and algorithmic actors

Artificial intelligence is fundamentally reshaping contemporary economic systems as algorithmic infrastructures increasingly participate in interpreting information, generating predictions, and influencing organizational decision-making.

Conceptual argument in the paper drawing on posthumanist theory, socio-technical research, and digital economy scholarship; no empirical sample or quantitative data reported.

high mixed Algorithmic Agency and the Posthuman Economy: Artificial Int... extent to which algorithmic infrastructures participate in organizational inform...

These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.

Authors' policy/research recommendation based on experimental findings showing short-term gains but longer-term harms.

high mixed AI Assistance Reduces Persistence and Hurts Independent Perf... recommendation for AI development priorities (design objective, not an empirical...

These effects are observed across a variety of tasks, including mathematical reasoning and reading comprehension.

Trials included multiple task types (explicitly naming mathematical reasoning and reading comprehension); cross-task analysis reported.

high mixed AI Assistance Reduces Persistence and Hurts Independent Perf... task-specific performance and persistence across task types (math reasoning, rea...

« Prev 1 2 3 4 … 103 104 Next »