Evidence (5157 claims)

Claim counts by topic:

| Topic | Claims |
|---|---|
| Adoption | 7395 |
| Productivity | 6507 |
| Governance | 5877 |
| Human-AI Collaboration | 5157 |
| Innovation | 3492 |
| Org Design | 3470 |
| Labor Markets | 3224 |
| Skills & Training | 2608 |
| Inequality | 1835 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
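The direction shares implied by the matrix can be recomputed from its rows. A minimal sketch using three rows quoted from the table above (note the Total column is used directly, since the four listed directions do not always sum exactly to the reported totals):

```python
# Positive-finding share for a few Evidence Matrix rows.
# Each entry is (positive claims, total claims) from the table above.
rows = {
    "Firm Productivity": (385, 539),
    "AI Safety & Ethics": (183, 517),
    "Job Displacement": (11, 99),
}

for outcome, (pos, total) in rows.items():
    share = pos / total  # fraction of claims with a positive finding
    print(f"{outcome}: {share:.0%} of claims report a positive finding")
```

This reproduces the qualitative pattern in the matrix: productivity outcomes skew positive while displacement outcomes skew negative.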
Active filter: Human-AI Collaboration
Uncertainty-aware exploration policies alter fairness metrics relative to policies that ignore uncertainty.
Results from simulation experiments compare uncertainty-aware exploration policies to baseline policies and report changes in fairness metrics (as described in the abstract and results).
For LLM agents, memory management critically impacts efficiency, quality, and security.
Statement in paper framing and motivation; supported conceptually by literature linking memory design to system properties (no specific experimental details provided in abstract).
The experimental findings are consistent with the paper's theoretical predictions.
Comparison reported in the paper between theoretical model predictions and observed outcomes from the controlled AI-agent trading experiments.
Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.
Empirical analysis of authorship attribution across the 6,000 sessions in the SWE-chat dataset; percentages derived from session-level classification.
A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.
Determinism experiment with 10 replays per case at temperature zero; qualitative/quantitative observation about number of nondeterministic LLM calls exposed by each architecture.
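The compounding effect described above can be made concrete: if each exposed LLM API call independently diverges on replay with probability p (an illustrative assumption; the paper reports only the number of exposed calls, not a per-call rate), an architecture exposing one call diverges with probability p, while one exposing N calls diverges with probability 1 - (1 - p)^N. A minimal sketch:

```python
# Replay-divergence probability under independent per-call
# nondeterminism p (illustrative assumption, not a figure from the
# paper: DPM exposes one call, summarization exposes N).
def divergence_prob(p: float, n_calls: int) -> float:
    """Probability that at least one of n_calls LLM calls diverges."""
    return 1 - (1 - p) ** n_calls

p = 0.05  # hypothetical per-call divergence rate
print(divergence_prob(p, 1))   # single exposed call (DPM-like)
print(divergence_prob(p, 10))  # ten compounding calls (summarization-like)
```

Even a small per-call rate compounds quickly across N calls, which is why the summarization architecture's nondeterminism accumulates while DPM's stays bounded by a single call.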
Advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases.
Empirical comparison of prompting methods reported in paper: advanced prompts increased accuracy on inconclusive (insufficient-information) cases but led to excessive deferral/withholding on clear cases.
Multi-agent workflows and benchmark evaluation reveal current capabilities, limitations, and research frontiers in agentic AI for physical design.
The paper states it analyzes recent experience with multi-agent workflows and benchmark evaluation; the abstract does not provide specific benchmark names, metrics, or sample sizes.
The study was a preregistered experiment across seven leading LLMs and twelve investment scenarios covering legitimate, high-risk, and objectively fraudulent opportunities.
Methodological description in the paper stating preregistration, 7 LLMs, 12 scenarios; combined dataset included 3,360 AI advisory conversations and a 1,201-participant human benchmark.
Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.
Authors' recommendation in the paper's conclusion based on experimental findings (performance, workload, emotion, retention outcomes).
Formal network verification has made substantial progress in proving correctness properties but is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior.
Authors' summary of the state of the art in network verification (assertion in paper; no empirical data in abstract).
Results also reveal divergences between the two interaction scenario types.
Abstract statement that divergences vary across different interaction contexts / scenario types.
Results reveal divergences between purely simulated and human study datasets.
Abstract reports that findings diverge between simulation experiments and the human-subjects dataset; comparisons drawn across the two datasets (simulation N=2000, human N=290).
Experienced developers maintain control through detailed delegation while novices struggle between over-reliance and cautious avoidance.
Observed behaviors and accounts from the AI-assisted debugging task (10 juniors) and senior participants in ACTA/Delphi and blind review phases (5 + 5 seniors).
AI is not just changing how engineers code—it is reshaping who holds agency across work and professional growth.
Qualitative synthesis of findings across the three-phase study (Delphi with 5 seniors; debugging task with 10 juniors; blind reviews by 5 seniors).
How software developers interact with AI-powered tools, including Large Language Models (LLMs), plays a vital role in how these tools impact them.
Based on qualitative analysis of twenty-two interviews with software developers about using LLMs for software development; asserted as a central finding in the paper's analysis.
No aggregation mechanism can simultaneously satisfy all desiderata of collective rationality (connection to Arrow's Impossibility Theorem); multi-agent deliberation navigates rather than resolves this constraint.
Theoretical argument connecting empirical multi-agent deliberation results to Arrow's Impossibility Theorem and observations that deliberation trades off competing desiderata rather than achieving all simultaneously.
Alignment systematically shapes negotiation strategies and allocation patterns between agents.
Experimentally comparing negotiation behavior and allocation outcomes across agent pairs where one agent is aligned (via RAG) and the partner is either unaligned or adversarially prompted; patterns of strategy and allocation differences reported.
The design space articulates four configurations—No AI, Hidden AI, Translucent AI, and Visible AI—each trading off among accountability, autonomy, and coordination cost.
Conceptual taxonomy introduced in the paper (design artifact). No empirical evaluation or sample reported in the abstract; tradeoffs are argued theoretically.
CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.
Empirical evaluation comparing CLARITI and GPT-5 on a task set of underspecified software engineering issues; the result reported in the abstract indicates parity in resolution rate and a quantified reduction in questions (41%) but the abstract does not report sample size, test set composition, or statistical significance.
They can produce fluent outputs that resemble reflection, but lack temporal continuity, causal feedback, and anchoring in real-world interaction.
Descriptive claim made in the text contrasting surface-level fluency with missing properties; no empirical data or experiments provided.
A within-subject human study with 20 players and 600 games shows that our interventions significantly improve performance for low- and mid-skill players while matching expert-engine interventions for high-skill players.
Within-subject human experiment reported in the paper: N = 20 players, 600 games total; comparisons of performance under the proposed interventions versus expert-engine interventions.
This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.
Paper's stated contribution presenting theory and conceptual groundwork; no empirical validation provided in the abstract.
The LLM fallacy has implications for education, hiring, and AI literacy.
Implications and argumentation presented in the paper; these are prospective and conceptual rather than supported by empirical data in the abstract.
Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success.
Qualitative and quantitative comparisons from the deployed evaluation across the three conditions (observations about turn counts, validation-feedback loops, and model hallucinations in unconstrained condition over the 25 scenario trials).
AI plays a dual role as enhancer and eroder, simultaneously strengthening performance while eroding underlying expertise (the 'AI-as-Amplifier Paradox').
Framing claim presented in the paper's conceptual argument and grounded by the paper's stated year-long empirical study among cancer specialists (no numerical sample size reported in abstract).
Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance.
Paper reports instances where top-performing (frontier) models outperform aggregate human expert accuracy on SciPredict, but concludes overall accuracies are insufficient for reliable experimental guidance.
The local labor market will follow a dual trajectory: low-skill, routine jobs face high automation risk while demand will rise for AI-collaborative, higher-skill roles.
Paper's analytical prediction based on distinguishing current job roles into routine/repetitive vs cognitive/non-routine and projecting likely impacts; no numeric forecasts or sample sizes provided in the excerpt.
Subjectivity persisted in AI-powered recruitment decisions; human judgment remained an important factor.
Theme 2 (subjectivity in AI-powered recruitment) from interviews indicating retained human subjectivity and judgment in recruitment processes (n = 22).
Sensitivity analyses indicate the observed positive belief changes likely reflect recovery from carry-over effects rather than genuine training-induced shifts.
Authors' sensitivity analyses discussed in the paper that examined alternative explanations (e.g., carry-over effects) and concluded the belief-change result is likely due to recovery from such effects.
We ran two large preregistered experiments (N=17,950 responses from 14,779 people) using conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity.
Statement in paper reporting two preregistered experiments, sample sizes (17,950 responses; 14,779 people), use of conversational AI models, and target outcomes including petition signing and charitable donations.
Bounded agents are an amplifying, but not a necessary, extension of the foundation-model stack in changing work coordination.
Conceptual argument within the paper distinguishing bounded agents from the core stack; no empirical comparison or measurement reported.
The effects of generative AI on work and organisations are heterogeneous and context-dependent, shaped by job roles, skill levels, and institutional environments.
Synthesis across the included studies noting variation in outcomes conditional on role, skill, and institutional context.
Although the concurrent paradigm performs worse than the sequential paradigm in terms of immediate task performance, it is more effective in promoting users' emotional trust.
Comparison between concurrent and sequential AI-assisted decision-making paradigms in the RCT (N=120); authors report concurrent < sequential for immediate task performance, but concurrent > sequential for emotional trust.
AI adoption outcomes depend on organizational routines, data arrangements, accountability structures, and public values.
Empirical and theoretical literature review and argument in the article drawing on scholarship in digital government and public-sector technology adoption.
Qualitative results underscored both perceived benefits in comprehension and challenges when interpretations of gaze behaviors were inaccurate.
Qualitative analysis of participant feedback from the study (n=36) reporting themes of improved comprehension and occasional problems when the assistant misinterpreted gaze.
The productivity decomposition classifies deployments into five regimes that separate beneficial adoption from harmful adoption and identifies which deployments are vulnerable to the augmentation trap.
Model-based taxonomy produced from the analytical decomposition (classification into five regimes described in the paper).
Small differences in managerial incentives can determine which skill path a worker takes (whether they realize full potential or deskill).
Comparative statics / theoretical sensitivity analysis in the dynamic model indicating tipping behavior based on managerial incentives.
Result 3: When AI productivity depends less on worker expertise, workers can permanently diverge in skill: experienced workers realize their full potential while less experienced workers deskill to zero.
Analytical result from the dynamic model showing path-dependent divergence in skill levels under particular parameterizations (lower dependence of AI on worker expertise).
Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest.
SAFI benchmark results reported for specific O*NET skills (numerical SAFI scores provided in the paper).
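The reported extremes follow directly from ranking the quoted scores. A minimal sketch using only the four SAFI values given above:

```python
# Rank the four O*NET skills by the SAFI scores quoted above
# (higher score = higher automation feasibility).
safi = {
    "Mathematics": 73.2,
    "Programming": 71.8,
    "Active Listening": 42.2,
    "Reading Comprehension": 45.5,
}

ranked = sorted(safi.items(), key=lambda kv: kv[1], reverse=True)
print("highest:", ranked[0])   # ('Mathematics', 73.2)
print("lowest:", ranked[-1])   # ('Active Listening', 42.2)
```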
The rise of agentic AI development, where LLM-based agents autonomously read, write, navigate, and debug codebases, introduces a new primary consumer with fundamentally different constraints.
Conceptual claim argued in the paper; refers to the emergence of agentic LLM-based tools as new consumers of software artifacts rather than an empirical measurement; no sample size reported.
Analysis uncovers dramatic asymmetries: inhibition 17.6% vs. preference 75.0%.
Paper reports specific aggregated percentages for two types of implicit effects (inhibition and preference) observed in their analysis; methodology context implies these are results from the benchmark evaluation (300 items / 17 models).
The effects of generative AI depend not only on the technology itself, but also on the behavioral strategies and incentive structures surrounding its use.
Synthesis and interpretation of RCT results showing interactions between incentive structure and AI-use patterns (no formal interaction coefficients or sample details provided in excerpt).
Through a pre-registered randomized control trial, we show that incentives mediate AI's homogenizing force in a creative writing task where participants can use AI interactively.
Pre-registered randomized controlled trial (experimental design) conducted on a creative writing task with interactive AI use (details such as sample size not provided in excerpt).
By conceptualizing the emergence of a posthuman economy, this study contributes to interdisciplinary debates on artificial intelligence, digital capitalism, and the transformation of economic organization.
Author-stated contribution of the paper based on conceptual/theoretical work; no empirical validation reported.
Contemporary organizations operate within hybrid intelligence environments where human expertise and algorithmic systems collaboratively produce economic knowledge, prediction, and action.
Theoretical synthesis using posthumanist and socio-technical perspectives within the paper; no empirical measurement or sample provided.
This article develops the concept of algorithmic agency to explain how artificial intelligence participates in economic decision-making within modern business systems.
Author's conceptual contribution described in the paper (theoretical development), no empirical testing reported.
Emerging posthumanist scholarship suggests a deeper transformation in which economic agency itself becomes distributed across human and algorithmic actors.
Synthesis of posthumanist scholarship and theoretical literature cited in the paper; conceptual rather than empirical evidence.
Artificial intelligence is fundamentally reshaping contemporary economic systems as algorithmic infrastructures increasingly participate in interpreting information, generating predictions, and influencing organizational decision-making.
Conceptual argument in the paper drawing on posthumanist theory, socio-technical research, and digital economy scholarship; no empirical sample or quantitative data reported.
These results suggest the need for AI model development to prioritize scaffolding long-term competence alongside immediate task completion.
Authors' policy/research recommendation based on experimental findings showing short-term gains but longer-term harms.
These effects are observed across a variety of tasks, including mathematical reasoning and reading comprehension.
Trials included multiple task types (explicitly naming mathematical reasoning and reading comprehension); cross-task analysis reported.