Evidence (6491 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
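The direction columns are easier to compare as shares within each row. The following is a minimal sketch using four rows copied from the table; the helper name `positive_share` is illustrative and not part of the source. Because the Total column can exceed the sum of the four direction columns (for example, in the Other row), the share here is computed over the sum of the four direction counts rather than over Total.

```python
# Rows copied from the table: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20),
    "AI Safety & Ethics": (218, 277, 65, 33),
    "Job Displacement": (12, 80, 20, 1),
    "Error Rate": (69, 92, 10, 2),
}


def positive_share(counts):
    """Fraction of a row's directional claims that are tagged positive."""
    total = sum(counts)
    return counts[0] / total if total else 0.0


for outcome, counts in ROWS.items():
    print(f"{outcome}: {positive_share(counts):.1%} positive")
```

The same helper works for any row; swapping `counts[0]` for another index gives the negative, mixed, or null share.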
Human-AI Collaboration
Using a minimal general-equilibrium model with autonomy-conditioned welfare, welfare-status assignment, delegation accounting, and verification institutions, we set out conditions under which an autonomy-complete competitive equilibrium is autonomy-Pareto efficient.
Formal theoretical development and derivation in a minimal general-equilibrium model described in the paper (mathematical/modeling evidence; no empirical sample).
The First Fundamental Theorem should carry an autonomy qualification that incorporates the impact of changes in autonomy assumptions.
Normative prescription based on the paper's conceptual critique and modeling agenda; supported by theoretical reasoning rather than empirical testing.
Results indicate that AI-assisted text-to-model methods can substantially lower the cost of constructing structured procedural representations, making course-wide deployment of structured AI coaching systems practically feasible.
Conclusion drawn from reported results (e.g., time reductions and modeled outputs); the paper claims that these results imply lower costs and practical feasibility for course-wide deployment.
AI-assisted authoring reduced expert modeling time by 50–70% while producing structurally valid and highly reproducible models under fixed-input conditions.
Quantitative claim reported in the paper comparing expert modeling time with AI assistance and reporting structural validity and reproducibility under fixed-input conditions; exact experimental setup and sample size not stated in the abstract.
We apply the pipeline to instructional materials from a graduate-level online AI course, constructing 23 procedural skill models.
Empirical application reported in the paper: the pipeline was run on course materials and produced 23 models (number explicitly stated).
The approach automates structural scaffolding while preserving expert oversight for validating causal transitions and failure conditions.
Claim about system design and human-in-the-loop workflow reported in the paper; implies human validation steps are maintained alongside automated generation.
We present a human-in-the-loop text-to-model pipeline that uses large language models to transform instructional materials into schema-complete Task-Method-Knowledge models via ontology-constrained prompting and template-based generation.
Methodological contribution described in the paper; pipeline design and implementation reported (no separate quantitative validation in this sentence).
When unfairness is driven by uncertainty (rather than incidental noise), accounting for uncertainty is essential to achieving fair and effective decision-making.
Synthesis/argument based on formalization and simulation experiments showing cases where uncertainty causes unfair outcomes and methods that account for uncertainty mitigate those outcomes.
The proposed framework can help practitioners diagnose, audit, and govern fairness risks in socio-technical decision systems.
Authors propose a diagnostic/audit/governance framework (conceptual contribution) and illustrate its use through examples and simulations; no field deployment evidence provided in the abstract.
Algorithmic examples in the paper demonstrate it is possible to reduce outcome variance for disadvantaged groups while preserving institutional objectives such as expected utility.
Algorithmic examples and simulation experiments reported in the paper demonstrating reductions in outcome variance for disadvantaged groups together with preserved expected utility (results from synthetic/simulated data and model runs).
The authors formalize model and feedback uncertainty using counterfactual logic and reinforcement learning.
Paper describes formalization/mathematical definitions linking counterfactual logic and reinforcement learning to model and feedback uncertainty (theoretical/methodological contribution).
This paper introduces a taxonomy of uncertainty in sequential decision-making consisting of three types: model uncertainty, feedback uncertainty, and prediction uncertainty.
Paper presents a conceptual taxonomy and names the three uncertainty types in the text/abstract; theoretical exposition in the methods/definitions sections (no external empirical sample required).
Humble leadership indirectly alleviates the negative indirect effect of HAI-C task complexity on work engagement by enhancing employees' AI self-efficacy.
Reported moderated mediation/conditional process findings from hierarchical regression and bootstrapping on the three-wave matched sample of 497 employees.
AI self-efficacy mitigates (buffers) the negative indirect impact of HAI-C task complexity on employees' work engagement.
Moderated mediation analysis conducted on longitudinal survey data (n=497) using hierarchical regression and bootstrapping; reported in Results that AI self-efficacy weakens the negative indirect effect.
HAI-C task complexity increases employees' HAI-C tech-learning anxiety.
Longitudinal survey data (n=497) analyzed with hierarchical regression; reported as a finding in the Results that task complexity amplifies tech-learning anxiety.
When models err, their incorrect predictions disproportionately lean intervention-oriented.
Error analysis of model predictions showing that among incorrect predictions, a larger share favor intervention-oriented causal signs than market-oriented ones (directional skew in errors).
Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones.
Model-by-model accuracy comparison broken down by whether the empirically verified causal sign aligns with intervention-oriented vs market-oriented expectations; observed higher accuracy for intervention-aligned cases in 18/20 models.
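The breakdown described above can be sketched as a grouped accuracy computation for a single model. This is a hedged illustration only: the records are made-up toy values, not the paper's data, and the labels `"intervention"` and `"market"` and the function name are assumptions introduced here.

```python
from collections import defaultdict

# Toy records (NOT from the paper):
# (which expectation the verified causal sign aligns with, was the prediction correct?)
records = [
    ("intervention", True), ("intervention", True), ("intervention", False),
    ("market", True), ("market", False), ("market", False),
]


def accuracy_by_alignment(rows):
    """Return accuracy separately for each alignment group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for alignment, correct in rows:
        totals[alignment] += 1
        hits[alignment] += int(correct)
    return {group: hits[group] / totals[group] for group in totals}


acc = accuracy_by_alignment(records)
# A positive gap means higher accuracy on intervention-aligned cases.
gap = acc["intervention"] - acc["market"]
print(acc, gap)
```

Repeating this per model and counting how many models show a positive gap would give a tally of the kind reported (18 of 20).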
GenAI-related benefits are likely to materialize only when AI capabilities are embedded in standardized routines, integrated data infrastructures, and cross-functional governance arrangements (organizational embedding).
Paper's synthesized process model and interpretive case evidence from the three firms indicating organizational conditions required for observed/documented AI effects.
GenAI-related capabilities enhance analysis by translating complex data into more interpretable, scenario-sensitive, and action-oriented outputs (analytical augmentation).
Interpretive finding from analysis of disclosures and literature; presented as a second linked mechanism through which GenAI may influence management accounting.
GenAI-related capabilities broaden the informational basis of management accounting by making operational, service, quality, and ecosystem data more usable in planning and control (information enrichment).
Interpretive inference from corporate disclosures of the three firms and review of AI-and-accounting literature; described as a primary mechanism in the paper.
Meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.
Authors' conclusion drawn from the formative (N=8) and summative (N=16) studies and associated observations.
Users reported a sense of co-ownership over the resulting output.
Participant self-reports from the formative and/or summative studies (authors report users expressed co-ownership of outputs when participating in execution).
Users detected errors that post-hoc review would have failed to surface.
Empirical observation reported from the studies (authors report that active participation allowed users to detect errors that would be missed by post-hoc review).
Users identified their own intent reflected in the agent's actions.
Reported participant observations/self-reports from the formative (N=8) and/or summative (N=16) studies; claim presented as a finding of the evaluations.
A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow.
Empirical evaluation consisting of a formative study with N=8 and a within-subjects summative evaluation with N=16 comparing Pista to a baseline agent (authors report influence on task outcomes, comprehension, perception, and role).
We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step.
System description / design contribution presented by the authors (implementation description rather than empirical evidence).
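The idea of decomposing execution into auditable, controllable actions can be illustrated with a small approval loop. This is a sketch under stated assumptions, not Pista's implementation: the `Action` class, the `execute_with_oversight` function, the dict standing in for a spreadsheet, and the approval callback are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    description: str              # shown to the user before the step runs
    run: Callable[[dict], None]   # mutates the sheet (a plain dict in this sketch)


def execute_with_oversight(actions, sheet, approve):
    """Run actions one at a time, skipping any the user rejects; return an audit log."""
    log = []
    for action in actions:
        if approve(action):
            action.run(sheet)
            log.append(("applied", action.description))
        else:
            log.append(("skipped", action.description))
    return log


sheet = {"A1": 2, "A2": 3}
plan = [
    Action("Set A3 to A1 + A2", lambda s: s.update(A3=s["A1"] + s["A2"])),
    Action("Clear A1", lambda s: s.pop("A1")),
]
# Stand-in for the user: approve everything except destructive steps.
log = execute_with_oversight(plan, sheet, lambda a: not a.description.startswith("Clear"))
print(sheet, log)
```

The point of the design is that the user's decision happens before each step executes, and the log records both applied and skipped steps.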
Selective forgetting should be considered a fundamental capability for next-generation LLM agents operating in real-world, resource-constrained scenarios.
Conclusion/argument in paper based on conceptual analysis and reported empirical benefits.
The work bridges cognitive neuroscience (hippocampal indexing/consolidation theory and Ebbinghaus forgetting curve) and AI systems to inform forgetting mechanisms.
Claimed theoretical grounding and cross-disciplinary framing in paper (stated in abstract).
Empirical results on security performance show 100% elimination of security risks.
Reported experimental result in abstract claiming full elimination of security risks.
Empirical results show content quality improved, with a 29.2% gain in signal-to-noise ratio.
Reported experimental result in abstract (signal-to-noise ratio improvement).
Empirical results show access efficiency improved by 8.49%.
Reported experimental result in abstract.
Building on advances in LLM agent architectures and vector databases, the paper presents detailed specifications, implementation strategies, and empirical validation from controlled experiments.
Methodological claim in abstract indicating implementation and controlled experiments (no experimental details in abstract).
Selective forgetting improves security through active forgetting of malicious inputs, sensitive data, and privacy-compromising content.
Authors' taxonomy and safety-triggered forgetting mechanism; abstract reports empirical security performance (100% elimination of security risks).
Selective forgetting improves content quality by dynamically updating outdated preferences and context.
Conceptual claim supported by authors' implementation and empirical validation; abstract reports content quality improvement (signal-to-noise ratio).
A well-designed forgetting mechanism improves efficiency via intelligent memory pruning.
Claim supported by authors' framework and controlled experiments reported in the paper (abstract references empirical results for access efficiency).
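Decay-based memory pruning of the kind these claims describe can be sketched as follows. The abstract does not give the paper's formula, so this example assumes the common exponential approximation of the Ebbinghaus forgetting curve, R = exp(-t / S); the `Memory` fields, the `strength` parameter, and the 0.3 threshold are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age_hours: float   # time since the item was last accessed
    strength: float    # larger values decay more slowly (hypothetical parameter)


def retention(m):
    """Exponential retention R = exp(-t / S), an assumed form of the forgetting curve."""
    return math.exp(-m.age_hours / m.strength)


def prune(memories, threshold=0.3):
    """Keep only memories whose estimated retention is at or above the threshold."""
    return [m for m in memories if retention(m) >= threshold]


store = [
    Memory("user prefers metric units", age_hours=2.0, strength=48.0),
    Memory("one-off debug token", age_hours=72.0, strength=12.0),
]
kept = prune(store)
print([m.text for m in kept])
```

A safety-triggered variant would remove flagged items regardless of retention; that branch is omitted here because the source does not describe its trigger conditions.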
In resource-constrained environments, a well-designed forgetting mechanism is as crucial as remembering.
Argument and conceptual analysis in paper; motivated by theoretical considerations and (claimed) empirical validation.
The findings point to a staged progression of AI utility from low-consequence assistance toward higher-order automation, as trust, infrastructure, and verification mature.
Synthesis of interview responses (over 30) indicating current use cases are lower-risk assistance and that stakeholders expect (or prefer) gradual progression toward automation contingent on trust/infrastructure/verification improvements.
Reliability, verification, and auditability are central requirements for adoption, driving human-in-the-loop frameworks and governance aligned with existing engineering reviews.
Consistent themes from interviews (over 30) indicating stakeholders prioritize reliability, verifiability, and audit trails, leading to preference for human-in-the-loop designs integrated with current review processes.
Higher-value agentic gains come from orchestrating multi-step workflows across tools.
Observed and reported in interviews (over 30) with stakeholders in engineering and manufacturing workflows describing value from agentic orchestration across tools.
Near-term AI gains cluster around structured, repetitive work and data-intensive synthesis.
Qualitative findings from an exploratory state-of-practice study based on over 30 semi-structured interviews across four stakeholder groups (large enterprises, small/medium firms, AI developers, and CAD/CAM/CAE vendors).
‘Smarter’ AI agents are more profitable.
Measured profits earned by agents of different capability levels in the trading experiment and observed higher profits for higher-capability ('smarter') agents.
‘Smarter’ AI agents perform better at information aggregation.
Experimental comparison of AI agents with different capability levels ('smarter' vs. less smart) in the trading experiment; measured aggregation via log error of last price and found better performance for higher-capability agents.
Prediction markets are robust to cheap talk, market duration, initial price, and strategic prompting.
Synthesis of experimental results showing no change in aggregation performance across manipulations (cheap talk, duration, initial price, strategic prompting).
The median market is effective at aggregating information in the easy information structures.
Controlled laboratory experiment in which AI agents traded in prediction markets after receiving private signals; information aggregation measured by the log error of the last price; comparison across 'easy' information structures using median-market outcomes.
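The aggregation measure can be sketched as follows. The exact definition of "log error of the last price" is not given here, so this example assumes the absolute log ratio between the last traded price and the value implied by pooling all private signals; the market data are toy values, not the paper's results.

```python
import math
from statistics import median


def log_error(last_price, true_value):
    """Absolute log ratio between the last price and the pooled-information value (assumed definition)."""
    return abs(math.log(last_price / true_value))


# Toy markets (NOT from the paper): (last traded price, value implied by pooled signals)
markets = [(0.62, 0.70), (0.71, 0.70), (0.40, 0.70)]
errors = [log_error(price, value) for price, value in markets]

# The "median market" summarizes aggregation across repeated markets.
median_error = median(errors)
print(round(median_error, 4))
```

Using the median rather than the mean keeps a single badly mispriced market from dominating the summary.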
SWE-chat is a living dataset; our collection pipeline automatically and continually discovers and processes sessions from public repositories.
Description of the dataset collection infrastructure and pipeline provided in the paper; operational behavior asserted by authors.
The dataset currently contains 6,000 sessions, comprising more than 63,000 user prompts and 355,000 agent tool calls.
Descriptive statistics reported by the authors based on their dataset collection pipeline (dataset metadata).
We present SWE-chat, the first large-scale dataset of real coding agent sessions collected from open-source developers in the wild.
Paper authorship / dataset description; dataset curated and presented by the paper as a contribution. No external validation provided in excerpt.
Statelessness is the load-bearing property explaining enterprises' preference for weaker but replayable retrieval pipelines, and DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.
Synthesis/conclusion based on theoretical argument and empirical results presented (architectural analysis + experiments showing DPM performance and auditability).
The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench.
Empirical measurement on LongHorizon-Bench reported in the paper: logged LLM calls per decision are 2 for DPM vs 83-97 for summarization.
DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N.
Empirical runtime/efficiency measurement reported in the paper (range 7-15x speedup) comparing number of LLM calls and latency under tight memory budgets.