Evidence (6491 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
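As a reading aid, the direction columns can be turned into shares, and any gap between a row's Total and the sum of its four direction columns can be made explicit. The sketch below does this for three rows copied from the matrix. The helper name `summarize` and the label `unclassified` are assumptions made for illustration; the table itself does not say why some totals exceed the sum of the listed directions.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 92, 10, 2, 173),
}


def summarize(positive, negative, mixed, null, total):
    """Return direction shares (over the listed directions) and the leftover count."""
    listed = positive + negative + mixed + null
    return {
        "positive_share": positive / listed,
        "negative_share": negative / listed,
        # Hypothetical label: claims counted in Total but in no direction column.
        "unclassified": total - listed,
    }


for outcome, row in ROWS.items():
    s = summarize(*row)
    print(
        f"{outcome}: {s['positive_share']:.1%} positive, "
        f"{s['negative_share']:.1%} negative, {s['unclassified']} unclassified"
    )
```

Rows that contain a dash instead of a number would need to be mapped to zero or skipped before calling `summarize`.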
Filter: Human-AI Collaboration
The spectrum focuses attention on where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows.
Conceptual argument in the paper describing the axes/criteria of the spectrum (theoretical/thematic analysis; no empirical data reported).
This paper offers a leadership-facing spectrum for viewing human–AI decision relationships, with five positions: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI.
Conceptual presentation in the paper: a theorized five-position spectrum (no empirical sample or experiment reported).
The paper formalizes these limitations, addresses four alternative views, and proposes a co-existence solution plus a call to action for system builders, benchmark designers, and the memory community.
Meta-claim about the paper's content: formalization, rebuttals, and recommendations stated in the abstract; no empirical sample reported in abstract.
Complementary Learning Systems (CLS) theory shows biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation.
Appeal to established neuroscience theory (CLS); the paper draws on CLS literature to justify the two-system solution in biology; no new empirical sample reported in abstract.
AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall.
Policy/design recommendation based on the paper's analyses of 27K annotated transcripts showing links between user fluency, engagement patterns, failure visibility, recovery, and success.
Individuals should adopt a stance of active engagement rather than passive acceptance.
Interpretive recommendation derived from observed differences in outcomes by user fluency in the 27K annotated transcript analysis (paper’s discussion/recommendation section).
Fluent users' failures are more likely to lead to partial recovery.
Analysis of conversation trajectories in the 27K annotated transcripts showing higher incidence of partial recovery (follow-up iterations leading to partial fix) after failures by fluent users.
Fluent users' failures tend to be visible (a direct consequence of their engagement).
Annotations of failure visibility within the 27K transcripts, comparing frequency of visible vs. invisible failures across fluency levels.
Fluent users take on more complex tasks than novices.
Observational analysis of a richly annotated sample of 27,000 transcripts drawn from the WildChat-4.8M dataset; transcripts were annotated for user fluency and task characteristics (as reported in the paper).
Organizations should cultivate a culture of critical engagement with AI outputs, and e-leadership development must focus on building competencies in mediating, filtering and legitimizing AI contributions within digital workflows.
Recommendations based on thematic analysis of interview data across 34 project managers; presented as implications rather than empirically tested interventions.
To achieve balanced augmentation, leaders must proactively frame AI's role, embedding validation checkpoints and human authorship clauses to maintain accountability.
Prescriptive recommendation derived from thematic findings and cross-case patterns in the 34 interviews; no experimental or longitudinal testing reported.
Proactive engagement combined with creation-oriented use generated the highest effectiveness.
Qualitative coding and cross-case comparisons in the thematic analysis of 34 interviews identified combinations of proactive e-leadership and creation-oriented AI use associated with reported high team effectiveness.
The trajectory of the curvilinear relationship is governed by e-leadership practices.
Interview data analyzed thematically showing recurring references to leadership practices as moderators of AI-use effectiveness across the 34 interviews.
Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.
Paper contribution section — authors present design recommendations derived from study findings (not an empirical claim about an evaluated intervention).
Freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition.
Reported finding from the paper's mixed-methods study (survey + semi-structured interviews with freelance knowledge workers).
Models across all three families acquire interpretable mechanical reasoning strategies without fine-tuning.
Observation reported for the three open-source models used in experiments (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B) showing emergent, interpretable mechanical reasoning during the iterative design process without any model fine-tuning.
The system correctly diagnoses underconstraint failure modes 35.6% of the time.
Reported diagnostic accuracy for underconstraint failure mode in the experimental results (35.6%).
The system correctly diagnoses overconstraint failure modes 56.3% of the time.
Reported diagnostic accuracy for overconstraint failure mode in the experimental results (56.3%).
78.6% of iterative refinement trajectories show measurable improvement.
Reported aggregate statistic from the experimental evaluation of iterative refinement trajectories (the share of trajectories that show measurable improvement, 78.6%).
The modular architecture improves structural validity by up to 134% over monolithic baselines.
Empirical results reported across six motion targets and three models comparing the modular architecture to monolithic baselines; the paper reports an improvement in structural validity of up to 134%.
The modular architecture reduces geometric error by up to 68% over monolithic baselines.
Empirical results reported across six engineering-relevant motion targets and three open-source models comparing the modular architecture to monolithic baselines; the paper states a maximum reduction of geometric error of 68%.
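To make the two "up to" figures concrete, the sketch below computes a signed relative change against a baseline. The baseline and modular scores are hypothetical numbers chosen only so that the arithmetic lands on 134% and 68%; the excerpt does not give the underlying metric values or their definitions.

```python
def relative_change(baseline, value):
    """Signed relative change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline


# Hypothetical scores, not taken from the paper.
# A +134% gain means the modular score is 2.34 times the baseline score.
validity_gain = relative_change(baseline=0.25, value=0.585)

# A 68% reduction means the modular error is 32% of the baseline error.
error_change = relative_change(baseline=1.0, value=0.32)

print(f"structural validity: {validity_gain:+.0%}")
print(f"geometric error: {error_change:+.0%}")
```

Because both figures are maxima across targets and models, they bound the best case and say nothing about the typical improvement.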
Language models can systematically improve linkage designs through symbolic representations.
Reported experiments using a modular architecture combining language-model agents and numerical optimisers across six engineering-relevant motion targets and three open-source models (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B); comparisons reported versus monolithic baselines.
The proposed framework emerged from operational work to improve clinician capability in a live value-based care deployment.
Stated as originating from operational experience in a live deployment; no details on deployment scale, sample size, or outcomes provided in the excerpt.
Training environments that combine longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics.
Normative/theoretical argument presented in the paper; no empirical tests or sample sizes reported in the excerpt.
Chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties for learning: longitudinal density, concentrated decision space, outcome labels, and natural capability variation.
Argument/claim in the paper that outcome-based contracts and chronic disease management produce favorable data characteristics; asserted as part of the framework motivation. No quantitative empirical evidence or sample sizes provided in the excerpt.
We propose a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, which prevents a failure mode we term 'suppression bias'—the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold.
Proposed algorithmic contribution and theoretical claim; suppression bias defined and a mitigation approach described. No empirical evaluation or sample sizes given in the excerpt.
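The idea of alternating between two coupled models can be illustrated with a deliberately small toy. The sketch below is not the paper's architecture: it assumes, purely for illustration, that an observed outcome is the product of a recommendation's value `r[i]` and a clinician's capability `k[j]`, and fits both by alternating closed-form least-squares updates. The function name `alternating_fit` and the multiplicative model are assumptions.

```python
def alternating_fit(y, steps=50):
    """Fit y[i][j] ~ r[i] * k[j] by alternating least-squares updates.

    Assumes positive outcomes so that the normalisation below is well defined.
    """
    n, m = len(y), len(y[0])
    r = [1.0] * n
    k = [1.0] * m
    for _ in range(steps):
        # Step A: hold capabilities fixed, update each reward in closed form.
        kk = sum(v * v for v in k)
        r = [sum(y[i][j] * k[j] for j in range(m)) / kk for i in range(n)]
        # Step B: hold rewards fixed, update each capability in closed form.
        rr = sum(v * v for v in r)
        k = [sum(y[i][j] * r[i] for i in range(n)) / rr for j in range(m)]
        # Remove the scale ambiguity by pinning the largest capability to 1.
        top = max(k)
        k = [v / top for v in k]
        r = [v * top for v in r]
    return r, k


# Toy data: three recommendations, two clinicians (all values hypothetical).
true_r = [2.0, 1.0, 0.5]
true_k = [1.0, 0.5]
outcomes = [[ri * kj for kj in true_k] for ri in true_r]

r_hat, k_hat = alternating_fit(outcomes)
print([round(v, 3) for v in r_hat], [round(v, 3) for v in k_hat])
```

In this toy, a reward estimate fitted only on the second clinician's outcomes would halve every recommendation's apparent value, which is loosely the kind of confounding that a separate capability term is meant to absorb.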
We formulate preferences conditioned on patient state s, organizational context c, and clinician capability κ, where κ decomposes into execution capability (κ-exec) and alignment capability (κ-align).
Presented as a formal model formulation in the paper; theoretical description without empirical sample sizes in the excerpt.
We introduce a five-category override taxonomy that maps override types to distinct model update targets.
Stated as a formal contribution of the framework; taxonomy proposed in the paper. No empirical validation or sample size reported in the excerpt.
Clinician overrides of clinical AI recommendations can be reframed as implicit preference data analogous to reinforcement learning from human feedback (RLHF), but richer because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable.
Conceptual argument presented in the paper drawing an analogy to RLHF; no empirical metrics or sample size reported in the excerpt.
Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.
Authors' conclusion/argument based on the methods and preliminary experimental results presented in the paper (interpretive claim rather than a quantified empirical result).
Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.
Argumentative/theoretical scalability claim based on the abundance of personas and the scalable design of the methodology (no empirical demonstration at millions/billions scale reported).
Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.
Reported runtime and turn-count metrics from the preliminary experiments (per-run runtime >8 hours; per-run average >2,000 turns).
In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.
Reported preliminary experiment count in the paper (explicit statement: 1,000 synthetic computers were created and simulated).
Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.
Description of the two-agent simulation procedure in the paper (simulation design: objective-creating agent and user-acting agent executing tasks across the synthetic computer).
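The two-agent procedure can be sketched as a simple control loop. Everything below is a stand-in: `SyntheticComputer`, `propose_objectives`, `user_turn`, and the `max_turns` safety cap are hypothetical names and values, and the stub agents are deterministic placeholders for what the paper describes as language-model agents.

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticComputer:
    user_role: str
    files: list = field(default_factory=list)


def propose_objectives(computer):
    """Stand-in for the objective-creating agent."""
    return [f"{computer.user_role}: deliverable {i}" for i in range(1, 4)]


def user_turn(computer, objective, turn):
    """Stand-in for one turn of the user-acting agent: write one artifact."""
    label = objective.split()[-1]
    computer.files.append(f"turn{turn:04d}_deliverable{label}.docx")
    return True  # this stub finishes an objective in a single turn


def simulate(computer, max_turns=10_000):
    """Run until every objective is completed or the safety cap is reached."""
    pending = propose_objectives(computer)
    turn = 0
    while pending and turn < max_turns:
        turn += 1
        if user_turn(computer, pending[0], turn):
            pending.pop(0)
    return turn, pending


pc = SyntheticComputer(user_role="analyst")
turns, remaining = simulate(pc)
print(turns, remaining, pc.files)
```

A realistic user agent would need many turns per objective, which is consistent with the long run lengths the paper reports.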
We introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations).
Methodological description and implementation presented in the paper (design and procedures for generating synthetic computers and artifact types).
This work offers a principled foundation for autonomous AI agents that govern themselves the way humans do: not because rules are imposed upon them, but because deliberation is embedded in how they think.
Concluding claim summarizing the proposed framework's conceptual contribution (theoretical/architectural claim; not an empirical measurement).
Implemented on a production-grade retail supply chain workflow, the framework produces zero false escalations to human oversight.
Empirical implementation on a production-grade retail supply chain workflow reported in the paper (claim stated without sample size or measurement protocol in the abstract).
Implemented on a production-grade retail supply chain workflow, the framework achieves 95% compliance accuracy.
Empirical implementation on a production-grade retail supply chain workflow reported in the paper (no sample size or evaluation details provided in the abstract).
We formalize a Pre-Action Governance Reasoning Loop (PAGRL) in which agents consult a four-layer governance rule set (global, workflow-specific, agent-specific, and situational) before every consequential action.
Methodological contribution described in the paper (formalization of a governance loop and four-layer rule hierarchy; no numerical sample given in the abstract).
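A pre-action check over the four rule layers might look like the sketch below. It is a minimal illustration rather than the paper's implementation: the example rules, the thresholds, the name `pre_action_check`, and the policy that the strictest verdict wins are all assumptions. The three verdicts mirror the outcomes the paper names for human self-governance (permissible, requires modification, demands escalation).

```python
from enum import IntEnum


class Verdict(IntEnum):
    ALLOW = 0     # action is permissible
    MODIFY = 1    # action requires modification
    ESCALATE = 2  # action demands escalation to human oversight


LAYERS = ("global", "workflow", "agent", "situational")


def no_large_refunds(action):
    """Hypothetical global rule: refunds above 500 go to a human."""
    return Verdict.ESCALATE if action.get("refund", 0) > 500 else Verdict.ALLOW


def cap_order_quantity(action):
    """Hypothetical workflow rule: orders above 100 units must be adjusted."""
    return Verdict.MODIFY if action.get("quantity", 0) > 100 else Verdict.ALLOW


RULES = {
    "global": [no_large_refunds],
    "workflow": [cap_order_quantity],
    "agent": [],
    "situational": [],
}


def pre_action_check(action, rules=RULES):
    """Consult every layer before acting; the strictest verdict wins (assumed)."""
    verdict = Verdict.ALLOW
    for layer in LAYERS:
        for rule in rules.get(layer, []):
            verdict = max(verdict, rule(action))
    return verdict


print(pre_action_check({"quantity": 10}).name)
print(pre_action_check({"quantity": 500}).name)
print(pre_action_check({"quantity": 500, "refund": 1000}).name)
```

Using an ordered enum keeps the combination rule to a single `max`, so adding a layer or a rule does not change the control flow.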
We propose a neurocognitive governance framework that formally maps this human self-governance process to LLM-driven agent reasoning, establishing a structural parallel between the human brain and the large language model as the cognitive core of an agent.
Theoretical framework and formal mapping presented in the paper (design/proposal rather than empirical validation).
Before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizational rules to evaluate whether an intended action is permissible, requires modification, or demands escalation.
The paper's framing draws on cognitive/neurocognitive literature about human self-governance (presented as background/theoretical justification; no new empirical human-subject data reported in the abstract).
The authors conclude that these findings have implications for responsible and perceptible genAI use in hiring contexts.
Authors' conclusions/recommendations based on the interview findings and analysis.
Participants reported only marginal efficiency gains from genAI despite a seemingly seismic shift in how recruiting happens.
Self-reports from 22 interviewed recruiting professionals indicating small/marginal efficiency improvements.
Individual recruiters also felt compelled to adopt genAI because of the personal need to boost productivity.
Qualitative interview responses (n=22) reporting individual-level productivity motivations for using genAI.
Recruiters often felt compelled to adopt genAI to combat applicant use of AI.
Interview data from 22 recruiting professionals reporting adoption motivations tied to applicants' AI use.
When generative AI (genAI) systems are used in high-stakes decision-making, their recommended role is to aid, rather than replace, human decision-making.
Normative statement presented in the paper (literature/theoretical recommendation), no empirical data reported to support this recommendation within the study.
AI-mediated expert networks are an emerging phenomenon that existing coordination theories fail to account for.
Mentioned as an example in the abstract to motivate a theoretical gap; no empirical data or sample provided.
GitHub Copilot exhibits 'recursive value creation' as an example of an emerging organizational phenomenon enabled by GenAI.
Illustrative example named in the abstract; no empirical measurement or sample reported within the abstract.
UCF provides a theoretical foundation for understanding organizational coordination when GenAI transforms cognitive constraints from scarce to abundant resources.
Position paper asserts UCF as foundational theory for coordination under transformed cognitive constraints; conceptual argument only.
Three emergent organizational forms illustrate UCF principles: cognitive meshworks (coordinated through competence synthesis), algorithmic ecosystems (achieving emergent optimization), and hybrid intelligence collectives (operating through cognitive complementarity).
Conceptual typology and illustrative examples in the position paper; no reported empirical measurement or sample.