Evidence (6491 claims)
Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.
Interpretive result from the paper's qualitative thematic analysis of survey responses (n=860), labeled by the authors as the 'bounded delegation' pattern.
Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.
Reported constraints and desiderata from the thematic analysis of survey responses (n=860).
Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.
Thematic findings from the paper's human-in-the-loop, multi-model council-based analysis of survey responses (n=860).
Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.
Qualitative analysis method described in the paper applied to the survey responses (n=860); result reported as identification of 22 desired AI systems organized into five categories.
BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.
Design claim in abstract describing the benchmark's automated scoring system and rubric size (100+ criteria) defined by expert bankers.
For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict
Explicit reproducibility statement and URL provided in the paper.
SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process?
Statement of research goals and scope in the paper introducing the SciPredict benchmark and accompanying evaluations.
Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment.
Reported stratified accuracy of human experts on SciPredict tasks by self-reported predictability judgments; accuracy rises from ≈5% (when judged not predictable) to ≈80% (when judged predictable).
We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.
Construction of the SciPredict benchmark described in the paper; explicitly reports 405 tasks and 33 sub-fields.
The future of Nagpur's industrial belt depends not on resisting automation, but on an aggressive reskilling strategy to bridge the gap between current workforce capabilities and future technological requirements.
Normative policy conclusion in the paper recommending reskilling as the primary response; based on the paper's analysis of task changes and projected role shifts; no program evaluation or empirical evidence of reskilling effectiveness reported in the excerpt.
There is a projected surge in demand for 'AI-collaborative' roles such as machine maintenance, data supervision, and process optimization.
Projection in the paper based on analysis of task complementarities between humans and AI, listing specific roles expected to grow; no quantitative demand estimates or sample sizes provided in the excerpt.
A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
Design/implementation claim in paper describing deployment approach using YAML configuration rather than engineering work.
We introduce governability — how reliably a system knows when it should not act autonomously — as a primary evaluation axis for institutional AI alongside accuracy.
Conceptual contribution/metric proposed by authors in paper; no empirical validation reported in the excerpt.
Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.
Empirical benchmark reported in paper on the 11-case evaluation set; counts of silent errors given for Cognitive Core and baselines.
Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.
Empirical benchmark reported in paper on the 11-case evaluation set; accuracies explicitly stated for three systems.
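With only 11 cases, each reported accuracy can correspond to only a few whole-number counts of correct cases. The sketch below is a back-of-envelope check, not something stated in the excerpt: it assumes the percentages were rounded to the nearest whole number, and the helper name `counts_matching` is hypothetical.

```python
def counts_matching(pct: int, n_cases: int) -> list:
    """Return every count k of correct cases whose rounded accuracy equals pct."""
    return [k for k in range(n_cases + 1) if round(100 * k / n_cases) == pct]


N_CASES = 11  # size of the evaluation set stated in the excerpt

# Reported accuracies (percent) for the three systems.
reported = {"Cognitive Core": 91, "ReAct": 55, "Plan-and-Solve": 45}

# Inferred counts under the rounding assumption above.
implied = {name: counts_matching(pct, N_CASES) for name, pct in reported.items()}

# Size of one case in percentage points on an 11-case set.
points_per_case = round(100 / N_CASES, 1)
```

Under that assumption the figures are consistent with 10, 6, and 5 correct cases out of 11, and a single case moves accuracy by about 9 percentage points.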
We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.
Design/proposal described in paper (architectural specification); no empirical evaluation reported for the architecture itself in the excerpt.
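To illustrate what a tamper-evident SHA-256 hash chain means in general, here is a minimal sketch. It is not the paper's implementation: the entry fields, the `GENESIS_HASH` sentinel, and the function names are all assumptions made for illustration. Each entry stores the hash of the previous entry, so editing an earlier record invalidates the chain on verification.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # hypothetical sentinel used as the "previous hash" of the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Append a payload, linking it to the hash of the last entry."""
    prev = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(ledger: list) -> bool:
    """Recompute every link; any edited payload or broken link fails the check."""
    prev = GENESIS_HASH
    for entry in ledger:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


# Usage: two illustrative entries (field names are hypothetical).
ledger = []
append(ledger, {"primitive": "retrieve", "tier": 1})
append(ledger, {"primitive": "govern", "tier": 3})
intact = verify(ledger)

# Editing an earlier record after the fact breaks verification.
ledger[0]["payload"]["tier"] = 4
tampered_ok = verify(ledger)
```

The chain only makes tampering detectable. Preventing it would also require protecting the most recent hash, which this sketch does not attempt.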
Organisations should invest in customisation capabilities for AI recruitment tools, implement comprehensive change management strategies, and maintain robust post-hire evaluation procedures.
Authors' recommendations derived from thematic findings and participant perspectives across two firms (qualitative synthesis of n = 22 interviews).
AI functioned optimally as an augmentative technology rather than as a replacement for human decision-makers in recruitment.
Findings: participants across the two case firms described AI being most effective when augmenting human judgment rather than replacing it (interviews n = 22).
AI significantly enhanced efficiency through process standardisation and automation.
Findings based on participant accounts in thematic analysis (interviews n = 22) describing process optimisation and automation benefits.
Participants in the treatment conditions showed greater positive belief change about the AI across the session.
Pre/post measures of participant beliefs collected during the field experiment (N=388) showing larger positive shifts among those assigned to treatment conditions versus controls.
A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution.
Field experiment with 388 employees comparing cognitive scaffolding to other conditions; reported improvements concentrated at the top of the individual document-quality distribution.
LLMs coordinate extremely well on similar actions.
Empirical observation from the experiment showing high coordination performance by LLMs when alignment on similar actions is the equilibrium; qualitative description in the abstract without reported quantitative metrics.
Like humans, [LLMs] regulate [action similarity] in response to coordination incentives (strategic monoculture).
Empirical claim based on experimental results comparing how humans and LLMs change similarity when incentives for coordination/divergence are manipulated. No numerical details in excerpt.
LLMs exhibit high levels of baseline similarity (primary monoculture).
Empirical observation from the experiment comparing baseline action similarity across LLM subjects (relative level described qualitatively in paper). Specific sample sizes and quantitative metrics not provided in the excerpt.
We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects.
Methodological claim: authors report implementing an experiment that separates baseline similarity from strategic adjustments and applying it to human participants and LLM agents. No sample sizes or procedural details provided in the excerpt.
We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives.
Conceptual/theoretical distinction proposed in the paper (definition and taxonomy introduced by the authors). No empirical sample size reported for this conceptual claim in the provided text.
In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small.
Experimental comparison within the preregistered studies of eight behavioural persuasion strategies versus the best attitudinal persuasion strategy; results reported in paper showing each behavioural strategy exceeded the attitudinal strategy and that variation among the eight behavioural strategies was small.
We replicated prior findings that information provision drove effects on attitudes.
Experimentally manipulating information provision within the preregistered studies and observing effects on attitudinal outcomes, consistent with prior literature (sample reported in paper).
We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing).
Experimental results from the two preregistered studies reported in the paper; example effect explicitly reported as +19.7 percentage points increase in petition signing. Overall sample reported as N=17,950 responses.
Organizations that strategically invest in blended, context-rich, and partnership-based development programs position themselves for sustainable competitive advantage in an increasingly automated marketplace.
Normative recommendation supported by the paper's synthesis of theory and practice (organizational development, adult learning, workforce development); no empirical effect sizes or sample-size-based evaluation provided.
Forward-thinking organizations are redesigning learning architectures to cultivate irreplaceable human capabilities that complement rather than compete with AI systems.
Synthesis of literature from organizational psychology, adult learning theory, and workforce development practice cited in the paper; presented as descriptive statement about current organizational practice rather than based on a reported empirical study with sample size.
Corporate and academic learning ecosystems will necessarily converge.
Conceptual synthesis and argumentation in the paper referencing workforce development practice and organizational development research; no quantitative measures or sample size reported.
Human skills (critical thinking, adaptive decision-making, interpersonal acumen) will be elevated to core competency status as AI automates technical tasks once considered core competencies.
Argument and synthesis presented in the paper drawing on organizational psychology, adult learning theory, and workforce development practice; no empirical sample size or statistical tests reported (conceptual/literature-based claim).
A machine-learning research agenda is needed centered on team-level evaluation, privacy-preserving memory layers, scaffolded AI for learning, carbon-aware routing, and pro-agency workflow design.
Prescriptive recommendation in the position paper proposing specific research priorities; no empirical evaluation of these approaches is presented within the paper itself.
Rather than eliminating the office, this shift supports selective co-presence, reserving in-person time for tasks with high tacitness, high coupling, or high relational stakes (including apprenticeship, conflict repair, trust formation, and early-stage synthesis).
Theoretical/qualitative argument about task types best suited for in-person interaction; illustrated by examples (apprenticeship, conflict repair, trust formation, early-stage synthesis); no empirical task-level allocation study presented.
Capabilities that are already widely deployed—transcription, summarization, retrieval, translation, drafting, and code assistance—are the basis for this shift (with bounded agents as an amplifying but not necessary extension).
Descriptive claim citing the prevalence of specific AI capabilities in current deployments; presented as observation in the position paper rather than as a quantified adoption study.
The organizational significance of these systems is not generic automation but the accumulation of artifact capital: durable, queryable, reusable traces such as transcripts, summaries, decisions, tickets, code comments, and retrieval layers.
Argumentative claim in the paper describing a conceptual mechanism ('artifact capital') by which foundation-model features create reusable organizational artifacts; no empirical measurement of artifact capital provided.
The foundation-model stack (NL interaction, multimodal capture, long context, retrieval, transcription, translation, bounded tool use) changes the coordination economics that previously favored daily in-person co-presence.
Conceptual claim supported by descriptions of foundation-model capabilities and their potential to create durable, queryable artifacts; no empirical test or measured coordination-costs reported.
Remote-capable knowledge work should default to AI-enabled flexibility because the workflow-integrated foundation-model stack changes the coordination economics that once favored daily co-presence.
Normative argument in the position paper based on conceptual analysis of coordination economics and the claimed effects of foundation-model features; no empirical sample or quantitative study reported.
Preliminary corroboration is provided by a companion production automation system with eleven operating lanes and 2,132 classified tickets.
Reported companion system operational statistics in the paper (11 lanes, 2,132 tickets).
When iteration was permitted, the final success rate for the structured interactions reached 91.5% (183 of 200).
Reported final success counts/rate in the paper for structured interactions (183 of 200).
Among structured interactions, 110 of 200 were accepted on first pass.
Reported counts in the paper for the structured-interaction group (110 accepted of 200 structured interactions).
Structured context assembly was associated with an improvement in first-pass acceptance from 32% to 55%.
Observational comparison reported in the paper (baseline vs. structured first-pass acceptance rates are given as 32% and 55%).
Structured context assembly was associated with a reduction from 3.8 to 2.0 average iteration cycles per task.
Observational comparison reported in the paper (structured vs. baseline interactions); the paper states the 3.8 to 2.0 cycle figures.
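The counts and rates reported for the structured interactions can be cross-checked with simple arithmetic. The sketch below uses only figures quoted above. The percentage-point gain and the relative cycle reduction are derived here, not reported in the excerpt.

```python
STRUCTURED_TOTAL = 200   # structured interactions
FIRST_PASS = 110         # accepted on first pass
FINAL_SUCCESS = 183      # successful once iteration was permitted

first_pass_rate = round(100 * FIRST_PASS / STRUCTURED_TOTAL, 1)   # percent
final_rate = round(100 * FINAL_SUCCESS / STRUCTURED_TOTAL, 1)     # percent

# Derived quantities (not stated in the excerpt).
first_pass_gain_points = 55 - 32                                  # percentage points over the 32% baseline
cycle_reduction_pct = round(100 * (3.8 - 2.0) / 3.8, 1)           # relative drop in average iteration cycles

print(first_pass_rate, final_rate, first_pass_gain_points, cycle_reduction_pct)
```

The computed rates match the reported 55% and 91.5%. The derived figures are a 23-point gain in first-pass acceptance and roughly a 47% reduction in average iteration cycles.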
The paper applies formal models from reliability engineering and information theory as post hoc interpretive lenses on context quality.
Paper text claiming the application of these formal models for interpretation.
Context Engineering applies a staged four-phase pipeline (Reviewer to Design to Builder to Auditor).
Methodological description in the paper listing the four pipeline phases.
Context Engineering defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata).
Explicit specification in the paper of the five-role package components.
This paper introduces Context Engineering, a structured methodology for assembling, declaring, and sequencing the complete informational payload that accompanies a prompt to an AI tool.
Methodological description in the paper (definition and presentation of the Context Engineering approach).
The review integrates fragmented literature into a cohesive framework and offers implications for managers and policymakers to pursue more balanced, inclusive, and context-sensitive AI adoption strategies.
Author-stated contribution of the review based on synthesis of the 40 included studies; normative recommendations derived from the review.
Generative AI adoption is associated with mixed employee perceptions: some studies report increased efficiency and higher job satisfaction.
Aggregate finding from included studies in the review that report positive employee-reported outcomes (efficiency, satisfaction).