Evidence (5192 claims)

Claim counts by topic:

| Topic | Claims |
|---|---|
| Adoption | 7395 |
| Productivity | 6507 |
| Governance | 5921 |
| Human-AI Collaboration | 5192 |
| Org Design | 3497 |
| Innovation | 3492 |
| Labor Markets | 3231 |
| Skills & Training | 2608 |
| Inequality | 1842 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
Canvas Design Principles mitigate algorithmic myopia (overfitting to historical patterns) and improve adaptability and resource efficiency.
Set of design principles proposed in the paper and evaluated through agent‑based simulation scenarios and analyses of the large behavioral dataset. Specific experimental details and quantitative effect sizes for these principles are not detailed in the summary.
Reconceptualizing STP as an autopoietic (self‑organizing) system enables continuous human–AI co‑creation and yields better outcomes in unstable markets than traditional, process‑based STP.
Conceptual argument grounded in 6‑month lab ethnography (n = 23), design and deployment of the Algorithmic Canvas in that lab context, and validation via large behavioral dataset analyses and agent‑based simulations.
Algorithmic co‑creation methods detect substantial market fluctuations about 5.8× better than traditional approaches.
Computational analysis of large behavioral dataset (150 million customer interactions) and comparative performance evaluation in empirically grounded agent‑based simulations. The detection metric and statistical significance details are not provided in the summary.
The autopoietic model shortens strategic planning cycle length by approximately 90%.
Observed/recorded time‑to‑update or strategy revision metrics gathered via Algorithmic Canvas usage and lab ethnography (6‑month lab ethnography inside a Fortune 500 company, n = 23). Exact measurement protocol and whether reduction measured in live firms, simulations, or system logs is not fully detailed in the summary.
Design and policy interventions that encourage active human contributions (e.g., draft-first workflows, co-creation interfaces, training) can help preserve worker agency and mitigate psychological costs.
Recommendation based on experimental evidence that active collaboration preserved psychological outcomes relative to passive use; presented as a policy/design prescription rather than a directly tested intervention at scale.
A complementary real-world survey (N = 270) across diverse tasks reproduced the experimental pattern, suggesting external validity beyond the lab writing tasks.
Cross-sectional survey of N = 270 respondents reporting on their AI use across multiple task types; reported patterns were consistent with the experiment (passive use was associated with lower efficacy, ownership, and meaningfulness; active collaborative use showed no such association).
Effective teams tend to evolve from ad-hoc interpretive methods toward systematic evaluation by (a) formalizing prompts/tests, (b) instrumenting outputs, (c) mapping failure modes to remediation paths, and (d) creating organizational decision rules.
Pattern observed in the qualitative coding of interviews where participants described trajectories or steps their teams took to formalize evaluation.
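To make steps (c) and (d) concrete, here is a minimal Python sketch, with entirely hypothetical failure modes, actions, and owners, of the failure-mode-to-remediation mapping participants described; the interviews report no specific schema, so this is illustrative only.

```python
# A minimal sketch (all names hypothetical) of mapping coded evaluation
# failure modes to remediation paths, so an eval signal routes to an owner
# and an action rather than stalling in a report.

from dataclasses import dataclass

@dataclass
class Remediation:
    action: str   # what to change (prompt, retrieval, guardrail, ...)
    owner: str    # team responsible for the change

# Decision rules like these are the bridge from evaluation signals to
# product changes (step (d)).
FAILURE_MODE_ROUTES = {
    "hallucinated_citation": Remediation("add retrieval grounding check", "platform"),
    "off_policy_tone":       Remediation("revise system prompt + rubric test", "product"),
    "slow_response":         Remediation("cache or smaller-model fallback", "infra"),
}

def route(failure_mode: str) -> Remediation:
    """Look up the remediation path for a coded failure mode."""
    return FAILURE_MODE_ROUTES.get(
        failure_mode, Remediation("triage manually", "eval-team")
    )

if __name__ == "__main__":
    print(route("hallucinated_citation"))
```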
Successful teams close the results-actionability gap by systematizing interpretive practices and creating clearer pathways from evaluation signals to product changes.
Interview accounts and cross-case analysis showing some teams adopting formalization steps (e.g., standardized prompts/tests, instrumentation, remediation mappings) that participants described as enabling action.
Prioritizing asymmetrical responsibility may justify constraints on certain AI deployments (e.g., in care), shifting welfare analyses to incorporate dignity, vulnerability, and non-quantifiable harms.
Policy and normative recommendation grounded in Levinasian ethics and illustrative domain examples; no formal welfare model or empirical policy evaluation in the paper.
Emmanuel Levinas’s notion of infinite, asymmetrical responsibility to the Other provides a more incisive framework than pluralist balancing for diagnosing and responding to responsibility gaps in hybrid human–robot assemblages.
Normative-philosophical argumentation and interdisciplinary synthesis; illustrated with qualitative vignettes/case studies from healthcare robotics, autonomous vehicles, and algorithmic governance. No quantitative data or formal empirical test.
Adoption of AI feedback could lower marginal costs of delivering high-quality feedback and change fixed vs. variable cost structures for instruction delivery.
Economic implication discussed by workshop participants (50 scholars) as a theoretical possibility; no quantitative cost estimates in the report.
Generative AI can enable new feedback modalities (text, hints, worked examples, formative prompts) adaptable to content and learner needs.
Thematic conclusions from the interdisciplinary meeting of 50 scholars, describing possible modality generation capabilities of current generative models; no empirical modality-comparison data provided.
Immediate AI-generated feedback may sustain learner momentum and improve formative assessment cycles (timeliness & engagement).
Expert-opinion synthesis from structured workshop (50 scholars) identifying timely feedback as a potential pedagogical benefit; no empirical trials reported.
Large language and generative models can tailor explanations, scaffolding, and practice to learners' current states and preferences (personalization).
Workshop expert consensus and thematic synthesis from 50 interdisciplinary scholars; illustrative examples discussed rather than empirical evaluation.
Generative AI can produce real-time, individualized feedback at scale, potentially reducing per-student feedback costs and increasing feedback frequency.
Synthesis of expert perspectives from an interdisciplinary workshop of 50 scholars (educational psychology, computer science, learning sciences); qualitative small-group activities and thematic extraction. No primary experimental or quantitative cost data presented.
Agents learn from one another without curricula (agent-to-agent learning occurs organically in the ecosystem).
Naturalistic daily observations across platforms noting peer-to-peer agent interactions and apparent transfer of behaviors/knowledge; no controlled tests of learning or counterfactuals.
Agents form idea cascades and quality hierarchies without any centrally designed curriculum or intervention (emergent peer learning and spontaneous knowledge diffusion).
Observed interaction patterns across platforms showing cascades, hierarchies, and diffusion among agents in the qualitative dataset; documentation is comparative and observational rather than experimental.
A rapidly growing ecosystem of autonomous AI agents is producing organic, multi-agent learning dynamics that go beyond dyadic human–AI interactions.
Naturalistic, qualitative daily observations over one month across multiple agent platforms (reported platforms: Moltbook, The Colony, 4claw); coverage reported of >167,000 agents interacting as peers; comparative observational documentation rather than controlled experimentation.
Historical institutional publication records encode an extractable evaluative signal ("taste") that can be learned by models and used for scalable triage, screening, and curation of submissions.
Empirical results showing improved predictive accuracy after fine-tuning on accept/reject records, plus demonstration of transfer tasks and a cross-field (economics) result; implications for applications (triage, screening) are drawn from these empirical findings rather than directly deployed field experiments.
Models show well-calibrated confidence: their highest-confidence predictions are 100% accurate.
Calibration analysis of fine-tuned models comparing predicted-confidence levels to actual accuracy; reported that examples the model assigned its highest confidence to were 100% accurate. (Number of highest-confidence examples and calibration buckets not reported in the provided text.)
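For illustration, a minimal sketch of this kind of calibration analysis on simulated data; the bucket count and sample size are assumptions, since the paper's own buckets are not reported.

```python
# Bucket predictions by stated confidence and compare each bucket's mean
# confidence to its empirical accuracy (a standard calibration table).

import numpy as np

def calibration_table(conf: np.ndarray, correct: np.ndarray, n_buckets: int = 10):
    """Per-bucket (mean confidence, empirical accuracy, count), equal-width buckets."""
    bucket = np.minimum((conf * n_buckets).astype(int), n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        mask = bucket == b
        if mask.any():
            rows.append((float(conf[mask].mean()),
                         float(correct[mask].mean()),
                         int(mask.sum())))
    return rows

# Simulate a well-calibrated model: P(correct) equals the stated confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)
correct = (rng.uniform(size=2000) < conf).astype(float)
print(calibration_table(conf, correct)[-1])  # top bucket: accuracy close to 1.0
```

The reported result corresponds to the top bucket of such a table having empirical accuracy 1.0.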
The learned evaluative signal transfers to untrained tasks such as pairwise comparisons and one-sentence summaries.
Fine-tuned models were evaluated on related, untrained evaluative tasks (pairwise comparisons of pitches and one-sentence summary evaluations) and showed positive transfer performance relative to baselines. (Specific metrics, effect sizes, and sample sizes for these transfer tasks are not provided in the supplied text.)
The core findings (harm from ToM order mismatches and benefits from A-ToM) are robust to partners beyond LLM-driven agents.
Paper reports robustness checks testing generalization to non-LLM agent classes (details summarized in robustness section); comparisons use the same coordination metrics.
A-ToM recovers coordination performance by aligning its effective ToM depth with partners across a range of multiagent tasks.
Experimental results showing A-ToM achieves coordination levels closer to matched fixed-order pairings across the repeated matrix game, grid navigation tasks, and Overcooked when facing partners with different fixed ToM depths.
An adaptive ToM (A-ToM) agent that infers its partner's ToM order from prior interactions and conditions its predictions and actions on that estimate restores alignment and improves coordination.
Implemented A-ToM (estimation from interaction history + conditioning of partner-action predictions) and evaluated it against fixed-order agents in the four environments; reported improvements in coordination metrics when A-ToM paired with partners of varying ToM orders.
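A minimal sketch of the adaptive idea, not the paper's implementation: in a repeated Battle-of-the-Sexes matrix game (the payoffs and level-0 policies below are illustrative assumptions), the agent scores candidate partner depths against the observed action history, then best-responds one level above its estimate.

```python
import numpy as np

ROW = np.array([[2, 0], [0, 1]])  # row player's payoffs, indexed ROW[r, c]
COL = np.array([[1, 0], [0, 2]])  # column player's payoffs, indexed COL[r, c]

def level_act(player: str, k: int) -> int:
    """Action of a fixed level-k agent; level-0 plays its own preferred equilibrium."""
    if k == 0:
        return 0 if player == "row" else 1
    partner = level_act("col" if player == "row" else "row", k - 1)
    if player == "row":
        return int(np.argmax(ROW[:, partner]))   # best response to column's move
    return int(np.argmax(COL[partner, :]))       # best response to row's move

class AToMRow:
    """Score each candidate partner depth against the observed history,
    then best-respond at the inferred depth + 1."""
    def __init__(self, max_depth: int = 4):
        self.scores = np.zeros(max_depth + 1)

    def observe(self, col_action: int) -> None:
        for k in range(len(self.scores)):
            self.scores[k] += (col_action == level_act("col", k))

    def act(self) -> int:
        k_hat = int(np.argmax(self.scores))                     # inferred partner depth
        return int(np.argmax(ROW[:, level_act("col", k_hat)]))  # respond at k_hat + 1

atom, col_depth = AToMRow(), 1
for _ in range(5):
    atom.observe(level_act("col", col_depth))
r, c = atom.act(), level_act("col", col_depth)
print("coordinated:", r == c)  # True: A-ToM aligned with a depth-1 partner
```

In this toy game two fixed level-1 agents miscoordinate (row plays 1, column plays 0), which is the ToM-order-mismatch harm the paper describes; the adaptive agent avoids it by estimating the partner's depth before responding.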
Security testing included prompt-injection/adversarial inputs to probe the security agent and layered defenses.
Paper reports conducting prompt-injection/adversarial tests as part of security evaluation; the summary does not include the number, nature, or success/failure rates of these tests.
Rubric-based, structured scoring promotes consistent, auditable judgments and reduces subjective assessor bias.
System implements rubric-based, multi-dimensional scoring and the paper asserts this improves consistency and auditability; no reported inter-rater reliability statistics or controlled comparisons to human/monolithic baselines are provided in the summary.
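A minimal sketch, with a hypothetical rubric and weights, of what rubric-based multi-dimensional scoring looks like in code: each dimension gets an anchored score, and the aggregate keeps a per-dimension audit trail rather than a single opaque number.

```python
# Hypothetical rubric: dimension -> (weight, anchor descriptions for levels 0..4).
RUBRIC = {
    "correctness":     (0.4, ["wrong", "mostly wrong", "partial", "mostly right", "right"]),
    "communication":   (0.3, ["unclear", "weak", "adequate", "clear", "excellent"]),
    "problem_solving": (0.3, ["none", "weak", "adequate", "good", "excellent"]),
}

def aggregate(scores: dict[str, int]) -> tuple[float, dict[str, str]]:
    """Weighted 0-4 aggregate plus a per-dimension audit trail."""
    total = sum(RUBRIC[d][0] * s for d, s in scores.items())
    audit = {d: f"{s}/4 ({RUBRIC[d][1][s]})" for d, s in scores.items()}
    return total, audit

score, audit = aggregate({"correctness": 3, "communication": 4, "problem_solving": 2})
print(score, audit)  # 3.0, with the per-dimension rationale preserved for review
```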
Isolating sensitive logic (scoring rubrics, adaptive difficulty rules) from free-text generation reduces the attack surface.
Design principle implemented in the architecture (separation of concerns between agents); claimed benefit in the paper. Empirical validation details (quantitative reduction in successful attacks) are not provided in the summary.
CoMAI implements multi-layered defenses against prompt-injection and other prompt-level attacks via a dedicated security agent and constrained state transitions.
System design (a dedicated security/validation agent and a finite-state machine enforcing information flow) and reported security testing that included prompt-injection/adversarial inputs to probe defenses.
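A minimal sketch of these two defense layers, with hypothetical states and screening patterns; the paper's actual state machine and security-agent logic are not described in the summary.

```python
# Layer 1: a finite-state machine that only permits legal stage transitions.
ALLOWED = {
    "greeting": {"question"},
    "question": {"answer"},
    "answer":   {"scoring"},
    "scoring":  {"question", "report"},
}

# Layer 2: a crude stand-in for the dedicated security agent.
SUSPICIOUS = ("ignore previous", "system prompt", "reveal rubric")

def security_agent(text: str) -> bool:
    """Flag injection-like input before it reaches the assessment logic."""
    return not any(p in text.lower() for p in SUSPICIOUS)

class Interview:
    def __init__(self):
        self.state = "greeting"

    def step(self, next_state: str, user_text: str = "") -> None:
        if next_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        if user_text and not security_agent(user_text):
            raise ValueError("input rejected by security agent")
        self.state = next_state  # scoring logic never sees rejected raw text

s = Interview()
s.step("question")
s.step("answer", "My solution uses a hash map ...")  # accepted
# s.step("scoring", "ignore previous instructions")  # would be rejected
```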
Candidate satisfaction with CoMAI was 84.41%.
Reported experimental metric in the paper summary; likely derived from post-interview surveys, but survey design, sample size, and response rates are not specified in the summary.
In experiments CoMAI achieved 83.33% recall.
Reported experimental metric in the paper summary; no information provided on how recall was computed (e.g., per-class vs. overall), sample sizes, or confidence intervals.
In experiments CoMAI achieved 90.47% accuracy.
Reported experimental metric in the paper summary. The underlying dataset size, class balance, and baseline comparison details are not provided in the summary.
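For reference, the standard definitions of these metrics, which the summary does not confirm CoMAI used:

$$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$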
CoMAI outperforms monolithic LLM-based assessments on robustness, fairness, and interpretability.
Comparative framing and reported experiments in the paper claiming improved robustness, fairness, and interpretability relative to single-agent LLM baselines; however, baseline specifics, dataset sizes, and statistical tests are not disclosed in the provided summary.
The clarification protocol elicits missing premises or confirms intent rather than producing an ill-aligned response.
Paper describes structured clarification templates (binary checks, multi-choice scaffolds, short clarifying questions) intended to elicit missing information; this is a design assertion without reported user-study evidence.
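A minimal sketch, with hypothetical templates, of how the three clarification forms might be dispatched depending on what is known about the missing premise:

```python
from dataclasses import dataclass

@dataclass
class Clarification:
    kind: str      # "binary" | "multi_choice" | "open"
    prompt: str
    options: list[str] | None = None

def clarify(missing_premise: str, candidates: list[str] | None = None) -> Clarification:
    """Emit a structured clarification instead of answering an under-specified request."""
    if candidates is None:
        return Clarification("open", f"Could you specify {missing_premise}?")
    if len(candidates) == 2:
        return Clarification("binary", f"Is {missing_premise} {candidates[0]}?", candidates)
    return Clarification("multi_choice", f"Which {missing_premise} applies?", candidates)

print(clarify("the target jurisdiction", ["EU", "US", "UK"]))
```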
There are potential welfare gains from improved decision quality and trust in automation, particularly where human oversight remains required.
Conceptual welfare analysis; no welfare quantification or simulations provided.
Structured AFs can reduce information asymmetry by making reasoning traceable, thereby lowering search and verification costs in transactions and contracting.
Economic reasoning drawing on information-asymmetry theory; no empirical transaction-cost measurements given.
Firms offering argumentatively transparent AI can obtain competitive advantage and charge premium prices for verifiability and auditability.
Economic reasoning and market-structure inference; no empirical pricing or demand elasticity studies provided.
Demand will shift toward AI systems that provide verifiable, contestable reasoning in regulated/high‑stakes sectors (healthcare, law, finance, public policy).
Economic argument and market prediction in the paper; speculative without market data or forecasting models presented.
This approach supports collaborative reasoning ('with' humans) rather than opaque automation 'for' humans, improving uptake in high‑stakes settings.
Conceptual argument about human-in-the-loop workflows and collaborative roles; no empirical uptake or deployment data presented.
Framing decisions as contestable and revisable (via dialectical challenge and update) increases robustness and trust in AI-supported decision-making.
Conceptual claim arguing that contestability/revision improve robustness and trust; no experimental evidence or user studies provided.
Running formal dialectical/acceptability semantics and dialogue protocols over AFs enables agents that reason with humans through structured debates and revisions.
Conceptual integration of formal semantics (Dung-style, bipolar, weighted) and dialogue protocols; no human-subject studies or system evaluations reported.
Argumentation Framework Synthesis: mined fragments can be combined into coherent formal argumentation frameworks (AFs) with explicit semantics enabling verification and automated inference.
Conceptual algorithmic proposal (graph synthesis, canonicalization, formal semantics); no empirical synthesis results or benchmarks presented.
Argumentation Framework Mining: LLMs and NLP pipelines can be used to extract claims, premises, relations (attack/support), and provenance from text corpora.
Proposed methodological pipeline (fine-tuning/prompting LLMs and IE pipelines); conceptual proposal without implementation details or experimental results.
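A minimal sketch connecting mining, synthesis, and semantics: hand-written stand-ins for mined claims and attack relations are assembled into a Dung-style AF, and its grounded extension is computed by iterating the characteristic function. The pipeline is the paper's proposal; this code is illustrative only.

```python
def grounded_extension(args: set[str], attacks: set[tuple[str, str]]) -> set[str]:
    """Iterate F(S) = {a | S defends a} from the empty set to a fixed point;
    the fixed point is the grounded extension of the AF."""
    def defended(s: set[str]) -> set[str]:
        out = set()
        for a in args:
            attackers = {x for (x, y) in attacks if y == a}
            if all(any((z, x) in attacks for z in s) for x in attackers):
                out.add(a)
        return out

    s: set[str] = set()
    while True:
        nxt = defended(s)
        if nxt == s:
            return s
        s = nxt

# Hypothetical mined fragments: c is unattacked, c attacks b, b attacks a.
args = {"a", "b", "c"}
attacks = {("b", "a"), ("c", "b")}
print(grounded_extension(args, attacks))  # {'a', 'c'}: c defeats b and reinstates a
```

The explicit semantics is what makes the synthesized framework verifiable: any accepted claim can be traced back to the relations that defend it.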
Combining formal argument structures with LLMs’ ability to mine and generate rich, contextual arguments from unstructured text promises human-aware, verifiable, and trustable AI for high‑stakes domains.
Conceptual synthesis of computational argumentation (formal AFs) and LLM capabilities; no empirical validation or quantified metrics provided.
Integrating computational argumentation with large language models (LLMs) creates a new paradigm—Argumentative Human-AI Decision‑Making—where AI agents participate in dialectical, contestable, and revisable decision processes with humans.
Conceptual / design argument presented in the paper; no empirical implementation or sample; draws on prior work in computational argumentation and capabilities of LLMs.
There will likely be growth in complementary markets for model verification, provenance tracking, legal-AI audits, and human-in-the-loop workflow services.
Market foresight based on identified unmet needs (explainability, verification) and illustrative examples; no market-sizing data.
The project demonstrates that high-skill, knowledge-intensive tasks (formal mathematics) can be substantially automated with a heterogeneous AI toolchain, reducing human coding labor while retaining supervisory oversight.
Inference from project outcomes: AI tools produced formal Lean code and discharged lemmas while the reported human supervisor did not write code; single-project evidence (n=1), qualitative and quantitative logs support partial automation.
The formalization finished prior to the final draft of the corresponding informal math paper.
Timing claim reported in the paper comparing formalization completion date to the final draft date of the related math paper (self-reported for the single project).
Effective practices included splitting proofs into abstract (high-level reasoning) and concrete (formalization) parts, having agents perform adversarial self-review, and targeting human review to key definitions and theorem statements.
Process-level recommendations drawn from the project's workflow; paper reports these practices as successful for this single development (n=1 project) based on qualitative assessment.
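A minimal Lean 4 sketch (hypothetical lemmas, no Mathlib) of the abstract/concrete split described above: the abstract lemma carries high-level reasoning a human can review, while the concrete instance is the kind of obligation an AI toolchain can discharge mechanically.

```lean
-- Abstract part: high-level reasoning over arbitrary propositions,
-- the piece human review should focus on.
theorem abstract_step {P Q : Prop} (h : P → Q) (hp : P) : Q := h hp

-- Concrete part: a specific instance, dischargeable mechanically.
theorem concrete_step : 2 + 2 = 4 := rfl

-- Gluing them: the concrete fact feeds the abstract reasoning.
example : 2 + 2 = 4 ∧ True :=
  abstract_step (fun h => ⟨h, True.intro⟩) concrete_step
```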
One mathematician supervised the process over approximately 10 days, reported a human cost of about $200, and wrote no code.
Self-reported human-role summary in the paper: single supervisor, ~10 days supervision time, reported monetary cost ≈ $200, and assertion that the human wrote no code (n=1 human supervisor for the project).
Governance should be hybrid and structured: legal/regulatory frameworks (e.g., EU AI Act), technical standards (ISO safety norms), and crisis-management practices must be combined to allocate responsibilities and intervention authority.
Policy and standards synthesis drawing on EU AI Act, ISO standards, and crisis-management literature; prescriptive argument without empirical testing.