Evidence (6491 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
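As a reading aid, the counts in a row can be turned into direction shares. The sketch below is a minimal example using three rows copied from the matrix. The function name `direction_shares` and the choice of denominator (the sum of the four listed direction columns rather than the Total column) are assumptions made for illustration. The four columns do not always add up to Total (for Firm Productivity, 435 + 57 + 88 + 20 = 600 against a listed total of 606), and the page does not explain the difference, so the sketch reports it separately instead of guessing.

```python
# Counts copied from three rows of the matrix above:
# (positive, negative, mixed, null, total)
rows = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 92, 10, 2, 173),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the four listed direction counts.

    Assumption: shares are taken over the listed counts, not over Total.
    """
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
        # Claims included in Total but not in any of the four columns.
        "unlisted": total - listed,
    }


shares = {name: direction_shares(*counts) for name, counts in rows.items()}
for name, s in shares.items():
    print(
        f"{name}: {s['positive']:.1%} positive, "
        f"{s['negative']:.1%} negative, {s['unlisted']} unlisted"
    )
```

Shares computed this way only describe how the catalogued claims are distributed; they say nothing about the strength of the underlying evidence.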
Human-AI Collaboration
Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models.
Empirical experiments on 21 problems in six domains using gpt-oss models, as reported in the paper.
We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection.
Methodological contribution described in the paper (framework design and algorithmic description).
We propose seven interface primitives operationalizing verification-centered HCI.
Design contribution: specification of seven interface primitives within the paper (conceptual/design proposal); no user-study or empirical validation reported.
We map synthetic literacy -- oral input generating literate output -- as the defining feature of this transition.
Conceptual mapping and theoretical framing within the paper; supported by examples from technology trends but no empirical evaluation reported.
Knowledge workers become adversarial auditors rather than keystroke-producers.
Projected role-shift based on the verification-bottleneck thesis and interdisciplinary supporting arguments; no empirical longitudinal workforce study reported.
The central contribution identifies the verification bottleneck: as AI collapses production friction, the primary constraint shifts from generation to evaluation.
Theoretical argument supported by literature synthesis across multiple fields; no direct experimental quantification provided.
We contribute design guidelines for specialized AI and articulate a vision for 'ecosystem-aware' Humble AI.
Paper's stated contributions (design guidelines and conceptual vision) described in the abstract.
Qualitatively, participants used AVA as a specialized 'evidence engine'; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations.
Qualitative findings from surveys and 20 interviews reported in the paper (participant quotations and thematic analysis implied in abstract).
Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly.
Quantitative claim reported in the paper based on Difference-in-Differences analysis of usage/engagement data from the evaluation (implicit sample drawn from the >2,200 participants).
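For readers unfamiliar with the method, a Difference-in-Differences estimate in its simplest two-group, two-period form subtracts the change in a comparison group from the change in the group of interest. The sketch below illustrates that arithmetic only. The function `did_estimate`, the group labels, and every number are hypothetical and are not taken from the paper, whose actual specification (covariates, time periods, standard errors) is not described in this summary.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly hours spent on a task (NOT data from the paper).
engaged_pre = [10.0, 12.0, 11.0]
engaged_post = [7.0, 9.0, 8.0]
other_pre = [10.0, 11.0, 12.0]
other_post = [10.0, 10.0, 11.0]

effect = did_estimate(engaged_pre, engaged_post, other_pre, other_post)
print(f"DiD estimate: {effect:+.2f} hours per week")
```

A negative estimate here would mean the engaged group's hours fell by more than the comparison group's. Reading it as time saved relies on the usual parallel-trends assumption, which this toy example cannot check.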
AVA operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection).
Design claim describing implemented mechanisms in the platform; described in the paper as operational features.
AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses.
System design and capability claim in the paper (description of multi-agent pipeline producing evidence-based syntheses).
AVA is a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities.
System description provided in the paper; statement of dataset size and functionality (library count and multilingual support).
Code-generating Artificial Intelligence has gained popularity within both professional and educational programming settings over the past several years.
Background statement in the paper's introduction (observational claim about recent trends in AI adoption).
The emotional effect of the human teammate was significantly more positive and arousing compared to working with Copilot.
Subjective emotion measures (valence/arousal) collected in the study; reported significant differences favoring human teammate on positivity and arousal (n=22).
Several dimensions of participants' workload were significantly reduced when using GitHub Copilot.
Subjective workload measures collected during the experiment; multiple workload dimensions reported as significantly lower in the Copilot condition (n=22).
Participants performed significantly better with GitHub Copilot than with their human teammate.
Experimental comparison of task performance between Copilot-assisted individual condition and human pair condition; statistical significance reported in results (sample size n=22).
Evaluation demonstrates speed improvements of 6-7 minutes over traditional methods.
Reported empirical timing result in paper abstract: 6-7 minutes (presumably time to validate a change) compared to traditional methods (no further detail or sample size in abstract).
Evaluation demonstrates diagnostic coverage of 92-96%.
Reported empirical range in paper abstract (92-96% diagnostic coverage over evaluated cases; specific n not provided in abstract).
Evaluation demonstrates promising results in error detection (100%).
Reported empirical result in paper abstract: 100% error detection over evaluated scenarios (no sample size given in abstract).
By orchestrating agent collaboration atop this digital twin, Aether enables automated, rapid network change validation while reducing manual effort, minimizing errors, and improving operational agility and cost-effectiveness.
High-level claim supported by system design and subsequent empirical evaluation reported in paper (evaluation details referenced in abstract).
Aether agents use a unified Network Digital Twin integrating modeling, simulation, and emulation to maintain a consistent, up-to-date network view for verification and testing.
Design claim describing the digital twin's capabilities (modeling, simulation, emulation) as part of the system; presented in paper text.
Aether features an agentic architecture with five specialized Network Operations AI agents that collaboratively handle the change validation lifecycle from intent analysis to network verification and testing.
System architecture claim in paper describing five specialized agents (design specification; no empirical sample size).
Aether integrates Generative Agentic AI with a multi-functional Network Digital Twin to automate and streamline network change validation workflows.
Paper describes Aether system design and architecture combining agentic AI and a digital twin (design-level claim; architectural description).
A common response to these worries stresses that the goods derived from work can be found elsewhere, often in better activities, suggesting that the proliferation of AI-powered automation does not threaten the meaningfulness of people’s lives.
Description of a commonly offered counterargument in the literature and popular debate (conceptual/literature-summary; no empirical data or sample reported).
The study uses a combination of cognitive systems theory, diplomatic negotiation models, and empirical Human-in-the-Loop experiments as its methodological basis.
Methods description in the paper listing theoretical foundations and empirical HITL experiments as components of the study design.
The paper outlines recommendations for international norm development, capacity building, and the creation of interoperable, transparent AI systems for diplomacy.
Policy recommendation section of the paper proposing international norms, capacity-building measures, and interoperable transparent system design.
Experimental HITL data indicate a 17% reduction in cognitive bias for hybrid human-AI teams.
Human-in-the-Loop (HITL) experiments reported in the paper; comparison of cognitive bias measures between hybrid teams and baseline (sample size not provided in summary).
Experimental HITL data indicate that hybrid human-AI teams achieved 23% faster consensus-building.
Human-in-the-Loop (HITL) experiments reported in the paper; experimental comparison between hybrid human-AI teams and baseline (details on sample size not reported in summary).
The framework is validated through real-world and simulated case studies, including UN ceasefire mediation, EU sentiment-monitoring for conflict diplomacy, and African Union peacekeeping planning.
Validation reported via a set of real-world and simulated case studies described in the paper (case study methodology; specific cases named).
Each layer augments a core dimension of diplomatic reasoning, enabling interpretable AI contributions, foresight analysis, culturally sensitive framing, and legally compliant outputs.
Conceptual mapping of each proposed layer to functional capabilities described in the paper; claimed alignment with interpretability, foresight, cultural framing, and legal compliance.
The study proposes a five-layer Human-AI collaboration architecture tailored to multilateral diplomacy consisting of: (1) Context Modeling, (2) Scenario Generation, (3) Cognitive Interfacing, (4) Decision Support, and (5) Ethical-Normative Governance.
Architectural proposal in the paper based on synthesis of literature and design choices; claimed as the output of the conceptual framework.
This paper develops the concept of Artificial Diplomacy as a structured interface between human strategic cognition and machine-supported reasoning.
Theoretical development drawing on cognitive systems theory and diplomatic negotiation models; described design and conceptual argumentation in the paper.
These divergences (between simulation and human data and across scenarios) provide crucial insights for the future design of human-centered AI agents.
Paper conclusion in abstract indicating practical implications and discussion of how divergences vary across contexts and what that implies for design.
With actual human subjects, AI attributes—particularly transparency—were much more impactful than personality traits.
Abstract reporting results from the human-subjects experiment (N=290) indicating AI attributes, especially chain-of-thought transparency, had greater impact.
In simulation experiments, personality traits and AI attributes were comparatively influential on outcomes.
Abstract claim summarizing simulation experiment results (based on the 2,000 simulated runs) that personality and AI attributes were influential.
Policymakers can reinforce these conditions by shifting from technology-neutral principles to auditable process standards that couple AI investment with reskilling and data-quality obligations.
Policy recommendation based on the study's findings and synthesis; presented as a normative implication rather than empirically tested within the study. (Sample size not reported.)
Leaders should fund training coverage and design (not just headline hours), equip non-specialists to interpret model outputs, pair performance artefacts with participatory routines, and treat explainability as a usability requirement to achieve durable, auditable value in safety-critical energy contexts.
Prescriptive recommendation based on a 'field-tested playbook' synthesised from the multi-case qualitative study (interviews, surveys, documents). The claim is drawn from authors' interpretation of cross-case patterns rather than causal inference. (Sample size not reported.)
Structured upskilling and precise recourse mechanisms are associated with higher confidence, productivity, and clearer sustainability pathways.
Observed association in multi-case qualitative data: interviews, staff/manager surveys, and policy documents; triangulated through thematic coding and cross-case synthesis. (Sample size not reported.)
A tight workflow fit that minimises cognitive overhead at the decision point accelerates legitimate use and strengthens links to emissions monitoring and predictive-maintenance outcomes.
Synthesised from interviews, Likert-scale surveys of technical staff and managers, and internal workflow/policy documents across multiple cases in the energy sector. (Sample size not reported.)
Communicative governance (e.g., model cards, bias tests, validation reports, and explicit appeal rights) earns trust, curbs shadow workarounds, and improves safety culture.
Reported from thematic coding of interviews, surveys of staff and managers, and documentary evidence across multiple cases; triangulation claimed. (Sample size not reported.)
Broad-based capability building beyond specialist teams prevents benefits from concentrating in expert enclaves and reduces brittle scale.
Derived from cross-case thematic synthesis of interviews, Likert surveys of mid-level managers and technical staff, and internal policy/strategy document analysis (multi-case qualitative evidence). (Sample size not reported.)
Three reinforcing levers shape adoption outcomes: (1) broad-based capability building beyond specialist teams, (2) communicative governance that couples transparency with contestability, and (3) a tight workflow fit that minimises cognitive overhead at the decision point.
Qualitative, multi-case design triangulating a semi-structured interview with a senior manager, Likert-scale surveys of mid-level managers and technical staff, and analysis of internal policies and strategy documents; thematic coding with intercoder reliability and cross-case synthesis. (Sample size not reported.)
The framework demonstrates how digital intelligence can enhance supply chain resilience while supporting, rather than replacing, human decision-making (human-centric/planner-centered decision support).
Framework design emphasizes human-centric decision support; field deployment reported to be planner-centered (paper claims support rather than replacement of human decision-making).
The results indicate that upstream textile SMEs can leverage publicly visible e-commerce signals to enhance production planning responsiveness, minimize inventory exposure and dye-lot disruptions, and strengthen resilience to demand uncertainty through planner-centered digital decision support.
Synthesis claim based on model results, validation of comment volume as sales proxy, Monte Carlo-based production guidance, decision dashboard design, and the 12-month field study outcomes.
This research extends the C2M paradigm from downstream retail contexts to upstream textile SMEs and proposes an integrated and operationally feasible intelligence framework for resource-constrained manufacturers.
Conceptual claim supported by the methodological development, large-scale e-commerce data modeling, and a field deployment at one SME reported in paper.
In the same 12-month field study, implementation resulted in a 16% increase in capacity utilization.
Field deployment measurements reported in paper for one Taiwanese dyeing SME over 12 months.
In the same 12-month field study, implementation resulted in a 31% decrease in dye lot changeovers.
Field deployment measurements reported in paper for one Taiwanese dyeing SME over 12 months.
In a 12-month field study at a Taiwanese dyeing SME, implementation resulted in a 28% reduction in inventory value.
Field deployment and before-after (or intervention) measurement reported in paper over 12 months at one Taiwanese dyeing SME.
Forecasts were translated into production guidance using Monte Carlo simulation and a decision dashboard.
Description of operationalization methods in paper: Monte Carlo simulation and a planner-facing decision dashboard used to convert forecasts into production guidance.
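One common way to turn a point forecast into production guidance with Monte Carlo simulation is to draw many demand scenarios around the forecast and pick a quantity that covers a chosen share of them. The sketch below shows that general pattern under stated assumptions: normally distributed forecast error, a zero floor on demand, and a quantile rule for the plan. The function names, the distribution, and all parameter values are hypothetical; the summary does not say which simulation design or decision rule the paper used.

```python
import random


def simulate_demand(forecast_mean, forecast_sd, n_draws=10_000, seed=0):
    """Draw demand scenarios around a point forecast.

    Assumption: normal forecast error, with demand floored at zero.
    """
    rng = random.Random(seed)
    return [max(0.0, rng.gauss(forecast_mean, forecast_sd)) for _ in range(n_draws)]


def production_quantity(scenarios, service_level):
    """Quantity that covers demand in roughly `service_level` of the scenarios."""
    ordered = sorted(scenarios)
    index = min(len(ordered) - 1, int(service_level * len(ordered)))
    return ordered[index]


# Hypothetical forecast: 1000 units with a standard deviation of 150 units.
scenarios = simulate_demand(forecast_mean=1000.0, forecast_sd=150.0)
median_plan = production_quantity(scenarios, 0.50)
cautious_plan = production_quantity(scenarios, 0.90)
print(f"50% coverage: {median_plan:.0f} units, 90% coverage: {cautious_plan:.0f} units")
```

Raising the coverage level increases the planned quantity and therefore inventory exposure, which is the kind of trade-off a planner-facing dashboard could display.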
Consumer comment volume was validated as a proxy for sales activity, facilitating demand estimation.
Validation analysis reported in paper linking consumer comment volume to sales activity (methodological validation; specific statistical details not provided in abstract).