Evidence (6491 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
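The matrix is easier to compare across outcomes when raw counts are turned into shares. The following is a minimal Python sketch using three rows copied from the table above. The function name `direction_shares` is hypothetical, and shares are computed over the four listed directions rather than the Total column, because the two do not always agree (for example, the four Firm Productivity directions sum to 600 while the Total column reads 606).

```python
from typing import Dict

# Counts copied from three rows of the matrix (Positive, Negative, Mixed, Null).
ROWS: Dict[str, Dict[str, int]] = {
    "Firm Productivity": {"positive": 435, "negative": 57, "mixed": 88, "null": 20},
    "AI Safety & Ethics": {"positive": 218, "negative": 277, "mixed": 65, "null": 33},
    "Job Displacement": {"positive": 12, "negative": 80, "mixed": 20, "null": 1},
}


def direction_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each direction's share of the summed direction counts."""
    total = sum(counts.values())
    return {direction: n / total for direction, n in counts.items()}


if __name__ == "__main__":
    for outcome, counts in ROWS.items():
        shares = direction_shares(counts)
        summary = ", ".join(f"{d}: {s:.1%}" for d, s in shares.items())
        print(f"{outcome}: {summary}")
```

Normalizing this way makes it visible that an outcome such as Job Displacement is dominated by negative findings even though its absolute counts are small.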
Human-AI Collaboration
A box-level analysis confirms that the uncertainty cues redirect annotator effort toward high-uncertainty predictions and away from well-localized boxes.
Box-level analysis reported in paper comparing annotator behavior across predicted boxes with differing localization uncertainty; analysis shows effort reallocation toward boxes labeled as high-uncertainty.
In the same controlled study, participants who received uncertainty cues were faster overall (reduced annotation time).
Same controlled user study with 120 participants comparing interfaces with and without spatial-uncertainty visualizations; paper reports that participants with cues were faster overall.
In a controlled study with 120 participants, those receiving uncertainty cues achieved higher label quality.
Controlled user study reported in the paper; 120 participants; comparison between annotators who received visualized spatial-uncertainty cues via a purpose-built interface and those who did not; paper reports label quality outcomes.
The model identifies simple measures/conditions that characterize when productivity paradoxes and skill polarization arise.
Theoretical derivations and analytical characterizations within the model yielding threshold conditions and measures parameterizing when paradoxical outcomes occur (model-based; no empirical validation).
Replicating the within-subject experiment with simulated users recovers aggregate model hierarchies (i.e., the same ranking of models at the population level).
A replication of the human within-subject experiment using simulated users; authors report that aggregate model ranking/hierarchy is preserved between simulators and humans.
People reward sycophancy and relationship-seeking behaviours in short-term evaluations.
Participant judgments in the blinded multi-turn conversations (same 530-participant experiment) indicated higher short-term preference ratings for outputs exhibiting sycophancy/relationship-seeking.
Preference fine-tuning (P-DPO) significantly outperforms both a generic model and personalised prompting in blinded multi-turn conversations with human participants.
Within-subject blinded multi-turn conversation experiment with 530 human participants comparing P-DPO, a generic model, and personalised prompting; statistical comparison reported in paper (claimed 'significantly outperforms').
A digital twin analytics platform validation shows that a single codebase with domain-specific ontology configurations eliminates tool-call hallucination and achieves cross-domain configurability without application code changes.
Validation/demonstration reported in the paper using a digital twin analytics platform; platform demonstration claimed to eliminate tool-call hallucination and enable cross-domain configurability via configuration only.
In the same controlled experiment, ontology-grounded parameters reduced domain-identifier hallucination to 0%.
Same controlled experiment (six industry configurations, 72 tool invocations with Qwen3-32B) reported in the paper; ontology-grounded parameter condition produced 0% hallucination.
The architecture is formalized as a three-operation interface contract — resolve, contextualize, annotate — with invariants enforced by an AIOps orchestration layer.
Design specification and formalization presented in the paper (architectural description).
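To make the three-operation contract concrete, here is a minimal Python sketch. Only the operation names (resolve, contextualize, annotate) come from the claim above; the `Entity` type, the method signatures, the `InMemoryOntology` class, and the single invariant shown (unknown identifiers are rejected) are assumptions for illustration, not the paper's specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Entity:
    """Hypothetical record for an ontology-backed identifier."""
    identifier: str
    kind: str
    annotations: List[str] = field(default_factory=list)


class OntologyInterface(Protocol):
    """The three operations named in the claim; signatures are assumed."""
    def resolve(self, name: str) -> Entity: ...
    def contextualize(self, entity: Entity) -> Dict[str, str]: ...
    def annotate(self, entity: Entity, note: str) -> Entity: ...


class InMemoryOntology:
    """Toy implementation backed by a dict of identifier -> kind."""

    def __init__(self, entities: Dict[str, str]) -> None:
        self._entities = dict(entities)

    def resolve(self, name: str) -> Entity:
        # Assumed invariant: identifiers outside the ontology are rejected.
        if name not in self._entities:
            raise KeyError(f"unknown identifier: {name}")
        return Entity(identifier=name, kind=self._entities[name])

    def contextualize(self, entity: Entity) -> Dict[str, str]:
        return {"identifier": entity.identifier, "kind": entity.kind}

    def annotate(self, entity: Entity, note: str) -> Entity:
        # Re-resolve so that only known identifiers can carry annotations.
        self.resolve(entity.identifier)
        return Entity(entity.identifier, entity.kind, entity.annotations + [note])
```

Rejecting unknown identifiers at `resolve` is one simple way a tool layer could refuse a hallucinated domain identifier before any tool is invoked.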
Embedding manufacturing ontology directly into the AI tool layer as a typed relational configuration enforces semantic constraints at runtime and closes the semantic training gap.
Proposed system architecture described and argued in the paper; validated via demonstrations and experiments described later in the paper.
Sustainable progress requires collaborative integration of humans and machines, rather than replacement.
Normative conclusion/recommendation stated in the paper based on study findings (argument for augmented intelligence over replacement).
This research presents the innovative Marketing Intelligence Operations (MIO) Framework and a practical AI Adoption Readiness Scorecard, enabling leaders to manage the operational balance between transformative efficiency improvements and human capital vulnerability.
Paper states that it introduces a new framework and a practical scorecard as deliverables of the research (descriptive claim about the paper's contributions).
AI-integrated Marketing Intelligence Operations (MIO) quantitatively improves campaign Return on Investment (ROI) by 47%.
Reported as an empirical result from the paper's mixed-methods study (the paper states use of audits, surveys, and NLP analysis to evaluate MIO outcomes).
Deploying LegalCheck in the Municipality of Amsterdam demonstrated substantial efficiency gains, improved legal consistency, and positive user acceptance.
Summary claim based on the real-world deployment outcomes described in the paper (timing improvements, consistency/factual accuracy statements, and reported positive reception by professionals); specific quantitative metrics and sample sizes are not fully reported in the excerpt.
The system produced explainable outputs based on actual regulations and prior cases, providing citations/explainability that support legal reasoning.
Paper describes retrieval from curated legal knowledge bases and generation of outputs grounded in regulations and prior cases during the Amsterdam deployment; presented as a feature of the system and supported by expert review.
LegalCheck uses a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) with curated legal knowledge bases and controlled prompting to retrieve relevant laws and precedents and incorporate case-specific details into coherent drafts.
System architecture and methodology described in the paper (design/implementation claim).
Legal professionals found that the system ensured a consistent application of legal standards without replacing human judgment.
Reported qualitative feedback from professionals in the Municipality of Amsterdam deployment and the system design that includes an expert-in-the-loop review; no formal measurement of 'replacement' was reported.
Legal professionals found that the system reduced their workload.
Reported user feedback from legal professionals during the Municipality of Amsterdam deployment; qualitative statements that professionals experienced workload reduction (no numeric workload metrics or sample size reported).
The system's output captured the vast majority of required legal reasoning—often 80% to 100% of essential content.
Reported coverage statistic from the deployment/evaluation described in the paper (phrased as 'often 80% to 100% of essential content'); exact evaluation method, sample size, and measurement protocol are not provided in the excerpt.
LegalCheck maintained high legal consistency and factual accuracy when generating draft letters.
Evaluation during real-world deployment with expert-in-the-loop review and feedback from legal professionals in the Municipality of Amsterdam; claims of high consistency and factual accuracy are reported but no formal numeric accuracy metric or sample size is provided in the text.
LegalCheck produced near-final advice letters in minutes rather than hours.
Reported results from a real-world deployment within the Municipality of Amsterdam; system logs / timing comparisons between human drafting time (hours) and LegalCheck-assisted drafting time (minutes) are described in the paper (no explicit numeric sample size reported).
We outline a research program for the runtime systems that foundation-model software agents will require.
Paper claims to present a forward-looking research agenda or program (stated in abstract); this is a conceptual contribution rather than an empirical finding.
Applied to a controlled validation task, the framework yields episode packages whose evidence structure varies systematically with harness level: lower levels produce only a final patch, while higher levels produce reproduction logs, failure attributions, deterministic requirement checks, and structured verification reports.
Empirical application described in the abstract: framework applied to a controlled validation task showing systematic variation in episode-package evidence structure across harness levels. The abstract does not report sample size or statistical measures.
We propose a trace-based evaluation protocol that converts each agent run into an auditable episode package.
Methodological proposal described in the abstract proposing a trace-based protocol and an auditable episode package format; no quantitative evaluation details provided in the abstract.
We operationalize the harness through a four-level ladder (H0–H3) that progressively exposes runtime support to the agent.
Design contribution described in the paper (abstract) introducing a four-level ladder (H0–H3) as an operationalization of the harness concept.
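The harness ladder and the episode package described in these claims can be pictured as a small data structure. The sketch below is an assumption-laden illustration in Python: the level names H0 to H3 and the evidence types (final patch, reproduction log, failure attribution, requirement checks, verification report) come from the claims, while the class names, field names, and the choice to model evidence as optional fields are hypothetical. It deliberately does not encode which level produces which artifact, since the claims do not state that mapping.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class HarnessLevel(IntEnum):
    """Four-level ladder; higher values expose more runtime support."""
    H0 = 0
    H1 = 1
    H2 = 2
    H3 = 3


@dataclass
class EpisodePackage:
    """Hypothetical auditable record of a single agent run."""
    level: HarnessLevel
    final_patch: str
    reproduction_log: Optional[str] = None
    failure_attribution: Optional[str] = None
    requirement_checks: Optional[List[bool]] = None
    verification_report: Optional[str] = None

    def evidence_kinds(self) -> List[str]:
        """List which kinds of evidence this package actually contains."""
        optional = {
            "reproduction_log": self.reproduction_log,
            "failure_attribution": self.failure_attribution,
            "requirement_checks": self.requirement_checks,
            "verification_report": self.verification_report,
        }
        present = [name for name, value in optional.items() if value is not None]
        return ["final_patch"] + present
```

Comparing `evidence_kinds()` across runs at different levels is one way to check whether the evidence structure grows with the harness level, as the validation task reports.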
Foundation models have transformed automated code generation.
Statement in paper's abstract referring to broad impact of foundation models on automated code generation; likely supported by citations and literature overview within the paper (no sample size or quantitative study reported in the abstract).
Authorship preservation should be a design priority for AI tools deployed in identity-relevant, behavior-dependent tasks.
Authors' recommendation based on experimental results showing negative motivational and behavioral consequences of delegating authorship to LLMs despite improved objective goal quality.
Mediation analyses identified psychological ownership as the mechanism: it mediated the authorship effect on every downstream motivational and behavioral outcome, while objective goal quality did not.
Mediation analyses reported in the preregistered experiment (authors tested psychological ownership and objective goal quality as mediators of authorship effects on multiple downstream outcomes); preregistered N = 470.
At two-week follow-up, 72.8% of self-authored participants had acted on two or more of their goals, compared to 46.6% in the LLM condition.
Behavioral follow-up measure collected two weeks after the intervention in the preregistered experiment; percentages reported in the paper/abstract. (Follow-up completion N not specified in the abstract.)
LLM-generated goals scored higher on SMART criteria (specificity, measurability, achievability, relevance, and time-boundedness).
Preregistered randomized experiment comparing self-authored vs LLM-authored goals derived from a personal reflection; reported effect size d = 2.26; total preregistered N = 470.
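For readers unfamiliar with the effect size quoted here, Cohen's d is the difference between two group means divided by a standard deviation. The Python sketch below uses the common pooled-standard-deviation form; whether the paper used this exact variant is an assumption, and the function name `cohens_d` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

Under this definition, d = 2.26 would mean the two group means sit more than two pooled standard deviations apart, which is very large by conventional benchmarks.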
As an initial diagnostic intervention, a probe-guided logit adjustment (PGLA) re-injects the encoded mismatch signal into decoding and consistently improves rejection behavior.
Intervention experiments applying PGLA to model decoding on IMAVB; reported consistent improvements in the models' tendency to reject misleading premises after logit adjustment guided by probes.
We introduce IMAVB, a curated 500-clip benchmark of long-form movies with a 2x2 design crossing target modality (vision, audio) and premise condition (standard, misleading), which lets us measure conflict detection separately from ordinary multimodal comprehension.
Description of new benchmark introduced in paper: 500 clips, 2x2 design (vision vs audio × standard vs misleading premises); used to measure conflict detection independently of standard multimodal QA.
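The 2x2 design can be enumerated directly, which also shows how per-cell scores would be kept separate. This is a minimal Python sketch: the factor levels come from the benchmark description, while the record format, the `tally` function, and the use of accuracy as the metric are assumptions. No per-cell clip counts are assumed, since the description gives only the 500-clip total.

```python
from itertools import product
from typing import Dict, Iterable, Optional, Tuple

MODALITIES = ("vision", "audio")
PREMISES = ("standard", "misleading")
CELLS = list(product(MODALITIES, PREMISES))  # the four design cells

Record = Tuple[str, str, bool]  # (modality, premise, model answered correctly)


def tally(records: Iterable[Record]) -> Dict[Tuple[str, str], Optional[float]]:
    """Return accuracy per design cell, or None for a cell with no records."""
    counts = {cell: [0, 0] for cell in CELLS}  # cell -> [correct, total]
    for modality, premise, correct in records:
        cell = (modality, premise)
        counts[cell][0] += int(correct)
        counts[cell][1] += 1
    return {
        cell: (hits / total if total else None)
        for cell, (hits, total) in counts.items()
    }
```

Keeping the misleading-premise cells apart from the standard ones is what allows conflict detection to be scored independently of ordinary comprehension.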
The Agent-First paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
Conceptual argument and mapping presented in the paper asserting interoperability/orthogonality with transport-layer standards (e.g., MCP).
Agent-First APIs improve autonomous error recovery by 5.8x (compared to optimized CRUD baselines).
Reported comparative experiments on 50 real operational tasks measuring autonomous error recovery capability.
Agent-First APIs reduce required human interventions by 72.7% (compared to optimized CRUD baselines).
Same set of comparative experiments on 50 real operational tasks reported in the paper.
Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve an 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 24-percentage-point gain, or 37.5% in relative terms).
Empirical comparative experiments reported in the paper on 50 real operational tasks, comparing Agent-First APIs to optimized CRUD baselines.
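The 37.5% figure is a relative gain, not a percentage-point difference. The short Python sketch below reproduces the arithmetic; the function names are hypothetical, and the implied success counts assume both conditions were evaluated on the same 50 tasks, which the claim suggests but does not state explicitly.

```python
def relative_improvement(new_rate: float, old_rate: float) -> float:
    """Relative change of new_rate over old_rate, as a fraction."""
    return (new_rate - old_rate) / old_rate


def implied_successes(rate: float, n_tasks: int) -> int:
    """Number of successful tasks implied by a success rate."""
    return round(rate * n_tasks)


if __name__ == "__main__":
    agent_first, crud = 0.88, 0.64
    print(f"relative gain: {relative_improvement(agent_first, crud):.1%}")
    print(f"percentage-point gain: {(agent_first - crud) * 100:.0f}")
    print(f"implied successes: {implied_successes(agent_first, 50)} vs {implied_successes(crud, 50)}")
```

With only 50 tasks, the gap corresponds to a difference of 12 tasks, which is worth keeping in mind when reading the headline percentages.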
The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains.
Reported production implementation and deployment statistics (platform with 85 registered tools spanning 6 business domains).
We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation.
Design and specification presented in the paper (proposed architecture and components).
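A rough picture of the first two mechanisms can be given in Python. The six verb names and the three kinds of decision-support metadata (confidence scores, evidence chains, suggested next actions) come from the claim above. Everything else is an assumption made for illustration: the class names, the field names, the linear ordering of verbs, and the rule that any failure routes to recover.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Verb(Enum):
    SEARCH = "search"
    RESOLVE = "resolve"
    PREVIEW = "preview"
    EXECUTE = "execute"
    VERIFY = "verify"
    RECOVER = "recover"


# Assumed success-path ordering; the paper may define transitions differently.
HAPPY_PATH = [Verb.SEARCH, Verb.RESOLVE, Verb.PREVIEW, Verb.EXECUTE, Verb.VERIFY]


def suggest_next(verb: Verb, ok: bool) -> Optional[Verb]:
    """Hypothetical rule: failures go to RECOVER, successes advance one phase."""
    if not ok:
        return Verb.RECOVER
    if verb in HAPPY_PATH[:-1]:
        return HAPPY_PATH[HAPPY_PATH.index(verb) + 1]
    return None  # VERIFY succeeded, or RECOVER succeeded: nothing suggested


@dataclass
class ToolResponse:
    """Hypothetical shape of a Normalized Tool Contract response."""
    verb: Verb
    ok: bool
    confidence: float
    evidence_chain: List[str] = field(default_factory=list)
    suggested_next: Optional[Verb] = None

    def __post_init__(self) -> None:
        if self.suggested_next is None:
            self.suggested_next = suggest_next(self.verb, self.ok)
```

Returning a suggested next action with every response is what would let an agent recover from a failed step without a human deciding what to try next.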
LLMs can help generate code that is more correct and functional than participant-generated solutions.
Comparative analysis of generated solutions reported in the paper (the number of solutions compared is not explicitly stated in the abstract). The paper states that LLM-assisted solutions were more correct and functional.
Qualitative analysis of participants' interactions and interviews revealed four different human-LLM collaboration modes supporting various problem-solving strategies.
Qualitative analysis of interaction logs and retrospective interviews from the study participants (N=20) reported in the paper; identification of four collaboration modes described.
We conducted a within-subject study followed by retrospective interviews with programmers (N=20).
Stated methods in the paper: within-subject experimental design plus retrospective interviews; sample size explicitly given as N=20.
AwareLLM opens new avenues for Human-AI collaboration where technology adapts to users' needs rather than users adhering to technological constraints.
Authorial/conceptual claim based on the proposed framework and study results; presented as a broader implication rather than a direct empirical finding.
Participants described AwareLLM's personalized interventions as timely and relevant, helping them boost their confidence and deepen engagement with their work.
Qualitative user feedback reported in the study (participant descriptions); sample size 20. No coding details or counts provided in the abstract.
AwareLLM reduced mental demand for participants.
Reported results from the user study (comparison to a standard LLM assistant) with 20 participants; abstract reports reductions but gives no quantitative metrics.
AwareLLM led to reductions in cognitive fatigue.
Reported results from the user study comparing AwareLLM to a standard LLM assistant; sample size 20. No quantitative values provided in the abstract.
Compared to a standard LLM assistant, AwareLLM produced statistically significant improvements in task performance.
Results reported from the user study (comparison vs. standard LLM); sample size noted as 20 participants. No numerical effect size provided in the abstract.
AwareLLM dynamically adapts to users' psychophysiological states while analyzing temporal patterns and behavioral tendencies to provide personalized and timely interventions.
Design and claimed operational behavior of the proposed framework as described by authors.
We introduce AwareLLM, a multimodal framework that integrates egocentric vision, pupillometry, eye-gaze tracking, posture detection, heart activity, and large language models to create a proactive and context-aware ecosystem.
System/methods description in paper (architecture/design claim).
Information workers' productivity is significantly influenced by their cognitive states and physiological responses.
Background statement in paper (literature-motivated claim); no study data provided within the abstract to support it.