Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
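
The counts above can be turned into per-direction shares. The following is a minimal sketch, not part of the source page: the four rows are copied from the matrix, and the function name `direction_shares` is hypothetical. In several rows the Total column is larger than the sum of the four direction columns (for Other, 758 + 199 + 100 + 900 = 1957 against a listed 2007), so the sketch divides by the four listed directions rather than by Total and reports the gap separately. The page does not explain the gap.

```python
# Rows copied from the matrix above:
# (outcome, positive, negative, mixed, null, total)
ROWS = [
    ("Other", 758, 199, 100, 900, 2007),
    ("Governance & Regulation", 826, 400, 191, 122, 1563),
    ("AI Safety & Ethics", 218, 277, 65, 33, 599),
    ("Job Displacement", 12, 80, 20, 1, 113),
]


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the claims listed in the four columns."""
    listed = positive + negative + mixed + null
    return {
        "positive": positive / listed,
        "negative": negative / listed,
        "mixed": mixed / listed,
        "null": null / listed,
    }


def total_gap(positive, negative, mixed, null, total):
    """Claims counted in Total but not in any of the four direction columns."""
    return total - (positive + negative + mixed + null)


for outcome, pos, neg, mixed, null, total in ROWS:
    shares = direction_shares(pos, neg, mixed, null)
    gap = total_gap(pos, neg, mixed, null, total)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, gap to Total = {gap}"
    )
```

Dividing by the listed directions keeps the shares summing to one even where the Total column includes claims that the four columns do not account for.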

Filter: Human-AI Collaboration

The spectrum focuses attention on where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows.

Conceptual argument in the paper describing the axes/criteria of the spectrum (theoretical/thematic analysis; no empirical data reported).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... allocation of leadership activities (framing, redirecting, accountability) in hu...

This paper offers a leadership-facing spectrum to see human–AI decision relationships with five positions: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI.

Conceptual presentation in the paper: a theorized five-position spectrum (no empirical sample or experiment reported).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... presence of a conceptual spectrum for classifying human–AI decision configuratio...

The paper formalizes these limitations, addresses four alternative views, and proposes a co-existence solution plus a call to action for system builders, benchmark designers, and the memory community.

Meta-claim about the paper's content: formalization, rebuttals, and recommendations stated in the abstract; no empirical sample reported in abstract.

high positive Contextual Agentic Memory is a Memo, Not True Memory proposed research and design agenda (co-existence of lookup and weight-based mem...

Complementary Learning Systems (CLS) theory shows biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation.

Appeal to established neuroscience theory (CLS); the paper draws on CLS literature to justify the two-system solution in biology; no new empirical sample reported in abstract.

high positive Contextual Agentic Memory is a Memo, Not True Memory memory architecture in biological intelligence (hippocampus + neocortex)

AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall.

Policy/design recommendation based on the paper's analyses of 27K annotated transcripts showing links between user fluency, engagement patterns, failure visibility, recovery, and success.

high positive A paradox of AI fluency product design recommendation (encouraging deep engagement)

Individuals should adopt a stance of active engagement rather than passive acceptance.

Interpretive recommendation derived from observed differences in outcomes by user fluency in the 27K annotated transcript analysis (paper’s discussion/recommendation section).

high positive A paradox of AI fluency recommended user behavior (active engagement)

Fluent users' failures are more likely to lead to partial recovery.

Analysis of conversation trajectories in the 27K annotated transcripts showing higher incidence of partial recovery (follow-up iterations leading to partial fix) after failures by fluent users.

high positive A paradox of AI fluency partial recovery rate after failures

Fluent users' failures tend to be visible (a direct consequence of their engagement).

Annotations of failure visibility within the 27K transcripts, comparing frequency of visible vs. invisible failures across fluency levels.

high positive A paradox of AI fluency visibility of failures (visible vs. invisible failures)

Fluent users take on more complex tasks than novices.

Observational analysis of a richly annotated sample of 27,000 transcripts drawn from the WildChat-4.8M dataset; transcripts were annotated for user fluency and task characteristics (as reported in the paper).

high positive A paradox of AI fluency task complexity

Organizations should cultivate a culture of critical engagement with AI outputs, and e-leadership development must focus on building competencies in mediating, filtering and legitimizing AI contributions within digital workflows.

Recommendations based on thematic analysis of interview data across 34 project managers; presented as implications rather than empirically tested interventions.

high positive E-leadership and human-AI collaboration: socio-technical ali... organizational practices / e-leadership competencies (intended to improve team/o...

To achieve balanced augmentation, leaders must proactively frame AI's role, embedding validation checkpoints and human authorship clauses to maintain accountability.

Prescriptive recommendation derived from thematic findings and cross-case patterns in the 34 interviews; no experimental or longitudinal testing reported.

high positive E-leadership and human-AI collaboration: socio-technical ali... accountability / balanced augmentation (implied improvement in team effectivenes...

Proactive engagement combined with creation-oriented use generated the highest effectiveness.

Qualitative coding and cross-case comparisons in the thematic analysis of 34 interviews identified combinations of proactive e-leadership and creation-oriented AI use associated with reported high team effectiveness.

high positive E-leadership and human-AI collaboration: socio-technical ali... perceived team effectiveness

The trajectory of the curvilinear relationship is governed by e-leadership practices.

Interview data analyzed thematically showing recurring references to leadership practices as moderators of AI-use effectiveness across the 34 interviews.

high positive E-leadership and human-AI collaboration: socio-technical ali... perceived team effectiveness (as moderated by e-leadership)

Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.

Paper contribution section — authors present design recommendations derived from study findings (not an empirical claim about an evaluated intervention).

high positive Upskilling with Generative AI: Practices and Challenges for ... design guidance intended to improve generative AI learning tool suitability/effe...

Freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition.

Reported finding from the paper's mixed-methods study (survey + semi-structured interviews with freelance knowledge workers).

high positive Upskilling with Generative AI: Practices and Challenges for ... use of generative AI tools for structuring learning and exploratory skill acquis...

Models across all three families acquire interpretable mechanical reasoning strategies without fine-tuning.

Observation reported for the three open-source models used in experiments (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B) showing emergent, interpretable mechanical reasoning during the iterative design process without any model fine-tuning.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... acquisition of interpretable mechanical reasoning strategies

The system correctly diagnoses underconstraint failure modes 35.6% of the time.

Reported diagnostic accuracy for underconstraint failure mode in the experimental results (35.6%).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... accuracy in diagnosing underconstraint failure mode

The system correctly diagnoses overconstraint failure modes 56.3% of the time.

Reported diagnostic accuracy for overconstraint failure mode in the experimental results (56.3%).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... accuracy in diagnosing overconstraint failure mode

78.6% of iterative refinement trajectories show measurable improvement.

Reported aggregate statistic from the experimental evaluation of iterative refinement trajectories (share of trajectories showing measurable improvement).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... presence of measurable improvement across iterative refinement trajectories

The modular architecture improves structural validity by up to 134% over monolithic baselines.

Empirical results reported across six motion targets and three models comparing modular architecture to monolithic baselines; the paper reports an improvement in structural validity up to 134%.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... structural validity of linkage designs

The modular architecture reduces geometric error by up to 68% over monolithic baselines.

Empirical results reported across six engineering-relevant motion targets and three open-source models comparing the modular architecture to monolithic baselines; the paper states a maximum reduction of geometric error of 68%.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... geometric error

Language models can systematically improve linkage designs through symbolic representations.

Reported experiments using a modular architecture combining language-model agents and numerical optimisers across six engineering-relevant motion targets and three open-source models (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B); comparisons reported versus monolithic baselines.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... quality of linkage designs (geometric error, structural validity)

The proposed framework emerged from operational work to improve clinician capability in a live value-based care deployment.

Stated as originating from operational experience in a live deployment; no details on deployment scale, sample size, or outcomes provided in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... improvement of clinician capability through operational application of the frame...

Training environments that combine longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics.

Normative/theoretical argument presented in the paper; no empirical tests or sample sizes reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... alignment of learned reward model to patient trajectory versus encounter-level i...

Chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties for learning: longitudinal density, concentrated decision space, outcome labels, and natural capability variation.

Argument/claim in the paper that outcome-based contracts and chronic disease management produce favorable data characteristics; asserted as part of the framework motivation. No quantitative empirical evidence or sample sizes provided in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... suitability of collected override data for training outcome-aligned reward model...

We propose a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, which prevents a failure mode we term 'suppression bias'—the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold.

Proposed algorithmic contribution and theoretical claim; suppression bias defined and a mitigation approach described. No empirical evaluation or sample sizes given in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... reduction or prevention of suppression bias in learned recommendations

We formulate preferences conditioned on patient state s, organizational context c, and clinician capability κ, where κ decomposes into execution capability (κ-exec) and alignment capability (κ-align).

Presented as a formal model formulation in the paper; theoretical description without empirical sample sizes in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... representational fidelity of preference model to contextual factors (patient, or...
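
One way such a conditioned preference could be written down is sketched below. This is an assumption for illustration, not the paper's stated model: it borrows the Bradley-Terry form commonly used in RLHF, and the symbols $a$ (the AI recommendation), $a'$ (the clinician's override), $r_\theta$ (a learned reward), and $\sigma$ (the logistic function) are introduced here. Only $s$, $c$, $\kappa$, and the split of $\kappa$ into execution and alignment capability come from the claim above.

```latex
\kappa = \left(\kappa_{\text{exec}},\ \kappa_{\text{align}}\right),
\qquad
P\left(a' \succ a \mid s, c, \kappa\right)
  = \sigma\!\left( r_\theta(a'; s, c, \kappa) - r_\theta(a; s, c, \kappa) \right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Under this reading, an observed override is a sample in which the clinician preferred $a'$ to $a$, and the conditioning on $\kappa$ is what would let a learner separate disagreement with a recommendation from inability to carry it out.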

We introduce a five-category override taxonomy that maps override types to distinct model update targets.

Stated as a formal contribution of the framework; taxonomy proposed in the paper. No empirical validation or sample size reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... categorization of clinician overrides to inform model updates

Clinician overrides of clinical AI recommendations can be reframed as implicit preference data analogous to reinforcement learning from human feedback (RLHF), but richer because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable.

Conceptual argument presented in the paper drawing an analogy to RLHF; no empirical metrics or sample size reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... quality of preference signal available for learning reward models from clinician...

Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

Authors' conclusion/argument based on the methods and preliminary experimental results presented in the paper (interpretive claim rather than a quantified empirical result).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... suitability as a substrate for agent self-improvement and agentic RL

Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.

Argumentative/theoretical scalability claim based on the abundance of personas and the scalable design of the methodology (no empirical demonstration at millions/billions scale reported).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... scalability potential (number of synthetic user worlds producible)

Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.

Reported runtime and turn-count metrics from the preliminary experiments (per-run runtime >8 hours; per-run average >2,000 turns).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... agent runtime per simulation run; number of turns per run

In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.

Reported preliminary experiment count in the paper (explicit statement: 1,000 synthetic computers were created and simulated).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... number of synthetic computers created and simulated

Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.

Description of the two-agent simulation procedure in the paper (simulation design: objective-creating agent and user-acting agent executing tasks across the synthetic computer).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... ability to simulate long-horizon, user-conditioned productivity workflows
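
The two-agent procedure described above can be sketched as a simple loop. This is a hypothetical illustration, not the paper's implementation: `run_simulation`, `stub_planner`, `stub_user`, the dictionary layout of a synthetic computer, and the `max_turns` safety cap are all assumptions, and the stub agents stand in for the language-model agents the paper uses.

```python
from typing import Callable, Dict, List, Optional


def run_simulation(
    computer: Dict,
    propose_objectives: Callable[[Dict], List[str]],
    act_as_user: Callable[[Dict, List[str]], Optional[str]],
    max_turns: int = 5000,  # assumed safety cap, not from the paper
) -> Dict:
    """One agent sets objectives; another works turn by turn until they are done."""
    objectives = propose_objectives(computer)
    pending = list(objectives)
    turns = 0
    while pending and turns < max_turns:
        # The acting agent returns the objective it finished this turn, or None.
        finished = act_as_user(computer, pending)
        turns += 1
        if finished is not None:
            pending.remove(finished)
    return {"objectives": objectives, "turns": turns, "completed": not pending}


def stub_planner(computer: Dict) -> List[str]:
    """Stand-in for the objective-creating agent (hypothetical)."""
    user = computer["user"]
    return [f"quarterly report for {user}", f"budget spreadsheet for {user}"]


def stub_user(computer: Dict, pending: List[str]) -> Optional[str]:
    """Stand-in for the user-acting agent: finishes one objective per turn."""
    objective = pending[0]
    computer["files"].append(objective + ".docx")  # save a deliverable
    return objective


result = run_simulation({"user": "analyst", "files": []}, stub_planner, stub_user)
print(result["turns"], result["completed"])
```

The stub finishes one objective per turn, so the example ends after two turns; the reported runs average more than 2,000 turns because the real acting agent needs many turns per deliverable.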

We introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations).

Methodological description and implementation presented in the paper (design and procedures for generating synthetic computers and artifact types).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... creation of synthetic computer environments with realistic folder hierarchies an...

This work offers a principled foundation for autonomous AI agents that govern themselves the way humans do: not because rules are imposed upon them, but because deliberation is embedded in how they think.

Concluding claim summarizing the proposed framework's conceptual contribution (theoretical/architectural claim; not an empirical measurement).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... internalized deliberative governance in autonomous agents

Implemented on a production-grade retail supply chain workflow, the framework produces zero false escalations to human oversight.

Empirical implementation on a production-grade retail supply chain workflow reported in the paper (claim stated without sample size or measurement protocol in the abstract).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... false escalations to human oversight

Implemented on a production-grade retail supply chain workflow, the framework achieves 95% compliance accuracy.

Empirical implementation on a production-grade retail supply chain workflow reported in the paper (no sample size or evaluation details provided in the abstract).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... compliance accuracy

We formalize a Pre-Action Governance Reasoning Loop (PAGRL) in which agents consult a four-layer governance rule set (global, workflow-specific, agent-specific, and situational) before every consequential action.

Methodological contribution described in the paper (formalization of a governance loop and four-layer rule hierarchy; no numerical sample given in the abstract).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... use of a four-layer rule consultation prior to consequential actions
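
A pre-action check over four rule layers might look like the sketch below. It is an illustration under stated assumptions, not the paper's PAGRL: the names `Verdict`, `Rule`, and `pre_action_review`, the example retail rules, and the policy that the most restrictive verdict wins are all hypothetical. The three verdicts mirror the outcomes the paper attributes to human pre-action deliberation (permissible, requires modification, demands escalation).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Verdict(IntEnum):
    PERMIT = 0    # action may proceed as planned
    MODIFY = 1    # action must be changed before it runs
    ESCALATE = 2  # action is handed to human oversight


# Layer names follow the claim above; the consultation order is an assumption.
LAYERS = ("global", "workflow", "agent", "situational")


@dataclass
class Rule:
    layer: str
    check: Callable[[Dict], Verdict]


def pre_action_review(action: Dict, rules: List[Rule]) -> Verdict:
    """Consult every layer in order and keep the most restrictive verdict."""
    verdict = Verdict.PERMIT
    for layer in LAYERS:
        for rule in rules:
            if rule.layer == layer:
                verdict = max(verdict, rule.check(action))
    return verdict


# Hypothetical rules for a retail supply chain workflow.
RULES = [
    Rule("global", lambda a: Verdict.ESCALATE if a["value"] > 10_000 else Verdict.PERMIT),
    Rule("workflow", lambda a: Verdict.MODIFY if a["quantity"] > 500 else Verdict.PERMIT),
    Rule("agent", lambda a: Verdict.PERMIT if a["kind"] == "reorder" else Verdict.ESCALATE),
    Rule("situational", lambda a: Verdict.MODIFY if a.get("holiday_freeze") else Verdict.PERMIT),
]

order = {"kind": "reorder", "value": 900, "quantity": 40}
print(pre_action_review(order, RULES).name)  # prints PERMIT
```

Taking the maximum over an ordered enum means one escalating rule in any layer overrides permits from the others; a real system could instead let a narrower layer override a broader one.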

We propose a neurocognitive governance framework that formally maps this human self-governance process to LLM-driven agent reasoning, establishing a structural parallel between the human brain and the large language model as the cognitive core of an agent.

Theoretical framework and formal mapping presented in the paper (design/proposal rather than empirical validation).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... alignment of agent reasoning structure with human self-governance (conceptual ma...

Before acting, humans engage deliberate cognitive processes grounded in executive function, inhibitory control, and internalized organizational rules to evaluate whether an intended action is permissible, requires modification, or demands escalation.

The paper's framing draws on cognitive/neurocognitive literature about human self-governance (presented as background/theoretical justification; no new empirical human-subject data reported in the abstract).

high positive Think Before You Act -- A Neurocognitive Governance Model fo... human pre-action deliberative cognitive processes (executive function, inhibitor...

The authors conclude that these findings have implications for responsible and perceptible genAI use in hiring contexts.

Authors' conclusions/recommendations based on the interview findings and analysis.

high positive Resume-ing Control: (Mis)Perceptions of Agency Around GenAI ... need for responsible/perceptible genAI adoption practices

Participants reported only marginal efficiency gains from genAI despite a seemingly seismic shift in how recruiting happens.

Self-reports from 22 interviewed recruiting professionals indicating small/marginal efficiency improvements.

high positive Resume-ing Control: (Mis)Perceptions of Agency Around GenAI ... efficiency gains / task completion efficiency

Individual recruiters also felt compelled to adopt genAI because of the personal need to boost productivity.

Qualitative interview responses (n=22) reporting individual-level productivity motivations for using genAI.

high positive Resume-ing Control: (Mis)Perceptions of Agency Around GenAI ... self-reported motivation to adopt for productivity gains

Recruiters often felt compelled to adopt genAI to combat applicant use of AI.

Interview data from 22 recruiting professionals reporting adoption motivations tied to applicants' AI use.

high positive Resume-ing Control: (Mis)Perceptions of Agency Around GenAI ... motivation for adoption related to applicant behavior

When generative AI (genAI) systems are used in high-stakes decision-making, their recommended role is to aid, rather than replace, human decision-making.

Normative statement presented in the paper (literature/theoretical recommendation), no empirical data reported to support this recommendation within the study.

high positive Resume-ing Control: (Mis)Perceptions of Agency Around GenAI ... recommended role of genAI in decision-making (augmentation vs. replacement)

AI-mediated expert networks are an emerging phenomenon that existing coordination theories fail to account for.

Mentioned as an example in the abstract to motivate a theoretical gap; no empirical data or sample provided.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... performance or coordination of expert networks mediated by AI

GitHub Copilot exhibits 'recursive value creation' as an example of an emerging organizational phenomenon enabled by GenAI.

Illustrative example named in the abstract; no empirical measurement or sample reported within the abstract.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... developer productivity and value creation dynamics (implied)

UCF provides a theoretical foundation for understanding organizational coordination when GenAI transforms cognitive constraints from scarce to abundant resources.

Position paper asserts UCF as foundational theory for coordination under transformed cognitive constraints; conceptual argument only.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... ability to explain coordination under changed cognitive constraint regimes

Three emergent organizational forms illustrate UCF principles: cognitive meshworks (coordinated through competence synthesis), algorithmic ecosystems (achieving emergent optimization), and hybrid intelligence collectives (operating through cognitive complementarity).

Conceptual typology and illustrative examples in the position paper; no reported empirical measurement or sample.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... emergence of new organizational forms under GenAI
