Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
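
The four direction columns do not always add up to Total. A minimal sketch of that check follows, using four rows copied from the matrix. Treating a dash as zero, and reading the remainder as claims without a direction label, are both assumptions; the page does not explain the gap.

```python
# Rows copied from the matrix: (outcome, positive, negative, mixed, null, total).
# A dash in the table is entered as 0 here, which is an assumption.
ROWS = [
    ("Other", 758, 199, 100, 900, 2007),
    ("Governance & Regulation", 826, 400, 191, 122, 1563),
    ("Labor Share of Income", 17, 19, 17, 0, 53),
    ("Worker Turnover", 11, 12, 0, 3, 26),
]


def summarize(row):
    """Return the directional sum, the remainder against Total, and the positive share."""
    outcome, pos, neg, mixed, null, total = row
    directional = pos + neg + mixed + null
    return {
        "outcome": outcome,
        "directional_sum": directional,
        "remainder": total - directional,
        "positive_share": round(pos / total, 3),
    }


summaries = [summarize(r) for r in ROWS]
for s in summaries:
    print(s)
```

For the larger categories the remainder is nonzero (50 for Other, 24 for Governance & Regulation), so shares are computed here against Total rather than against the directional sum.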

Filter: Human-AI Collab

This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.

Interpretive result from the paper's qualitative thematic analysis of survey responses (n=860), labeled by the authors as the 'bounded delegation' pattern.

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built preferred boundary of automation / delegation

Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.

Reported constraints and desiderata from the thematic analysis of survey responses (n=860).

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built desired governance/security features for AI tools (authority scoping, provenance...

Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.

Thematic findings from the paper's human-in-the-loop, multi-model council-based analysis of survey responses (n=860).

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built requested placement/timing of quality signals in developer workflow

Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.

Qualitative analysis method described in the paper applied to the survey responses (n=860); result reported as identification of 22 desired AI systems organized into five categories.

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built catalog of desired AI systems and task categories

BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.

Design claim in abstract describing the benchmark's automated scoring system and rubric size (100+ criteria) defined by expert bankers.

high positive BankerToolBench: Evaluating AI Agents in End-to-End Investme... number of rubric criteria for automated evaluation

For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict

Explicit reproducibility statement and URL provided in the paper.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... data_and_code_availability

SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process?

Statement of research goals and scope in the paper introducing the SciPredict benchmark and accompanying evaluations.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... research_questions_addressed

Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment.

Reported stratified accuracy of human experts on SciPredict tasks by self-reported predictability judgments; accuracy rises from ≈5% (when judged not predictable) to ≈80% (when judged predictable).

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... calibration_of_human_confidence_vs_accuracy

We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.

Construction of the SciPredict benchmark described in the paper; explicitly reports 405 tasks and 33 sub-fields.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... benchmark_size_and_scope

The future of Nagpur's industrial belt depends not on resisting automation, but on an aggressive reskilling strategy to bridge the gap between current workforce capabilities and future technological requirements.

Normative policy conclusion in the paper recommending reskilling as the primary response; based on the paper's analysis of task changes and projected role shifts; no program evaluation or empirical evidence of reskilling effectiveness reported in the excerpt.

high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... need for reskilling / workforce skill acquisition

There is a projected surge in demand for 'AI-collaborative' roles such as machine maintenance, data supervision, and process optimization.

Projection in the paper based on analysis of task complementarities between humans and AI, listing specific roles expected to grow; no quantitative demand estimates or sample sizes provided in the excerpt.

high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... projected demand for AI-collaborative roles (machine maintenance, data supervisi...

A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.

Design/implementation claim in paper describing deployment approach using YAML configuration rather than engineering work.

high positive Governed Reasoning for Institutional AI deployment effort required to support a new institutional decision domain

We introduce governability — how reliably a system knows when it should not act autonomously — as a primary evaluation axis for institutional AI alongside accuracy.

Conceptual contribution/metric proposed by authors in paper; no empirical validation reported in the excerpt.

high positive Governed Reasoning for Institutional AI governability (system's ability to know when not to act autonomously)

Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.

Empirical benchmark reported in paper on the 11-case evaluation set; counts of silent errors given for Cognitive Core and baselines.

high positive Governed Reasoning for Institutional AI count of silent errors (incorrect determinations that executed without human-rev...

Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.

Empirical benchmark reported in paper on the 11-case evaluation set; accuracies explicitly stated for three systems.

high positive Governed Reasoning for Institutional AI accuracy on prior authorization appeal cases

We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.

Design/proposal described in paper (architectural specification); no empirical evaluation reported for the architecture itself in the excerpt.

high positive Governed Reasoning for Institutional AI system governability and auditability as properties of the decision substrate
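
To illustrate what a tamper-evident SHA-256 hash chain means in general, here is a minimal sketch using Python's standard `hashlib`. It is not the paper's implementation: the entry layout, the `GENESIS` value, and the function names are hypothetical, chosen only to show why editing an earlier entry invalidates the chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger, payload):
    """Append an entry whose hash commits to the payload and to the previous entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev": prev_hash, "hash": entry_hash(prev_hash, payload)}
    )


def verify(ledger):
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        expected = entry_hash(prev_hash, entry["payload"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append(ledger, {"step": "retrieve", "case": "A-1"})
append(ledger, {"step": "verify", "case": "A-1"})
print(verify(ledger))                 # chain is intact
ledger[0]["payload"]["case"] = "A-2"  # simulate tampering with an early entry
print(verify(ledger))                 # stored hash no longer matches the payload
```

Because each hash covers the previous one, changing any earlier entry requires recomputing every later hash, which is what makes silent edits detectable.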

Organisations should invest in customisation capabilities for AI recruitment tools, implement comprehensive change management strategies, and maintain robust post-hire evaluation procedures.

Authors' recommendations derived from thematic findings and participant perspectives across two firms (qualitative synthesis of n = 22 interviews).

high positive The augmented recruiter: examining AI integration and decisi... recommended_organisational_practices_for_AI_recruitment

AI functioned optimally as an augmentative technology rather than as a replacement for human decision-makers in recruitment.

Findings: participants across the two case firms described AI being most effective when augmenting human judgment rather than replacing it (interviews n = 22).

high positive The augmented recruiter: examining AI integration and decisi... role_of_AI (augmentation vs replacement)

AI significantly enhanced efficiency through process standardisation and automation.

Findings based on participant accounts in thematic analysis (interviews n = 22) describing process optimisation and automation benefits.

high positive The augmented recruiter: examining AI integration and decisi... efficiency (process standardisation and automation)

Participants in the treatment conditions showed greater positive belief change about the AI across the session.

Pre/post measures of participant beliefs collected during the field experiment (N=388) showing larger positive shifts among those assigned to treatment conditions versus controls.

high positive Scaffolding Human-AI Collaboration: A Field Experiment on Be... change in participant beliefs about AI (pre/post)

A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution.

Field experiment with 388 employees comparing cognitive scaffolding to other conditions; reported improvements concentrated at the top of the individual document-quality distribution.

high positive Scaffolding Human-AI Collaboration: A Field Experiment on Be... individual document quality (top of the distribution)

LLMs coordinate extremely well on similar actions.

Empirical observation from the experiment showing high coordination performance by LLMs when alignment on similar actions is the equilibrium; qualitative description in the abstract without reported quantitative metrics.

high positive Strategic Algorithmic Monoculture: Experimental Evidence from... coordination success when similar actions are favored

Like humans, [LLMs] regulate [action similarity] in response to coordination incentives (strategic monoculture).

Empirical claim based on experimental results comparing how humans and LLMs change similarity when incentives for coordination/divergence are manipulated. No numerical details in excerpt.

high positive Strategic Algorithmic Monoculture: Experimental Evidence from... change in action similarity in response to incentives

LLMs exhibit high levels of baseline similarity (primary monoculture).

Empirical observation from the experiment comparing baseline action similarity across LLM subjects (relative level described qualitatively in paper). Specific sample sizes and quantitative metrics not provided in the excerpt.

high positive Strategic Algorithmic Monoculture: Experimental Evidence from... action similarity (baseline)

We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects.

Methodological claim: authors report implementing an experiment that separates baseline similarity from strategic adjustments and applying it to human participants and LLM agents. No sample sizes or procedural details provided in the excerpt.

high positive Strategic Algorithmic Monoculture: Experimental Evidence from... experimental implementation (ability to separate primary vs strategic monocultur...

We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives.

Conceptual/theoretical distinction proposed in the paper (definition and taxonomy introduced by the authors). No empirical sample size reported for this conceptual claim in the provided text.

high positive Strategic Algorithmic Monoculture: Experimental Evidence from... definition/separation of two forms of algorithmic monoculture (primary vs strate...

In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small.

Experimental comparison within the preregistered studies of eight behavioural persuasion strategies versus the best attitudinal persuasion strategy; results reported in paper showing each behavioural strategy exceeded the attitudinal strategy and that variation among the eight behavioural strategies was small.

high positive Artificial intelligence can persuade people to take politica... behavioural persuasion effectiveness (various behavioural outcomes such as petit...

We replicated prior findings that information provision drove effects on attitudes.

Experimentally manipulating information provision within the preregistered studies and observing effects on attitudinal outcomes, consistent with prior literature (sample reported in paper).

high positive Artificial intelligence can persuade people to take politica... attitudinal change (attitudes)

We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing).

Experimental results from the two preregistered studies reported in the paper; example effect explicitly reported as +19.7 percentage points increase in petition signing. Overall sample reported as N=17,950 responses.

high positive Artificial intelligence can persuade people to take politica... petition signing (real petition signing behavior)

Organizations that strategically invest in blended, context-rich, and partnership-based development programs position themselves for sustainable competitive advantage in an increasingly automated marketplace.

Normative recommendation supported by the paper's synthesis of theory and practice (organizational development, adult learning, workforce development); no empirical effect sizes or sample-size-based evaluation provided.

high positive The Future of Education in an AI-Driven World: Preparing Org... positioning for sustainable competitive advantage (organizational performance ad...

Forward-thinking organizations are redesigning learning architectures to cultivate irreplaceable human capabilities that complement rather than compete with AI systems.

Synthesis of literature from organizational psychology, adult learning theory, and workforce development practice cited in the paper; presented as descriptive statement about current organizational practice rather than based on a reported empirical study with sample size.

high positive The Future of Education in an AI-Driven World: Preparing Org... redesign of learning architectures to cultivate human capabilities (critical thi...

Corporate and academic learning ecosystems will necessarily converge.

Conceptual synthesis and argumentation in the paper referencing workforce development practice and organizational development research; no quantitative measures or sample size reported.

high positive The Future of Education in an AI-Driven World: Preparing Org... convergence/integration between corporate and academic learning ecosystems

Human skills (critical thinking, adaptive decision-making, interpersonal acumen) will be elevated to core competency status as AI automates technical tasks once considered core competencies.

Argument and synthesis presented in the paper drawing on organizational psychology, adult learning theory, and workforce development practice; no empirical sample size or statistical tests reported (conceptual/literature-based claim).

high positive The Future of Education in an AI-Driven World: Preparing Org... elevation of human skills to core competencies (critical thinking, adaptive deci...

A machine-learning research agenda is needed centered on team-level evaluation, privacy-preserving memory layers, scaffolded AI for learning, carbon-aware routing, and pro-agency workflow design.

Prescriptive recommendation in the position paper proposing specific research priorities; no empirical evaluation of these approaches is presented within the paper itself.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... prioritized ML research directions and interventions (team-level evaluation, pri...

Rather than eliminating the office, this shift supports selective co-presence, reserving in-person time for tasks with high tacitness, high coupling, or high relational stakes (including apprenticeship, conflict repair, trust formation, and early-stage synthesis).

Theoretical/qualitative argument about task types best suited for in-person interaction; illustrated by examples (apprenticeship, conflict repair, trust formation, early-stage synthesis); no empirical task-level allocation study presented.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... allocation of in-person vs. remote time for specific task types

Capabilities that are already widely deployed—transcription, summarization, retrieval, translation, drafting, and code assistance—are the basis for this shift (with bounded agents as an amplifying but not necessary extension).

Descriptive claim citing the prevalence of specific AI capabilities in current deployments; presented as observation in the position paper rather than as a quantified adoption study.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... deployment/adoption of specific AI capabilities (transcription, summarization, r...

The organizational significance of these systems is not generic automation but the accumulation of artifact capital: durable, queryable, reusable traces such as transcripts, summaries, decisions, tickets, code comments, and retrieval layers.

Argumentative claim in the paper describing a conceptual mechanism ('artifact capital') by which foundation-model features create reusable organizational artifacts; no empirical measurement of artifact capital provided.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... accumulation and reuse of organizational knowledge artifacts ('artifact capital'...

The foundation-model stack (NL interaction, multimodal capture, long context, retrieval, transcription, translation, bounded tool use) changes the coordination economics that previously favored daily in-person co-presence.

Conceptual claim supported by descriptions of foundation-model capabilities and their potential to create durable, queryable artifacts; no empirical test or measurement of coordination costs reported.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... coordination economics (costs/benefits of co-presence vs. remote work)

Remote-capable knowledge work should default to AI-enabled flexibility because the workflow-integrated foundation-model stack changes the coordination economics that once favored daily co-presence.

Normative argument in the position paper based on conceptual analysis of coordination economics and the claimed effects of foundation-model features; no empirical sample or quantitative study reported.

high positive Remote-Capable Knowledge Work Should Default to AI-Enabled F... defaulting remote-capable knowledge work to AI-enabled flexible arrangements (i....

Preliminary corroboration is provided by a companion production automation system with eleven operating lanes and 2,132 classified tickets.

Reported companion system operational statistics in the paper (11 lanes, 2,132 tickets).

high positive Context Engineering: A Practitioner Methodology for Structur... companion system scale and classified tickets

When iteration was permitted, the final success rate for the structured interactions reached 91.5% (183 of 200).

Reported final success counts/rate in the paper for structured interactions (183 of 200).

high positive Context Engineering: A Practitioner Methodology for Structur... final success rate after iteration

Among structured interactions, 110 of 200 were accepted on first pass.

Reported counts in the paper for the structured-interaction group (110 accepted of 200 structured interactions).

high positive Context Engineering: A Practitioner Methodology for Structur... first-pass acceptances (count and rate)

Structured context assembly was associated with an improvement in first-pass acceptance from 32% to 55%.

Observational comparison reported in the paper (baseline vs. structured first-pass acceptance rates are given as 32% and 55%).

high positive Context Engineering: A Practitioner Methodology for Structur... first-pass acceptance rate

Structured context assembly was associated with a reduction from 3.8 to 2.0 average iteration cycles per task.

Observational comparison reported in the paper (structured vs. baseline interactions); the paper states the 3.8 to 2.0 cycle figures.

high positive Context Engineering: A Practitioner Methodology for Structur... average iteration cycles per task
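
The Context Engineering figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers quoted in the entries (110 and 183 of 200, a 32% baseline, and 3.8 versus 2.0 cycles); the variable names are illustrative.

```python
def rate(successes, total):
    """Share of interactions that succeeded, as a fraction between 0 and 1."""
    return successes / total


first_pass_rate = rate(110, 200)   # structured interactions accepted on first pass
final_rate = rate(183, 200)        # structured interactions accepted after iteration
baseline_first_pass = 0.32         # baseline first-pass acceptance quoted above

# Absolute gain in percentage points and gain relative to the baseline.
point_gain = (first_pass_rate - baseline_first_pass) * 100
relative_gain = (first_pass_rate - baseline_first_pass) / baseline_first_pass

# Relative reduction in average iteration cycles per task (3.8 down to 2.0).
cycle_reduction = (3.8 - 2.0) / 3.8

print(f"first pass: {first_pass_rate:.1%}, final: {final_rate:.1%}")
print(f"gain: {point_gain:.0f} points ({relative_gain:.1%} relative)")
print(f"cycle reduction: {cycle_reduction:.1%}")
```

The counts reproduce the stated 55% and 91.5% rates. The first-pass gain is 23 percentage points, roughly 72% relative to the baseline, and the cycle count falls by roughly 47%. The entries describe the comparison as observational, so these figures express an association rather than a measured causal effect.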

The paper applies formal models from reliability engineering and information theory as post hoc interpretive lenses on context quality.

Paper text claiming the application of these formal models for interpretation.

high positive Context Engineering: A Practitioner Methodology for Structur... use of formal theoretical models

Context Engineering applies a staged four-phase pipeline (Reviewer to Design to Builder to Auditor).

Methodological description in the paper listing the four pipeline phases.

high positive Context Engineering: A Practitioner Methodology for Structur... pipeline/phases defined

Context Engineering defines a five-role context package structure (Authority, Exemplar, Constraint, Rubric, Metadata).

Explicit specification in the paper of the five-role package components.

high positive Context Engineering: A Practitioner Methodology for Structur... structure/components of context package
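
One way to picture the five-role package is as a small record type. The sketch below is an assumption-laden illustration, not the paper's schema: only the five role names come from the entry above, while the field types, the `missing_roles` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ContextPackage:
    """Hypothetical container with one field per role named in the entry above."""

    authority: str   # assumed: the governing source or spec the tool should follow
    exemplar: str    # assumed: an example of the desired output
    constraint: str  # assumed: limits the output must respect
    rubric: str      # assumed: criteria used to judge the output
    metadata: dict = field(default_factory=dict)  # assumed: free-form bookkeeping

    def missing_roles(self):
        """Names of roles left empty, so a caller can spot an incomplete package."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


pkg = ContextPackage(
    authority="style guide v2",
    exemplar="",
    constraint="no external network calls",
    rubric="all unit tests pass",
)
print(pkg.missing_roles())
```

Declaring each role as a named field makes an incomplete package visible before the prompt is sent, which matches the entry's emphasis on declaring the informational payload explicitly.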

This paper introduces Context Engineering, a structured methodology for assembling, declaring, and sequencing the complete informational payload that accompanies a prompt to an AI tool.

Methodological description in the paper (definition and presentation of the Context Engineering approach).

high positive Context Engineering: A Practitioner Methodology for Structur... existence/definition of a structured prompting methodology

The review integrates fragmented literature into a cohesive framework and offers implications for managers and policymakers to pursue more balanced, inclusive, and context-sensitive AI adoption strategies.

Author-stated contribution of the review based on synthesis of the 40 included studies; normative recommendations derived from the review.

high positive Generative AI in the Workplace: A Systematic Review of Produ... guidance for managerial and policy decision-making regarding AI adoption

Generative AI adoption is associated with mixed employee perceptions: some studies report increased efficiency and higher job satisfaction.

Aggregate finding from included studies in the review that report positive employee-reported outcomes (efficiency, satisfaction).

high positive Generative AI in the Workplace: A Systematic Review of Produ... reported efficiency gains and job satisfaction
