Evidence (2954 claims)
- Adoption: 5126 claims
- Productivity: 4409 claims
- Governance: 4049 claims
- Human-AI Collaboration: 2954 claims
- Labor Markets: 2432 claims
- Org Design: 2273 claims
- Innovation: 2215 claims
- Skills & Training: 1902 claims
- Inequality: 1286 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 369 | 105 | 58 | 432 | 972 |
| Governance & Regulation | 365 | 171 | 113 | 54 | 713 |
| Research Productivity | 229 | 95 | 33 | 294 | 655 |
| Organizational Efficiency | 354 | 82 | 58 | 34 | 531 |
| Technology Adoption Rate | 277 | 115 | 63 | 27 | 486 |
| Firm Productivity | 273 | 33 | 68 | 10 | 389 |
| AI Safety & Ethics | 112 | 177 | 43 | 24 | 358 |
| Output Quality | 228 | 61 | 23 | 25 | 337 |
| Market Structure | 105 | 118 | 81 | 14 | 323 |
| Decision Quality | 154 | 68 | 33 | 17 | 275 |
| Employment Level | 68 | 32 | 74 | 8 | 184 |
| Fiscal & Macroeconomic | 74 | 52 | 32 | 21 | 183 |
| Skill Acquisition | 85 | 31 | 38 | 9 | 163 |
| Firm Revenue | 96 | 30 | 22 | — | 148 |
| Innovation Output | 100 | 11 | 20 | 11 | 143 |
| Consumer Welfare | 66 | 29 | 35 | 7 | 137 |
| Regulatory Compliance | 51 | 61 | 13 | 3 | 128 |
| Inequality Measures | 24 | 66 | 31 | 4 | 125 |
| Task Allocation | 64 | 6 | 28 | 6 | 104 |
| Error Rate | 42 | 47 | 6 | — | 95 |
| Training Effectiveness | 55 | 12 | 10 | 16 | 93 |
| Worker Satisfaction | 42 | 32 | 11 | 6 | 91 |
| Task Completion Time | 71 | 5 | 3 | 1 | 80 |
| Wages & Compensation | 38 | 13 | 19 | 4 | 74 |
| Team Performance | 41 | 8 | 15 | 7 | 72 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 17 | 15 | 9 | 5 | 46 |
| Job Displacement | 5 | 28 | 12 | — | 45 |
| Social Protection | 18 | 8 | 6 | 1 | 33 |
| Developer Productivity | 25 | 1 | 2 | 1 | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Skill Obsolescence | 3 | 18 | 2 | — | 23 |
| Labor Share of Income | 7 | 4 | 9 | — | 20 |
Human-AI Collaboration
Experimental design: subjects played an indefinitely repeated Prisoner’s Dilemma in supergames, with two between-subjects treatments varying chat timing (chat only before the first round of each supergame vs. chat before every round); the AI partner was GPT-5.2.
Methods description of the lab experiment reported in the paper: indefinitely repeated PD in supergames, two chat-frequency between-subjects treatments, AI implemented as GPT-5.2; human–AI sample n = 126.
Allowing repeated pre-play communication (chat before every round) has no detectable effect on cooperation rates when the partner is an AI.
Between-subjects manipulation within the human–AI experiment comparing the chat-before-first-round and chat-before-every-round treatments (human–AI n = 126 total); a statistical comparison of cooperation rates across the two chat-frequency treatments showed no detectable difference.
Initial cooperation rates against the AI (GPT-5.2) are high and comparable to initial cooperation in human–human pairs.
Laboratory experiment with human subjects playing an indefinitely repeated Prisoner’s Dilemma against an AI chatbot (GPT-5.2); human–AI sample n = 126; human–human benchmark taken from Dvorak & Fehrler (2024) with n = 108; comparison of initial-round / early-round cooperation rates across conditions.
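For illustration, a minimal sketch of the kind of treatment comparison described above, framed as a two-proportion z-test on cooperation rates, is shown below; the counts are hypothetical placeholders, not figures from the paper.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical cooperation counts (NOT the paper's data): cooperators and subjects per treatment.
coop = np.array([40, 43])   # chat-before-first-round vs. chat-before-every-round
n = np.array([63, 63])      # 126 human-AI subjects split evenly (an assumption)

rates = coop / n                              # cooperation rate per treatment
p_pool = coop.sum() / n.sum()                 # pooled rate under the null of equal rates
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (rates[0] - rates[1]) / se
p_value = 2 * norm.sf(abs(z))                 # two-sided p-value

print(f"rates = {rates}, z = {z:.2f}, p = {p_value:.3f}")
```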
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
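Two of those proposed metrics lend themselves to a concrete formulation; the sketch below computes them from hypothetical run logs (field names and log structure are illustrative assumptions, not the paper's API).

```python
def autonomy_fraction(decisions):
    """Fraction of learning decisions made internally rather than externally specified."""
    return sum(1 for d in decisions if d["made_internally"]) / len(decisions)

def lifelong_retention(scores_before, scores_after):
    """Mean fraction of earlier-task performance retained after learning later tasks."""
    return sum(a / b for a, b in zip(scores_after, scores_before)) / len(scores_before)

# Placeholder data, for illustration only
decisions = [{"made_internally": True}, {"made_internally": False}, {"made_internally": True}]
print(autonomy_fraction(decisions))                   # ≈ 0.67
print(lifelong_retention([0.80, 0.90], [0.72, 0.88])) # ≈ 0.94
```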
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation is left to the experiments the authors suggest.
Explicit statement in the paper about nature of contribution.
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLMs.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
A data-driven taxonomy was derived that maps common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
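A minimal sketch of such a composite reward is given below, assuming each component has already been scored in [0, 1]; the component names mirror the description above, but the weights and scoring functions are hypothetical, not the paper's implementation.

```python
def composite_reward(component_scores, weights=None):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    w = weights or {
        "structure": 0.20,     # structural validation
        "render": 0.20,        # render quality assessment
        "aesthetic": 0.20,     # LLM-based aesthetic scoring
        "content": 0.30,       # factuality, coverage, coherence
        "inverse_spec": 0.10,  # inverse-specification reward
    }
    return sum(w[k] * component_scores[k] for k in w)

example = {"structure": 1.0, "render": 0.8, "aesthetic": 0.7, "content": 0.9, "inverse_spec": 0.6}
print(composite_reward(example))  # ≈ 0.83
```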
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
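As a rough sketch of what a parameter-efficient setup at that scale can look like, the snippet below attaches LoRA adapters via the `peft` library and reports the trainable fraction; LoRA and the specific target modules are assumptions here (the claim states only that about 0.5% of Qwen2.5-Coder-7B's parameters were trained with GRPO), and the GRPO update itself is omitted.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Adapter-based setup (LoRA is an assumed mechanism, not confirmed by the paper).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # target is on the order of 0.5%
```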
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
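The sketch below illustrates how the three metric families named above can be computed from per-episode logs; field names and the example numbers are assumptions, not RetailBench's actual schema.

```python
import numpy as np

def operational_stability(profits, catastrophic_flags):
    """Lower profit variance and fewer catastrophic failures indicate greater stability."""
    return {"profit_variance": float(np.var(profits)),
            "catastrophic_failure_rate": float(np.mean(catastrophic_flags))}

def efficiency(profits, units_fulfilled, units_demanded):
    return {"mean_profit": float(np.mean(profits)),
            "fulfillment_rate": float(np.sum(units_fulfilled) / np.sum(units_demanded))}

def degradation(scores_by_difficulty):
    """Relative score drop from the easiest to the hardest environment tier."""
    return 1.0 - scores_by_difficulty[-1] / scores_by_difficulty[0]

print(degradation([0.82, 0.74, 0.61]))  # ≈ 0.26 drop across tiers
```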
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
RetailBench environments are progressively more challenging in order to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
BenchPreS defines two complementary metrics—Misapplication Rate (MR) and Appropriate Application Rate (AAR)—to quantify over‑application and correct personalization, respectively.
Methodological contribution described in the paper: explicit definitions of MR as fraction of inappropriate applications and AAR as fraction of appropriate applications, used to score model behavior.
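Read plainly, the two rates can be computed as below; the denominators (cases where personalization would, or would not, be appropriate) are my assumption about the normalization, since the paper's exact definitions are not quoted here.

```python
def misapplication_rate(records):
    """MR: fraction of inappropriate cases in which personalization was nevertheless applied."""
    inappropriate = [r for r in records if not r["appropriate"]]
    return sum(r["applied"] for r in inappropriate) / len(inappropriate)

def appropriate_application_rate(records):
    """AAR: fraction of appropriate cases in which personalization was indeed applied."""
    appropriate = [r for r in records if r["appropriate"]]
    return sum(r["applied"] for r in appropriate) / len(appropriate)

records = [  # toy examples, not benchmark data
    {"applied": True,  "appropriate": True},
    {"applied": True,  "appropriate": False},   # over-application
    {"applied": False, "appropriate": True},    # missed personalization
    {"applied": False, "appropriate": False},
]
print(misapplication_rate(records), appropriate_application_rate(records))  # 0.5 0.5
```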
Key empirical metrics introduced and used are: AI adoption rate (sector-level intensity), a skill-shift index, hybrid job share, and employment levels/net changes by sector.
Methods description listing the constructed metrics used in the simulated dataset and subsequent analyses (definitions and calculation procedures provided in the paper).
The study's main limitations include reliance on a simulated dataset rather than exhaustive administrative microdata, literature limited to selected publishers/years, and correlational (not causal) identification of some effects.
Authors' explicitly stated limitations in the paper's methods and discussion sections describing data choices (simulated dataset, selected publishers 2020–2024) and the observational/correlational nature of several analyses.
The authors explicitly note limitations: the study focuses on prediction (not causation), results are sensitive to data quality, workforce records may contain biases, and practical constraints like privacy and deployment complexity limit direct operational adoption.
Limitations section described by the authors listing prediction-versus-causation distinction, sensitivity to data quality, potential biases, privacy concerns, and deployment complexity.
The study used a reproducible modeling pipeline (data cleaning, feature engineering, model training and tuning, systematic evaluation) applied to several freely available workforce datasets to enable replication.
Methods section describes a reproducible workflow including preprocessing steps, engineered features, hyperparameter tuning for each model class, cross-validation, and use of publicly available datasets.
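A minimal sketch of such a pipeline, using scikit-learn conventions, is shown below; the features, model class, and search grid are placeholders rather than the paper's actual configuration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names; the paper's engineered features are not reproduced here.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_years", "training_hours"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector", "role"]),
])
pipe = Pipeline([("pre", pre), ("clf", GradientBoostingClassifier(random_state=0))])

# Cross-validated hyperparameter search for reproducible tuning
search = GridSearchCV(pipe, {"clf__n_estimators": [100, 300], "clf__max_depth": [2, 3]}, cv=5)
# search.fit(X, y)  # X, y drawn from a publicly available workforce dataset
```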
This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.
Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.
Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams.
Benchmarking experiment described in the study: four autonomous AI agents evaluated on the identical fresh challenge set used in the live onsite CTF.
Data and methods: the study used an online experiment with 861 online-retail employees performing short-duration, virtual, task-focused collaborations; analyses focused on direct effects, moderation (emotion and partner type), mediation (service empathy), and moderated-mediation.
Methods description in the paper specifying design, sample size (n = 861), task context (temporary virtual teamwork), and analytic approach (hypothesis tests including moderation and mediation analyses).
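For orientation, a mediation/moderation sketch in the spirit of that analytic approach is shown below, run on synthetic placeholder data; the column names, effect sizes, and OLS estimator are assumptions, not the paper's specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 861  # matches the reported sample size; the data themselves are synthetic
df = pd.DataFrame({
    "partner_ai": rng.integers(0, 2, n),   # 0 = human partner, 1 = AI partner
    "emotion": rng.normal(size=n),         # hypothetical moderator
})
df["empathy"] = 0.3 * df["partner_ai"] + rng.normal(size=n)    # mediator (service empathy)
df["proficiency"] = 0.5 * df["empathy"] + rng.normal(size=n)   # outcome

direct    = smf.ols("proficiency ~ partner_ai", data=df).fit()            # direct effect
path_a    = smf.ols("empathy ~ partner_ai", data=df).fit()                # partner -> mediator
path_b    = smf.ols("proficiency ~ partner_ai + empathy", data=df).fit()  # mediator -> outcome
moderated = smf.ols("proficiency ~ partner_ai * emotion", data=df).fit()  # moderation term
print(path_b.params)
```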
Teamwork partner type (human vs AI) has no direct, significant effect on collaboration proficiency for temporary virtual tasks.
Online experiment with employees in the online-retail industry (n = 861). Hypothesis testing showed no significant main effect of partner type on the outcome variable 'collaboration proficiency' in the reported analyses.
Empirical strategy: the main identification strategy uses panel regressions with quadratic AI specification and interaction terms, controlling for firm covariates, employing fixed effects and robustness checks (alternative measures, sub-samples).
Methods section description: panel regressions including AI and AI^2, interactions for moderators, controls, fixed effects, and robustness analyses reported in the paper.
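A generic form of that specification (notation mine; the paper's exact controls and moderators are not reproduced) is:

```latex
y_{it} = \beta_1\,\mathrm{AI}_{it} + \beta_2\,\mathrm{AI}_{it}^{2}
       + \gamma\,(\mathrm{AI}_{it} \times M_{it})
       + X_{it}'\delta + \mu_i + \lambda_t + \varepsilon_{it}
```

where y_{it} is the firm outcome, M_{it} a moderator, X_{it} the firm covariates, and μ_i and λ_t firm and year fixed effects; robustness checks re-estimate the model with alternative measures and sub-samples.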
Data/sample claim: the empirical analysis uses a panel of 2,575 Chinese listed firms observed from 2013 to 2023.
Paper-stated sample description (panel dataset covering 2013–2023, N = 2,575 firms).
The paper recommends an empirical research agenda including field experiments comparing teams with and without AI mediation, structural models of labor supply and wages under reduced language frictions, microdata analysis of adopters, and measurement studies for coordination costs and mediated-action reliability.
Explicit recommendations and research agenda stated in the paper; this is a descriptive claim about the paper's content rather than an empirical finding.
The paper's primary approach is conceptual/theoretical development and agenda-setting; it does not report large-scale empirical or experimental data.
Explicit methods statement in the paper: synthesis, illustrative examples, framework development; absence of reported empirical sample or experiments.
The study's empirical base consists of 40 semi-structured interviews with cross-industry project practitioners in the UK, analyzed using thematic qualitative methods.
Stated data and methods in the paper: sample size (40), interview method, cross-industry sampling, and thematic analysis.