Evidence (5267 claims)
- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Adoption
The ceiling discrimination probe used Gemini Pro (Google) and Copilot Pro (Microsoft) as independent judges.
Methods: reported use of Gemini Pro and Copilot Pro as independent judges for the ceiling probe.
Primary blind scoring was performed by Claude (Anthropic) used as an LLM judge.
Methods: primary blind scoring explicitly performed by Claude.
Re-administration under declared conditions produced zero delta across all 16 dimension-pair comparisons (no measurable change when declaration status changed).
Reported repeated-measures comparisons across 16 predefined dimension pairs between blind and declared administrations, with reported zero delta.
Series 2 consisted of local and API open-source systems (n = 6) administered blind and declared, with four systems re-administered under declared conditions.
Methods description detailing Series 2 composition, modes (blind and declared), and that four systems were re-tested under declared conditions.
Series 1 consisted of frontier commercial systems administered blind (n = 7).
Methods description specifying Series 1 composition and blind administration.
The study employed 24 experimental conditions spanning 13 distinct LLM systems across two series.
Study design reported in Methods: Series 1 (frontier commercial, blind, n=7), Series 2 (local/API open-source, blind and declared, n=6), plus re-administered declared runs and ceiling-probe runs summing to 24 conditions.
The paper reports no proprietary deployment metrics beyond qualitative field observations; instead, it provides experimental formalizations for reproducible evaluation.
Authors explicitly note they document how to reproduce experiments but do not claim proprietary deployment metrics beyond qualitative field observations.
The paper recommends tracking specific operational and economic metrics: MTTR for tool failures, per-invocation latency variance, per-interaction operational cost, frequency of identity-related incidents, human remediation hours per 1,000 incidents, and SLA breach rates.
Explicit list of recommended metrics in the implications and metrics-to-track sections of the paper.
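The paper lists these metrics but does not define them formally; a minimal sketch of how they might be aggregated from incident and invocation logs follows. The record fields and function names here are illustrative assumptions, not the paper's instrumentation.

```python
from dataclasses import dataclass
from statistics import mean, pvariance

@dataclass
class Incident:
    detected_at: float       # epoch seconds
    resolved_at: float       # epoch seconds
    identity_related: bool   # counts toward identity-incident frequency
    human_hours: float       # human remediation effort

def mttr_hours(incidents):
    """Mean time to resolution for tool failures, in hours."""
    return mean((i.resolved_at - i.detected_at) / 3600 for i in incidents)

def remediation_hours_per_1000(incidents):
    """Human remediation hours normalized per 1,000 incidents."""
    return 1000 * sum(i.human_hours for i in incidents) / len(incidents)

def latency_variance_ms2(latencies_ms):
    """Per-invocation latency variance (population variance, in ms^2)."""
    return pvariance(latencies_ms)

def sla_breach_rate(latencies_ms, sla_ms):
    """Fraction of invocations whose latency exceeds the SLA budget."""
    return sum(l > sla_ms for l in latencies_ms) / len(latencies_ms)
```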
The paper provides a production-readiness checklist and instructions for reproducible evaluation alongside the proposed mechanisms.
Deliverables enumerated in the paper include a production-readiness checklist and reproducible experimental methodology.
All three proposed mechanisms (CABP, ATBA, SERF) are formalized as testable hypotheses with reproducible experimental methodology (benchmarks, latency/error models, broker pipeline semantics).
Paper includes formal descriptions and reproducible evaluation instructions and benchmarks; authors state methods to reproduce experiments are provided.
The paper organizes production failure modes across five dimensions—server contracts, user context, timeouts, errors, and observability—and provides concrete failure vignettes from an enterprise deployment.
Taxonomy and failure vignettes are listed as design artifacts and deliverables in the paper; derived from observational analysis of production logs and incidents.
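The taxonomy itself is descriptive; purely as an assumed encoding (the class and field names are invented here), the five dimensions could serve as a tagging scheme for incident records:

```python
from dataclasses import dataclass
from enum import Enum

class FailureDimension(Enum):
    # The five dimensions named in the paper's taxonomy.
    SERVER_CONTRACTS = "server_contracts"
    USER_CONTEXT = "user_context"
    TIMEOUTS = "timeouts"
    ERRORS = "errors"
    OBSERVABILITY = "observability"

@dataclass
class FailureVignette:
    """One production incident, tagged by taxonomy dimension."""
    dimension: FailureDimension
    description: str
    remediation: str

example = FailureVignette(
    dimension=FailureDimension.TIMEOUTS,
    description="Tool call exceeded the per-invocation latency budget.",
    remediation="Retried with backoff; flagged for SLA review.",
)
```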
The work is qualitative and exploratory — presenting naturalistic phenomena rather than causal empirical estimates, and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
Future empirical work should measure calibration (user trust vs. model accuracy), hallucination rate, user comprehension of capability limits, and behavioral dependence on system recommendations.
Explicit methodological recommendations and suggested metrics in the paper; these are proposed future measurements rather than reported findings.
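These are proposed future measurements, not results reported in the paper; one illustrative way to operationalize two of them, assuming per-item trust ratings and correctness labels are available, is:

```python
from statistics import mean

def calibration_gap(trust_ratings, correct):
    """Mean stated trust (0-1) minus observed accuracy (0-1).

    Positive values indicate over-trust relative to model accuracy;
    negative values indicate under-trust.
    """
    return mean(trust_ratings) - mean(correct)

def hallucination_rate(claims_checked, claims_unsupported):
    """Share of checked model claims found unsupported by sources."""
    return claims_unsupported / claims_checked

# Example: average stated trust 0.9, but only 3 of 4 answers correct -> ~0.15 over-trust.
print(calibration_gap([0.9, 0.85, 0.95, 0.9], [1, 0, 1, 1]))
```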
Conversational AI differs from interpersonal conversation: it has no true beliefs/intentions or accountability and produces probabilistic, sometimes inconsistent outputs with opaque training/data provenance.
Analytical/distinctive claim based on properties of LLMs and machine learning models discussed in the paper; conceptual analysis, no empirical testing.
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation should follow via suggested experiments.
Explicit statement in the paper about nature of contribution.
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLMs.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
A data-driven taxonomy was derived mapping common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
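Neither the metrics nor the validations are formalized in the paper; as a toy sketch with invented names, two of the proposed quantities reduce to simple performance gaps:

```python
def rule_extraction_fidelity_loss(llm_score, extracted_rule_score):
    """Performance lost when an LLM's behavior is compressed into explicit rules."""
    return llm_score - extracted_rule_score

def distribution_shift_gap(in_dist_score, shifted_score):
    """Generalization drop under distribution shift, a proxy for tacit capability."""
    return in_dist_score - shifted_score
```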
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
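The summary does not state which statistical test produced p = 0.584; purely as an illustration of how such a comparison could be run over per-item judge scores, a paired sign-flip permutation test looks like:

```python
import random

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided p-value for the mean paired difference via sign-flipping."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / n_perm
```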
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.
Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.
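The summary names the three stages but no interfaces; a schematic of one improvement step, in which every helper is a caller-supplied callable and the names are assumptions rather than the paper's API, might read:

```python
def leafe_update(policy, rollout, summarize, find_failure_causes,
                 reexplore, distill_sft):
    """One LEAFE-style improvement step (schematic sketch).

    1. Summarize environment feedback into compact experience items.
    2. Backtrack to decision points causally linked to failures and
       re-explore corrective action branches.
    3. Distill corrected trajectories into the policy via supervised fine-tuning.
    """
    trajectory = rollout(policy)                      # agent interacts with the environment
    experiences = summarize(trajectory)               # compact experience items from feedback
    failure_points = find_failure_causes(trajectory, experiences)
    corrected = [reexplore(trajectory, point) for point in failure_points]
    return distill_sft(policy, corrected)             # SFT on the corrected trajectories
```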
Human-quality proxy metrics were used for evaluation, with comparisons made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
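The summary names the components but not their weights or scales; assuming all signals are normalized to [0, 1] and using invented weights, the composite could be computed as:

```python
def composite_reward(structural_ok, render_quality, aesthetic_score,
                     factuality, coverage, coherence, inverse_spec,
                     weights=(0.2, 0.15, 0.15, 0.15, 0.1, 0.1, 0.15)):
    """Weighted sum of the reward components named in the paper.

    Inputs are assumed to lie in [0, 1]; the weights are illustrative,
    not taken from the paper.
    """
    components = (float(structural_ok), render_quality, aesthetic_score,
                  factuality, coverage, coherence, inverse_spec)
    return sum(w * c for w, c in zip(weights, components))
```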
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
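The summary states the 0.5% figure but not the adapter method; assuming a LoRA-style adapter via the `peft` library (a common way to reach that fraction), the trainable share could be checked as below. The model id is the public Qwen2.5-Coder-7B checkpoint; the LoRA hyperparameters are illustrative.

```python
# Sketch only: the paper reports GRPO updates to 0.5% of parameters,
# not which parameter-efficient method was used. LoRA is assumed here.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-7B")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(base, lora)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.3%}")  # target: about 0.5%
```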
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
The summary states that measured benefits were qualitative and process-level; no detailed quantitative throughput or coverage numbers are provided. (Meta-claim about the evidence reported.)
Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.
Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.
Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.
Method summary explicitly lists these objectives as part of the TraceR1 training procedure.
TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.
Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.
During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).
Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.
Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.
Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.
TraceR1 uses a two-stage training procedure: Stage 1 trains trajectory-level RL on predicted short-horizon trajectories with rewards that enforce global consistency.
Method description in summary: Stage 1 — Trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
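The summary describes the two reward streams (trajectory-level consistency from Stage 1, stepwise grounded rewards from execution feedback in Stage 2) but not how they are mixed; a minimal sketch with an assumed mixing weight is:

```python
def combined_objective(trajectory_consistency, step_rewards, alpha=0.5):
    """Mix a trajectory-level consistency reward with stepwise grounded rewards.

    `alpha` is an assumed mixing weight, not a value reported for TraceR1;
    both inputs are assumed to be normalized reward signals.
    """
    step_term = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return alpha * trajectory_consistency + (1 - alpha) * step_term
```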
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
The paper identifies several open empirical research questions: measuring the marginal cost of runtime governance, characterizing the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
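The paper treats risk calibration as an open problem rather than a solved design; a toy threshold policy, with invented threshold values, makes the task-completion versus compliance-risk tradeoff concrete:

```python
def enforcement_action(violation_prob, warn_at=0.3, block_at=0.8):
    """Map a calibrated violation probability to an enforcement action.

    Thresholds are illustrative: lowering `block_at` reduces compliance risk
    but blocks more legitimate actions (hurting task completion), which is
    the unmeasured tradeoff curve the paper identifies.
    """
    if violation_prob >= block_at:
        return "block"
    if violation_prob >= warn_at:
        return "flag_for_review"
    return "allow"
```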