Evidence (3103 claims)

Claim counts by topic:

- Adoption: 5267 claims
- Productivity: 4560 claims
- Governance: 4137 claims
- Human-AI Collaboration: 3103 claims
- Labor Markets: 2506 claims
- Innovation: 2354 claims
- Org Design: 2340 claims
- Skills & Training: 1945 claims
- Inequality: 1322 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Claims filtered to: Human-AI Collaboration
Leaders' AI symbolization strengthens AI's positive effect on employees' sense of self-determination.
Moderation analysis within the same four-stage longitudinal survey of 285 finance professionals; leader AI symbolization tested as moderator of AI usage -> sense of self-determination path.
AI usage can boost innovative work behavior by enhancing employees' sense of self-determination.
Four-stage longitudinal study (survey) of finance professionals (N=285); mediation analysis testing AI usage -> sense of self-determination -> innovative work behavior, grounded in SOR theory.
Retrieval substantially improves reasoning over textual fundamentals.
Result reported from the experiments comparing zero-shot prompting to retrieval-augmented settings on fundamentals-focused questions; the paper asserts that retrieval provided substantial improvement for textual fundamentals reasoning.
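The mechanics of such a zero-shot vs retrieval-augmented comparison can be sketched generically. The corpus, the toy word-overlap retriever, and the prompt templates below are hypothetical illustrations, not the paper's setup:

```python
# Hypothetical sketch contrasting zero-shot vs retrieval-augmented
# prompting. CORPUS, the scorer, and the templates are illustrative.

CORPUS = {
    "10-K excerpt": "Revenue grew 12% year over year, driven by services.",
    "earnings call": "Management guided to flat margins for the next quarter.",
}

def retrieve(question: str, k: int = 1) -> list:
    """Toy lexical retriever: rank passages by word overlap with the question."""
    q = set(question.lower().split())
    scored = sorted(CORPUS.items(),
                    key=lambda kv: -len(q & set(kv[1].lower().split())))
    return [text for _, text in scored[:k]]

def zero_shot_prompt(question: str) -> str:
    return f"Question: {question}\nAnswer:"

def rag_prompt(question: str) -> str:
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

q = "How fast did revenue grow?"
print(rag_prompt(q))  # grounds the model in the retrieved passage
```

A real evaluation would swap the toy retriever for BM25 or embedding search and score model answers on a fundamentals question set.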
Human-AI systems should be designed under a cognitive sustainability constraint so that gains in hybrid performance do not come at the cost of degradation in human expertise.
Normative recommendation in the paper based on the conceptual/mathematical framework and the identified trade-off; presented as an argument rather than empirically validated policy outcome in the excerpt.
Together, these quantities provide a low-dimensional metric space for evaluating whether human-AI systems achieve genuine synergistic performance and whether such performance is cognitively sustainable for the human component over time.
Claim about the utility of the defined metrics, supported within the paper by the conceptual/mathematical framework and the proposed metric definitions (theoretical demonstration rather than reported empirical validation in the excerpt).
The paper defines a set of operational metrics: the Cognitive Amplification Index (CAI*), the Dependency Ratio (D), the Human Reliance Index (HRI), and the Human Cognitive Drift Rate (HCDR).
Explicit listing of newly proposed operational metrics in the paper; this is a descriptive claim about the paper's content (theoretical definitions), no sample size or empirical estimation provided in the excerpt.
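The excerpt does not reproduce the paper's formal definitions; a purely illustrative Python sketch follows, with hypothetical ratio-style formulas standing in for the named metrics:

```python
# Illustrative only: the formulas below are hypothetical stand-ins and
# NOT the paper's actual definitions of CAI*, D, HRI, or HCDR.

def amplification_index(hybrid_score: float, human_alone: float) -> float:
    """Hypothetical CAI*-style ratio: hybrid human-AI performance
    relative to unaided human performance (>1 suggests amplification)."""
    return hybrid_score / human_alone

def dependency_ratio(ai_decided: int, total_decisions: int) -> float:
    """Hypothetical D-style share of decisions delegated to the AI."""
    return ai_decided / total_decisions

def cognitive_drift_rate(skill_before: float, skill_after: float,
                         periods: float) -> float:
    """Hypothetical HCDR-style rate: decline in unaided human skill per
    period while working with the AI (positive values indicate drift)."""
    return (skill_before - skill_after) / periods

print(amplification_index(1.2, 1.0))  # 1.2 -> hybrid beats human alone
print(dependency_ratio(30, 100))      # 0.3 of decisions delegated
print(cognitive_drift_rate(0.80, 0.74, 6))
```

Any real use would substitute the paper's actual definitions for these placeholders.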
The paper introduces a conceptual and mathematical framework to distinguish cognitive amplification (AI improves hybrid human-AI performance while preserving human expertise) from cognitive delegation (reasoning is progressively outsourced to AI).
Explicit contribution claim in the paper (description of a conceptual and mathematical framework); evidence consists of the model and formal definitions presented in the paper (no external empirical validation reported in the excerpt).
Given these findings, policymakers should favor 'strategic forbearance'—apply existing laws rather than create new regulations that could stifle innovation and diffusion of AI.
Authors' normative policy recommendation based on their interpretation of the reviewed empirical literature (risk–benefit assessment); this is a prescriptive conclusion rather than an empirical finding, so no sample size applies.
Generative AI lowers entry costs for startups, facilitating new firm entry and product development.
Cited empirical and descriptive evidence in the literature review indicating reduced development costs and faster product prototyping enabled by AI tools; the brief does not provide a pooled sample size or a single quantitative estimate.
Generative AI significantly boosts productivity in specific tasks like coding, writing, and customer service—often by 15% to 50%.
Synthesis/review of empirical literature through 2025 (multiple empirical studies of task-level impacts, including field and lab studies and observational analyses); the brief reports aggregate reported effect ranges but does not list a single pooled sample size.
The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS.
Paper includes link to the open-source datasets and the AgentDS website.
The strongest solutions arise from human-AI collaboration.
Analysis of competition results showing top-performing submissions employed human-AI collaborative approaches rather than AI-only baselines (results from 29 teams / 80 participants).
We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science.
Paper describes the creation of the AgentDS benchmark and an associated competition as the study's primary methodological contribution.
Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows.
Statement in the paper referencing recent developments in LLMs and AI agents; presented as motivation rather than validated empirically within the paper.
Data science plays a critical role in transforming complex data into actionable insights across numerous domains.
Background statement in the paper (no empirical test or dataset provided to support this claim).
LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews.
Analysis of scores in the conference peer review dataset comparing LLM-generated vs human reviews; the excerpt states an average increase of one full point but does not include sample size or scale range.
About 21% of scientific peer reviews at a recent top AI conference were LLM-generated in the wild.
Analysis of peer reviews from a recent top AI conference reported in the paper; the excerpt reports the 21% figure but does not give total number of reviews in the excerpt.
Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning.
Experiment in which LLMs were given expert feedback and explicit instructions to perform only grammar edits; comparisons show significant semantic alteration despite constrained instructions; sample size not provided.
Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning.
Controlled experiments applying LLM revision to a pre-LLM essay dataset and comparing pre- and post-revision content/semantics; dataset described as collected in 2021 but sample size not stated in the excerpt.
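Measuring such pre/post-revision drift typically reduces to comparing text representations. A toy sketch using bag-of-words cosine similarity follows; the example texts are invented, and real studies would use semantic embeddings rather than lexical overlap:

```python
# Toy pre/post-revision comparison via bag-of-words cosine similarity.
# Real analyses would use sentence embeddings; this only illustrates
# the shape of the computation.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

original = "remote work should be optional because teams differ"
revised = "remote work offers benefits and drawbacks for different teams"
drift = 1.0 - cosine(original, revised)
print(f"lexical drift: {drift:.2f}")
```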
In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question.
Human user study reported in the paper; the excerpt gives the quantified result (nearly 70% increase) but does not report sample size here.
LLMs consistently alter the intended meaning of human writing.
Experiments in which human-written essays were revised by LLMs (including prompts asking only for grammar edits) and comparison of pre- and post-LLM text semantics; exact sample sizes not stated in the excerpt.
LLMs alter the voice and tone of human writing.
Reported results from a human user study and subsequent experiments comparing original human-written text to LLM-assisted/LLM-revised text; sample sizes not provided in the excerpt.
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing.
Statement in paper (likely based on external usage statistics or surveys cited by authors); no sample size reported in the provided text.
End-to-end verified pipelines can produce provably correct code from informal specifications.
The paper surveys early research demonstrating pipelines that go from informal specifications to formally verified code; the provided text does not include experimental sample sizes or benchmarks.
AI-generated postconditions catch real-world bugs missed by prior methods.
Surveyed early research asserted by the paper indicating empirical instances where AI-generated postconditions found bugs that other methods missed; no numeric details provided in the excerpt.
Interactive test-driven formalization improves program correctness.
Paper surveys early research that reportedly demonstrates this effect (described as 'interactive test-driven formalization that improves program correctness'); the excerpt does not include specific study details or sample sizes.
The central bottleneck is validating specifications: since there is no oracle for specification correctness other than the user, we need semi-automated metrics that can assess specification quality with or without code, through lightweight user interaction and proxy artifacts such as tests.
Analytical claim and research agenda item in the paper; motivates need for new metrics and interaction designs. No empirical validation or sample size reported in the excerpt.
Intent formalization offers a tradeoff spectrum suitable to the reliability needs of different contexts: from lightweight tests that disambiguate likely misinterpretations, through full functional specifications for formal verification, to domain-specific languages from which correct code is synthesized automatically.
Conceptual framework proposed in the paper describing a spectrum of specification formality; presented as an argument rather than an empirical finding, with no sample sizes provided in the excerpt.
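The lightweight end of that spectrum, tests as disambiguating specs, can be shown with a toy example (the function and tests below are hypothetical, not from the paper): the informal intent "round to the nearest integer" is ambiguous about ties, and a couple of tests pin down the intended reading.

```python
# Toy illustration of 'lightweight tests that disambiguate likely
# misinterpretations': the informal intent "round to the nearest int"
# leaves tie-breaking unspecified, so tests fix the intended behavior.
import math

def round_half_up(x: float) -> int:
    """Intended reading: ties round up (2.5 -> 3), unlike Python's
    built-in round(), which uses banker's rounding (2.5 -> 2)."""
    return math.floor(x + 0.5)

# Disambiguating tests: these would fail under the banker's-rounding reading.
assert round_half_up(2.5) == 3
assert round(2.5) == 2            # built-in banker's rounding, for contrast
assert round_half_up(-2.5) == -2  # floor(-2.0) = -2
```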
Intent formalization — translating informal user intent into checkable formal specifications — is the key challenge that will determine whether AI makes software more reliable or merely more abundant.
Normative argument presented by the authors as the central thesis of the paper; no empirical study or sample size cited in the provided text.
Agentic AI systems can now generate code with remarkable fluency.
Assertion made by the authors based on contemporary observations of large code-generating models; no empirical sample size or benchmark numbers reported in the text provided.
The initially selected candidates determine both the benchmark of success and the direction of improvement.
Theoretical result asserted by the authors based on analysis of the closed-loop system (paper's analytical finding).
Rejected individuals exert effort to improve actionable features along directions implied by the decision rule.
Model assumption and dynamic behavior encoded in the proposed framework (assumption/behavioral mechanism in the model).
External inputs that bypass internal filtering shorten recognition delays (i.e., speed up detection of regime shifts).
Model extensions/analysis showing that when some inputs are allowed to bypass internal exclusion mechanisms, the dynamics of anchor updating detect regime changes faster; result comes from theoretical model manipulations, not empirical testing.
In a preregistered mediation model, perceived accountability mediated the AI-over-questionnaire effect on goal progress (indirect effect = 0.15, 95% CI [0.04, 0.31]).
Mediation analysis preregistered and reported in the paper using data from the RCT (N = 517); indirect effect estimate 0.15 with 95% confidence interval [0.04, 0.31].
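The shape of such an indirect-effect estimate with a percentile-bootstrap interval can be sketched on simulated data. This is a generic illustration, not the paper's preregistered model, and the simplified slope-on-slope estimator omits the control for treatment a full mediation model would include:

```python
# Generic percentile-bootstrap sketch of a mediation indirect effect
# (a*b), on simulated data. Illustrative only -- not the paper's
# preregistered model or its dataset.
import random

def fit_slope(x, y):
    """OLS slope of y on x (single predictor with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def indirect_effect(treat, mediator, outcome):
    a = fit_slope(treat, mediator)    # treatment -> mediator path
    b = fit_slope(mediator, outcome)  # mediator -> outcome path
    return a * b                      # (simplified: no control for treatment)

random.seed(0)
n = 500
treat = [i % 2 for i in range(n)]
mediator = [0.5 * t + random.gauss(0, 1) for t in treat]
outcome = [0.3 * m + random.gauss(0, 1) for m in mediator]

boots = []
for _ in range(1000):
    idx = [random.randrange(n) for _ in range(n)]
    boots.append(indirect_effect([treat[i] for i in idx],
                                 [mediator[i] for i in idx],
                                 [outcome[i] for i in idx]))
boots.sort()
lo, hi = boots[24], boots[974]  # 95% percentile interval
print(f"indirect effect: {indirect_effect(treat, mediator, outcome):.2f}, "
      f"95% CI [{lo:.2f}, {hi:.2f}]")
```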
The AI chatbot produced significantly higher goal progress than the no-support control at two-week follow-up.
Between-groups comparison in the preregistered RCT (N = 517); reported effect size d = 0.33 and p = .016 for AI vs control on goal progress measured at two-week follow-up.
The authors provide a demo video, a hosted website, and an installable package demonstrating JobMatchAI.
Paper explicitly states availability of a demo video, a hosted website, and an installable package. No links, access dates, or artifact verification details are provided in the excerpt.
The authors provide a hybrid retrieval stack combining BM25, a skill knowledge graph, and semantic components to evaluate skill generalization.
Paper describes a hybrid retrieval stack composed of BM25, a knowledge graph, and semantic retrieval components intended for evaluation of skill generalization. No evaluation metrics or comparisons are included in the excerpt.
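Score fusion in such a hybrid stack can be sketched in a few lines. The three scorers and the weights below are hypothetical stand-ins for BM25, the skill knowledge graph, and the semantic component, not JobMatchAI's implementation:

```python
# Illustrative score fusion for a hybrid retrieval stack. The weights
# and component scores are hypothetical stand-ins for BM25, a skill
# knowledge graph, and a semantic (embedding) component.

def graph_skill_overlap(query_skills: set, doc_skills: set) -> float:
    """Jaccard overlap between query skills and document skills."""
    if not query_skills or not doc_skills:
        return 0.0
    return len(query_skills & doc_skills) / len(query_skills | doc_skills)

def hybrid_score(bm25: float, graph: float, semantic: float,
                 weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted linear fusion; assumes each component is pre-normalized
    to [0, 1]."""
    wb, wg, ws = weights
    return wb * bm25 + wg * graph + ws * semantic

# (bm25, graph, semantic) per hypothetical job description
docs = {
    "jd_1": (0.8, graph_skill_overlap({"python", "sql"},
                                      {"python", "sql", "spark"}), 0.7),
    "jd_2": (0.9, graph_skill_overlap({"python", "sql"}, {"java"}), 0.4),
}
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d]), reverse=True)
print(ranked)  # jd_1 wins despite lower BM25, via skill overlap
```

The knowledge-graph term lets skill generalization (e.g., related skills) lift candidates that pure lexical matching would miss.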
The authors release JobSearch-XS benchmark.
Paper explicitly states release of the JobSearch-XS benchmark. No dataset size, annotation protocol, or access URL provided in the excerpt.
JobMatchAI integrates Transformer embeddings, skill knowledge graphs, and interpretable reranking.
Statement in paper describing system architecture and components (implementation claim). No quantitative implementation details or component-level ablation results provided in the supplied excerpt.
TDAD (Test-Driven Agentic Development) combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change.
Description of the tool/methodology and its implementation (TDAD is presented as an open-source tool in the paper).
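The core idea, an AST-derived mapping from tests to the functions they exercise, can be sketched with Python's `ast` module. The call-count weighting here is an assumption for illustration, not TDAD's actual impact-analysis algorithm:

```python
# Sketch of an AST-based code-test mapping in the spirit of TDAD:
# parse a test module, record which functions each test calls, then
# surface the tests touching a changed function. The weighting scheme
# (raw call counts) is an assumption, not TDAD's actual algorithm.
import ast
from collections import Counter

TEST_SRC = '''
def test_add():
    assert add(1, 2) == 3

def test_mul():
    assert mul(add(1, 1), 3) == 6
'''

def build_code_test_graph(source: str) -> dict:
    """Map test name -> Counter of function names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            calls = Counter(
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            )
            graph[node.name] = calls
    return graph

def impacted_tests(graph: dict, changed: str) -> list:
    """Tests touching `changed`, ordered by call count (impact weight)."""
    hits = [(t, c[changed]) for t, c in graph.items() if c[changed]]
    return [t for t, _ in sorted(hits, key=lambda x: -x[1])]

graph = build_code_test_graph(TEST_SRC)
print(impacted_tests(graph, "add"))  # both tests call add
```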
On the LoCoMo benchmark, the architecture achieves 74.8% overall accuracy.
Benchmark evaluation reported in the paper using the LoCoMo benchmark with a reported overall accuracy of 74.8%.
Adversarial governance compliance was 100%.
Adversarial compliance testing reported in the paper (linked to the adversarial query experiments); reported compliance = 100%.
There was zero cross-entity leakage across 500 adversarial queries.
Adversarial testing reported in the paper: 500 adversarial queries used to test cross-entity leakage; result = zero leakage.
Progressive context delivery yielded a 50% token reduction.
Reported experimental result in the controlled experiments indicating token usage reduction from progressive delivery = 50%.
Governance routing precision was 92% in the experiments.
Reported experimental metric from the controlled experiments (N=250, five content types) showing governance routing precision = 92%.
The system achieved 99.6% fact recall (with complementary dual-modality coverage) in the controlled experiments.
Reported experimental result from the controlled experiments (N=250, five content types) as stated in the paper.
The study's strengths include multimethod triangulation, a very large behavioral dataset (150 million interactions), and controlled simulation experiments informed by empirical observation.
Methods reported: mixed‑methods sequential design with (1) 6‑month lab ethnography (n = 23), (2) computational analysis of 150 million customer interactions, and (3) empirically grounded agent‑based simulation experiments.
The Algorithmic Canvas is an operational medium where segmentation, targeting, and positioning parameters co‑evolve through iterative human–AI collaboration.
Design and implementation described in the study; observation of Canvas‑mediated interactions during a 6‑month lab ethnography inside a Fortune 500 company (n = 23).
Autopoietic STP + Algorithmic Canvas approach is 44% more resilient to market shocks than traditional, process‑based STP (p < 0.01).
Agent‑based simulations and comparative analyses informed by empirical calibration; supported by large‑scale behavioral data (150 million customer interactions) and simulation experiments. Statistical test reported with p < 0.01. Exact number of simulation runs and full test details not specified in the summary.
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.