Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

The paper examines the operational logic, defining features and emerging use cases of agentic payments across retail, e-commerce and decentralised finance.

Stated scope in the abstract; analysis and case-study-driven review across specified sectors (retail, e-commerce, DeFi). No sample sizes reported.

high null result AI Agents in Payments: Applications, Risks and Regulations emerging use cases / sector-level application

Agentic payments refer to transactions initiated and completed by AI agents without direct human intervention.

Explicit definitional statement in the abstract (conceptual definition provided by the authors).

high null result AI Agents in Payments: Applications, Risks and Regulations definition/characterisation of a payment modality

Current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes.

Synthesis of mixed results from controlled studies, meta-analyses, and benchmarks reported in the paper (no single sample size given in abstract).

high null result Agentic Agile-V: From Vibe Coding to Verified Engineering in... engineering outcomes (overall improvement from autonomous code generation)

However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance.

Reported comparison between the coordinated workflow and a strong combined-summary baseline for exoplanet vetting indicating no meaningful improvement.

high null result Cross-domain benchmarks reveal when coordinated AI agents im... relative performance vs. combined-summary baseline for exoplanet vetting

All [the listed orchestration frameworks] follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.

Author assertion based on architectural analysis of the listed frameworks (observation of orchestration pattern in the named projects).

high null result Compiling Agentic Workflows into LLM Weights: Near-Frontier ... architectural pattern (external orchestrator behavior)

The study used established measurement scales to assess AI-driven learning culture, knowledge orchestration, organisational intelligence and innovation performance.

Methods: authors report use of established scales for AIDLC, KO, OI and IP in the questionnaire.

high null result Enhancing innovation in Pakistan’s IT sector measurement validity / constructs used

Structured questionnaires were distributed between March and October 2025 to employees involved in innovation, learning and project management roles in Karachi, Lahore and Islamabad.

Methods section description of data collection period, target respondent roles, and cities covered.

high null result Enhancing innovation in Pakistan’s IT sector data collection protocol (timing and respondent roles)

Most respondents held undergraduate or postgraduate degrees in computer science, engineering or business-related disciplines.

Sample demographic summary from the survey (N=348).

high null result Enhancing innovation in Pakistan’s IT sector respondent educational background

After screening the data, 348 valid responses were analyzed.

Structured questionnaires distributed March–October 2025 to employees in medium and large IT firms in Karachi, Lahore and Islamabad; screening produced 348 valid responses (sample description in methods).

high null result Enhancing innovation in Pakistan’s IT sector sample_size

The paper draws on empirical studies from 2024–2026.

Methodological statement in the paper specifying the time window of empirical studies used in the analysis.

high null result The Algorithmic Mirror: Can Artificial Intelligence Truly Mi... temporal scope of literature reviewed

This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks.

Comparative evaluation reported in the paper showing that single-threshold (binary) scoring metrics do not exhibit the inverse-scaling pattern observed with tail-inclusive distributional metrics (specific metrics and calculations not given in excerpt).

high null result Is Capability a Liability? More Capable Language Models Make... relationship between model capability and accuracy under single-threshold metric...

Domain knowledge does not reliably rescue calibration.

Experiments reported in the paper where domain-knowledge interventions (procedures or prompts incorporating domain knowledge) were applied and did not consistently improve forecast calibration (details not provided in excerpt).

high null result Is Capability a Liability? More Capable Language Models Make... forecast calibration after incorporating domain knowledge

Using large language models, we measure the AIO level of Chinese listed companies from 2010 to 2023.

Authors report constructing firm-level measures of artificial intelligence orientation (AIO) by applying large language models to corporate texts/disclosures for Chinese listed companies over the 2010–2023 period.

high null result Artificial intelligence orientation and decarbonization spil... artificial intelligence orientation (AIO) measurement

We compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred.

Study design: comparison between trait set responsible for incidents (from incident reports) and stated developer preferences collected from a sample of 197 developers working on those tasks.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... developers' preferred AI system traits (self-reported)

We compared the extracted traits with the traits that 202 workers highly familiar with those tasks would have preferred.

Study design: a comparison between LLM-extracted traits from incident reports and stated preferences from a sample of 202 workers familiar with the tasks.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... workers' preferred AI system traits (self-reported preferences)

We used an LLM-as-an-expert approach to extract the main traits of the AI systems involved in those incidents using an established framework of twelve traits.

Methods statement: applied a Large Language Model to code/extract AI system traits from the incident reports using an established 12-trait framework.

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... trait classification of AI systems involved in incidents

We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors.

Descriptive statement in paper: dataset comprised 1,524 incident reports, covering 171 occupational tasks and 12 industry sectors (dataset construction / corpus used for analysis).

high null result The Quiet Path from Seemingly Minor Design Errors to Workpla... scope and coverage of analyzed incident reports (number of incidents, tasks, and...

This study provides the first cross-class synthesis covering raw materials, work-in-process, and finished goods within a unified evaluative framework, positioning machine learning and deep reinforcement learning methods alongside classical policy families and quantifying the boundary conditions for each approach.

Author-stated theoretical contribution and scope of the review (coverage of raw materials, WIP, finished goods and methods).

high null result Equitable railway corridor investment under demand uncertain... breadth and novelty of synthesis across inventory classes and methods

A random-effects model estimated by restricted maximum likelihood was applied to pool percentage cost-reduction effect sizes across 18 studies admissible to quantitative synthesis.

Methods reported in the paper: random-effects meta-analysis using REML across 18 studies eligible for quantitative pooling.

high null result Equitable railway corridor investment under demand uncertain... pooled percentage cost-reduction effect sizes

A systematic review and meta-analytic synthesis of 31 peer-reviewed studies published between 2004 and 2025 was conducted following the PRISMA 2020 protocol.

Study methods reported in the paper: systematic review following PRISMA 2020; sample of 31 peer-reviewed studies dated 2004–2025.

high null result Equitable railway corridor investment under demand uncertain... number and coverage of studies included in the review

The study uses AI patent counts as its indicator of artificial intelligence.

Methodological note: AI patent counts are used as the dependent variable; data cover the G8 countries plus Türkiye, 2010-2020.

high null result AR-GE HARCAMALARININ VE VERGİ TEŞVİKLERİNİN YAPAY ZEKAYA ETK... AI patent counts (descriptive / dependent-variable statement)

The study uses PyQu to quantify changes across five quality attributes for Python code.

Methodological description: application of PyQu (an ML-based quality assessment tool for Python) to measure five quality attributes before and after refactoring edits.

high null result Quality and Security Signals in AI-Generated Python Refactor... five PyQu quality attributes (measured by the tool)

From the observed diffs, we derive a taxonomy of 24 recurring change operations.

Manual/automated analysis of diffs from the studied agentic refactoring PRs to identify and categorize recurring change operations into a 24-item taxonomy.

high null result Quality and Security Signals in AI-Generated Python Refactor... count and categorization of recurring change operations present in diffs

We will release the reanalysis pipeline to support replication.

Authors' statement of intent in the paper to release code/pipeline for replication.

high null result When Skills Don't Help: A Negative Result on Procedural Know... availability of reanalysis pipeline (planned release)

In offensive cybersecurity, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold).

Statistical re-analysis of the 180-run CTF study comparing no-Skills vs full-Skills conditions: reported spread = 8.9 percentage points; reported p-values from χ² and Cochran–Armitage trend tests; reported Cohen's h comparisons.

high null result When Skills Don't Help: A Negative Result on Procedural Know... task pass rate (success rate) in Capture-the-Flag offensive cybersecurity tasks
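For readers unfamiliar with the effect-size measure cited in the entry above, the following is a minimal sketch of Cohen's h for two proportions. The pass rates used here are hypothetical, chosen only so that their spread equals the reported 8.9 percentage points; the excerpt does not give the per-condition pass rates, and h depends on where on the 0-1 scale the two rates sit.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Hypothetical pass rates (NOT from the paper); only the 8.9-point
# spread between them matches the reported figure.
p_full_skills = 0.589
p_no_skills = 0.500

h = cohens_h(p_full_skills, p_no_skills)
print(f"Cohen's h = {h:.3f}")
print("below the 0.2 small-effect threshold:", abs(h) < 0.2)
```

Under these assumed rates the effect falls under the conventional 0.2 threshold, which is consistent with the entry's description but does not reproduce the paper's own pairwise values.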

Those four documentation conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation.

Authors map the four documentation-line-count conditions from the re-analyzed study to skill-ablation categories (No/Experiential/Curated/Comprehensive) as part of their interpretive re-analysis.

high null result When Skills Don't Help: A Negative Result on Procedural Know... mapping of documentation richness to Skill-ablation categories

We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines).

Authors' re-analysis of an existing controlled study consisting of 180 runs and four documentation conditions with the stated line counts; this is a descriptive claim about the re-analysis dataset and experimental conditions.

high null result When Skills Don't Help: A Negative Result on Procedural Know... reanalysis dataset size and documentation-condition line counts

Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate.

Empirical evaluation: 660 trials run using Claude Code on the minimal-pair repos with hidden tests; reported comparison of pass rates between clean and messy repo variants showing no change.

high null result Does Code Cleanliness Affect Coding Agents? A Controlled Min... pass rate (task success on hidden tests)

Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity.

Measurement approach stated in the abstract (unified rubric with listed dimensions).

high null result Less Back-and-Forth: A Comparative Study of Structured Promp... evaluation_rubric

The study uses three LLM systems: ChatGPT, Claude, and Grok.

Method description in the paper's abstract naming the three LLMs evaluated.

high null result Less Back-and-Forth: A Comparative Study of Structured Promp... models_evaluated

The evaluation covers four task types: summarization, planning, explanation, and coding.

Method description in the paper's abstract listing the four task types used for evaluation.

high null result Less Back-and-Forth: A Comparative Study of Structured Promp... task_types_evaluated

The study compares three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt.

Experimental design described in the paper (three prompt conditions stated in the abstract).

high null result Less Back-and-Forth: A Comparative Study of Structured Promp... experimental_condition

Large language models (LLMs) are widely used for open-ended tasks.

Stated as background/context in the paper's introduction; no quantitative data reported in the abstract.

high null result Less Back-and-Forth: A Comparative Study of Structured Promp... use_of_llms_for_open_ended_tasks

We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao.

Statement of experimental methodology describing the types of evaluations performed (public datasets, simulated auctions, and online deployment).

high null result Generative Auto-Bidding with Unified Modeling and Exploratio... scope and environments of experiments (public datasets, simulations, live deploy...

This study used a controlled mixed-design experiment with 60 participants who completed analytical survival ranking tasks in multi-turn human–AI collaborations, with pre/post measurements and two types of prompting training (general or sycophancy-focused).

Methodological description in the paper's abstract/summary.

high null result The Hidden Cost of Contextual Sycophancy: an AI Literacy Int... study design / methodological description

Reported empirical values are transformed through transparent indicators such as relative growth, CAGR, growth multipliers, stock-flow ratios, concentration ratios, and HHI.

Methodological description and application in the paper listing these specific indicators used to summarize public data on AI investment, adoption, robots, compute, and labour-market reallocation.

high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... data transformation / indicator usage
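Several of the indicators named in the entry above have conventional textbook forms, sketched below. This is an illustration under stated assumptions: the function names are hypothetical, the excerpt does not give the paper's exact definitions, and the stock-flow ratio is omitted because its numerator and denominator are not specified here.

```python
def relative_growth(start: float, end: float) -> float:
    """Relative change from start to end, as a fraction of start."""
    return (end - start) / start


def growth_multiplier(start: float, end: float) -> float:
    """How many times larger the end value is than the start value."""
    return end / start


def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1.0 / years) - 1.0


def concentration_ratio(values: list[float], k: int) -> float:
    """Combined share of the k largest entries (CR-k), on a 0-1 scale."""
    total = sum(values)
    top_k = sorted(values, reverse=True)[:k]
    return sum(top_k) / total


def hhi(values: list[float]) -> float:
    """Herfindahl-Hirschman index: sum of squared shares, on a 0-1 scale."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)


# Hypothetical figures for illustration only (not taken from the paper).
investment_by_region = [40.0, 30.0, 20.0, 10.0]
print(cagr(100.0, 121.0, 2))
print(concentration_ratio(investment_by_region, 2))
print(hhi(investment_by_region))
```

HHI is sometimes reported on a 0-10,000 scale by using percentage shares; multiplying the value returned here by 10,000 gives that form.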

The study uses a conceptual-empirical quantitative diagnostic design rather than a causal econometric model.

Explicit methodological statement in the paper describing the design choice and rejecting causal econometric modeling in favor of diagnostics using public institutional data and transparent indicators.

high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... study methodology (diagnostic vs causal modeling)

The agentic economy is not yet a completed global order, but its transition pressure is measurable enough to require a distinct economic vocabulary, reproducible diagnostics, and future sector-level measurement.

Synthesis of diagnostic indicators (AI investment/adoption trends, robot stock, compute-energy coupling, labour reallocation measures) showing measurable transition pressures; conclusion drawn from the conceptual-empirical diagnostic.

high null result The Agentic Economy: Humans, AI Agents, Robots, and the Meas... degree of completion of 'agentic economy' transition / measurability of transiti...

Following PRISMA 2020 guidelines, searches across Google Scholar, Web of Science, Scopus, ScienceDirect, and CNKI yielded 1,562 initial records, of which 21 studies published between 2019 and 2026 met inclusion criteria.

Methodological description of the systematic literature review reported in the paper: initial records = 1,562; included studies = 21; publication years 2019–2026.

high null result Application of Artificial Intelligence in Human Resource Man... number of records screened and studies included

Small and medium-sized enterprises (SMEs) constitute over 98.5% of businesses in many economies including China.

Descriptive statistic reported in the paper's background/intro; source of the statistic not specified within the summary provided.

high null result Application of Artificial Intelligence in Human Resource Man... share of businesses that are SMEs

This study analyzes developments through April 2026.

Explicit timeframe statement in the paper's summary/introduction.

high null result AI for Auto-Research: Roadmap & User Guide temporal coverage of the review/analysis

The authors provide source code for their framework on GitHub to encourage further research.

Statement in the paper that the source code is available on GitHub; verifiable by visiting the repository (link not provided in the excerpt).

high null result Modelling Customer Trajectories with Reinforcement Learning ... availability of implementation/source code

Heuristics such as TSP and PNN are commonly used as inexpensive approximations for customer trajectories.

Descriptive claim about common practice cited in the paper; used as motivation for proposing the RL approach (no quantitative survey evidence provided in the excerpt).

high null result Modelling Customer Trajectories with Reinforcement Learning ... use of heuristic methods (TSP, PNN) for trajectory approximation

We conducted a randomized controlled experiment in which participants—analogs of early-career knowledge workers—were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance.

Statement of experimental design in the paper (randomized controlled experiment assigning participants to either traditional resources or LLM assistance; participants described as analogs of early-career knowledge workers).

high null result Generative AI and the Productivity Divide: Human-AI Compleme... experimental assignment / study design (treatment vs control)

Results remain robust across checks.

Robustness checks reported by the authors (unspecified in abstract) that do not overturn the main findings.

high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... robustness of core findings (debt financing cost increase for AI washing firms)

China's 14th Five Year Plan (FYP) is used as a quasi-natural experiment / strategic policy shock to study effects of AI washing.

Research design leverages the FYP announcement as an exogenous policy shock in a difference-in-differences framework (design claim; no sample size in abstract).

high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... policy shock (use of FYP as quasi-experiment)

AI washing is identified as the residual between AI narrative intensity and patent output.

Constructed a firm-level AI washing proxy by regressing AI narrative intensity on patent output and using the residual; described as the study's measurement approach (no sample size reported in the abstract).

high null result Dissipation of Debt Financing Privilege on Corporate AI Wash... AI washing measure (residual between narrative intensity and patent output)
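The residual-based proxy described in the entry above can be sketched as a one-regressor ordinary least squares fit. This is a simplified illustration, not the paper's specification: the data are hypothetical, the function name is an assumption, and the paper may include controls or fixed effects that the excerpt does not describe.

```python
def ols_residuals(x: list[float], y: list[float]) -> list[float]:
    """Residuals from regressing y on x (with intercept) by least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical firm-level values (NOT from the paper).
patent_output = [1.0, 2.0, 3.0, 4.0, 5.0]
narrative_intensity = [2.0, 2.5, 6.0, 4.5, 5.0]

# A positive residual means more AI talk than patent output would predict,
# which is the sense in which the entry uses the residual as a washing proxy.
washing_proxy = ols_residuals(patent_output, narrative_intensity)
for firm, resid in enumerate(washing_proxy):
    print(firm, round(resid, 3))
```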

Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.

Prescriptive conclusion derived from the observed cross-configuration heterogeneity in the paper's empirical results.

high null result Same Signal, Different Semantics: A Cross-Framework Behavior... validity/generalizability of behavioral findings across agent configurations

Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%.

Variance decomposition / explained-variance analysis reported for 'mean turns' across configurations (reported percentages: 64% vs 10%).

high null result Same Signal, Different Semantics: A Cross-Framework Behavior... mean turns (average number of turns per task)
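One common way to express "share of between-configuration variance explained by a factor" is the ratio of the between-group sum of squares to the total sum of squares across configuration means. The sketch below assumes that reading; the excerpt does not state which decomposition the paper uses, and the configurations and values are hypothetical.

```python
from collections import defaultdict


def explained_share(values: list[float], labels: list[str]) -> float:
    """Between-group sum of squares divided by total sum of squares."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    groups: dict[str, list[float]] = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in groups.values()
    )
    return ss_between / ss_total


# Hypothetical configurations: (framework, LLM family, mean turns).
configs = [
    ("framework_a", "llm_x", 10.0),
    ("framework_a", "llm_y", 12.0),
    ("framework_b", "llm_x", 20.0),
    ("framework_b", "llm_y", 22.0),
]
mean_turns = [c[2] for c in configs]
share_framework = explained_share(mean_turns, [c[0] for c in configs])
share_llm = explained_share(mean_turns, [c[1] for c in configs])
print(f"framework: {share_framework:.1%}, LLM: {share_llm:.1%}")
```

In this balanced toy design the two shares sum to one; in an unbalanced design or one with interactions they need not, which is compatible with the reported 64% and 10% leaving a remainder.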

The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations.

Methods description in the paper: experimental design holding LLM or framework fixed to disentangle effects.

high null result Same Signal, Different Semantics: A Cross-Framework Behavior... behavior–outcome effects per configuration (methodological approach)
