Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
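
The counts in the matrix can be turned into per-outcome shares with simple arithmetic. The sketch below is illustrative only: it copies five rows from the table above, computes the positive share of each row's Total, and reports any gap between Total and the sum of the four direction columns. The function name `summarize` and the field name `unclassified` are hypothetical, and the page does not explain what a non-zero gap represents.

```python
# Rows copied from the Evidence Matrix above:
# (outcome, positive, negative, mixed, null, total)
ROWS = [
    ("Other", 758, 199, 100, 900, 2007),
    ("Firm Productivity", 435, 57, 88, 20, 606),
    ("AI Safety & Ethics", 218, 277, 65, 33, 599),
    ("Job Displacement", 12, 80, 20, 1, 113),
    ("Hiring & Recruitment", 52, 7, 8, 3, 70),
]


def summarize(row):
    """Return the positive share of Total and the gap between Total
    and the sum of the four direction columns (hypothetical helper)."""
    outcome, positive, negative, mixed, null, total = row
    directed = positive + negative + mixed + null
    return {
        "outcome": outcome,
        "positive_share": positive / total,
        # Claims counted in Total but in none of the four direction columns.
        "unclassified": total - directed,
    }


if __name__ == "__main__":
    for row in ROWS:
        s = summarize(row)
        print(f"{s['outcome']}: positive share {s['positive_share']:.1%}, "
              f"gap to Total {s['unclassified']}")
```

For some rows the four direction counts add up exactly to Total; for others they fall short, so shares computed against Total and against the row sum can differ slightly.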

This study analyzes 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework supplying tools and workflow.

Dataset and experimental design reported in the paper: 64,380 runs; 126 configurations; 43 frameworks.

high null result Same Signal, Different Semantics: A Cross-Framework Behavior... number of benchmark runs / experimental scale

The paper's contribution is an evaluation and benchmark paradigm (discipline stability / trace-based evaluation), not a new optimizer or a universal claim about MARL.

Author statement in the abstract/summary clarifying the contribution is methodological (evaluation/benchmark) rather than proposing a new optimizer or making universal claims about multi-agent RL.

high null result When Outcome Looks Right But Discipline Fails: Trace-Based E... scope of contribution (evaluation paradigm vs. optimizer/new universal claim)

The formal semantics and proof-checked admission model are specified and under active development, with evaluation of the verified core reserved for future work.

Author statement in the paper about the current development status and that evaluation of the verified core is deferred to future work.

high null result GraphFlow: An Architecture for Formally Verifiable Visual Wo... development status and lack of current evaluation

Reward is non-positive in the CybORG CAGE-2 environment, so all configurations operate in a failure-mitigation mode.

Environment specification reported in the paper (CybORG CAGE-2 modeled as a POMDP with non-positive reward structure).

high null result Context, Reasoning, and Hierarchy: A Cost-Performance Study ... sign and interpretation of reward

The evaluation spanned five model families, six models, and twelve configurations, totaling 3,475 episodes with token-level cost accounting.

Methods description in the paper reporting the experimental design and sample counts.

high null result Context, Reasoning, and Hierarchy: A Cost-Performance Study ... study scope (models, configurations, episodes)

Skills can be mapped into three categories: those AI is absorbing, those needed to work alongside AI today, and those that make humans irreplaceable tomorrow.

Conceptual taxonomy offered in the chapter, based on labour market data and workplace evidence; presented as an analytical framework rather than a quantified finding.

high null result 7. AI and the Future of Work classification of skills relative to AI impact

Fear and hype about technological transitions are temporary.

One of five lessons drawn from historical analogy and labour market history as presented in the chapter.

high null result 7. AI and the Future of Work duration of public fear/hype following technological change

Virtually every job is being touched by AI.

Stated in chapter summary; claimed on the basis of labour market data and emerging workplace evidence (no numeric sample given in excerpt).

high null result 7. AI and the Future of Work incidence of AI affecting jobs

Only 9% of jobs are fully automatable.

Reported directly in chapter; based on labour market data (specific data source and sample size not stated in the excerpt).

high null result 7. AI and the Future of Work share of jobs fully automatable

AI automates tasks, not jobs.

Conceptual argument in chapter drawing on labour market data and historical analogy; presented as a framing claim rather than a specific empirical estimate.

high null result 7. AI and the Future of Work unit of automation (tasks vs jobs)

These factors evolve over time, have inter-dependencies across multiple resource dimensions, and generally do not lend themselves to closed-form analysis.

Methodological observation motivating simulation/sequence-based evaluation; asserted in the paper's rationale.

high null result Designing Datacenter Power Delivery Hierarchies for the AI E... tractability of closed-form analysis for power delivery design

Higher sectoral digitalization potential (telework feasibility and digital intensity) does not significantly affect aggregate employment levels.

Difference-in-differences (DiD) analysis using the COVID-19 shock as a quasi-natural experiment on a quarterly panel for 27 EU Member States (2018–2024), N = 36,685; reported DiD coefficient = 0.06, p ≈ 0.98.

high null result Digital transformation and labor market indicators in the EU... aggregate employment levels
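
For readers unfamiliar with difference-in-differences, the estimator in its simplest two-group, two-period form can be sketched as follows. This is not the paper's specification, which uses a quarterly panel of 27 EU Member States; the function name `did_estimate` and all numbers are made up for illustration. The example is constructed so that both groups change by the same amount, which gives an estimate of zero.

```python
from statistics import mean


def did_estimate(pre_treated, post_treated, pre_control, post_control):
    """Two-by-two difference-in-differences on group means.

    Returns (change in the treated group) minus (change in the control group).
    Hypothetical helper, not the paper's panel regression.
    """
    treated_change = mean(post_treated) - mean(pre_treated)
    control_change = mean(post_control) - mean(pre_control)
    return treated_change - control_change


if __name__ == "__main__":
    # Illustrative employment figures (arbitrary units), NOT data from the study.
    # "Treated" stands in for sectors with high digitalization potential,
    # "control" for sectors with low potential; "post" is after the shock.
    high_pre, high_post = [10.0, 12.0], [11.0, 13.0]
    low_pre, low_post = [8.0, 9.0], [9.0, 10.0]
    print(did_estimate(high_pre, high_post, low_pre, low_post))
```

A coefficient close to zero with a large p-value, as reported above, corresponds to the two groups moving roughly in parallel after the shock.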

The study used a structured questionnaire (five-point Likert) administered to employees in AI-enabled organizations across various sectors and analyzed the data using SPSS (descriptive statistics, reliability analysis, correlation analysis, regression analysis).

Methods section summary provided in the paper (survey instrument description and analytical techniques).

high null result Opportunities and Challenges of Human-AI Collaboration in W... methodological approach / data collection and analysis procedures

The convergence properties of the explore-then-exploit pricing pipeline can be characterized via a fluid-limit ordinary differential equation (ODE) analysis.

Analytical method used in the paper: fluid-limit ODE analysis applied to the multi-firm explore-then-exploit model to study convergence.

high null result Misspecified Explore-then-Exploit Leads to Supra-Competitive... convergence behavior of prices under the pricing pipeline

Firms following an explore-then-exploit pipeline randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter; the estimation relies on a misspecified, monopoly-style model that omits competitors' prices.

Model specification and assumptions described in the paper (methodological setup).

high null result Misspecified Explore-then-Exploit Leads to Supra-Competitive... pricing algorithm structure (exploration then myopic exploitation based on missp...
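
The pipeline described above can be sketched in a few lines. Everything specific here is an assumption made for illustration, not taken from the paper: two symmetric firms, linear demand q_i = a - b*p_i + c*p_j plus noise, zero marginal cost, uniform exploration prices, and a single estimation step after exploration. Each firm fits a line to its own price and quantity only, omitting the competitor's price, and then sets the price that maximizes revenue under that fitted model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def explore_then_exploit(a=10.0, b=2.0, c=1.0, periods=200,
                         low=1.0, high=3.0, noise=0.1, seed=0):
    """Return the two firms' post-exploration prices (illustrative parameters)."""
    rng = random.Random(seed)
    # Exploration phase: both firms randomize prices independently.
    p1 = [rng.uniform(low, high) for _ in range(periods)]
    p2 = [rng.uniform(low, high) for _ in range(periods)]
    # True demand depends on BOTH prices.
    q1 = [a - b * x + c * y + rng.gauss(0.0, noise) for x, y in zip(p1, p2)]
    q2 = [a - b * y + c * x + rng.gauss(0.0, noise) for x, y in zip(p1, p2)]

    prices = []
    for own_prices, own_quantities in ((p1, q1), (p2, q2)):
        # Misspecified monopoly-style model: quantity regressed on own price only.
        intercept, slope = fit_line(own_prices, own_quantities)
        if slope >= 0:
            raise ValueError("estimated demand is not downward sloping")
        # Myopic choice: maximize p * (intercept + slope * p) at zero cost.
        prices.append(intercept / (-2.0 * slope))
    return prices


if __name__ == "__main__":
    print(explore_then_exploit())
```

Because the competitor's price is left out, its average effect is absorbed into each firm's estimated intercept. The sketch stops after one estimation step and does not reproduce the paper's convergence analysis.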

We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform.

Statement in abstract: evaluation across 35 agents over a three-week deployment on Yellow.ai V3 platform (empirical deployment described).

high null result PRISM: Prompt Reliability via Iterative Simulation and Monit... deployment evaluation sample and duration

A four-dimensional Flexibility Index is developed to assess reallocation authority, forecast cycles, AI integration, and transparency.

Methods section: construction of an index with four dimensions (reallocation authority, forecast cycles, AI integration, transparency).

high null result Budgeting for Agility: A Cross-Sectoral Analysis of Fiscal F... budget flexibility (measured via Flexibility Index)

The analysis draws on Form 10-K filings from Microsoft, Johnson & Johnson, Procter & Gamble, and ExxonMobil (2019–2023), alongside public sector data from the Open Budget Survey 2023, the OECD Budget Practices Database, and U.S. GAO oversight reports.

Methods/data section listing data sources and firm sample (four named firms, 2019–2023) and public datasets.

high null result Budgeting for Agility: A Cross-Sectoral Analysis of Fiscal F... data sources and sample composition

The study investigates the non-linear impact of AI on economic growth in 19 G20 countries (2005–2023) using the Generalized Method of Moments (GMM) with both linear and quadratic models.

Methodological description provided in the paper: panel dataset covering 19 G20 countries over 2005–2023 and estimation via GMM with linear and quadratic specifications.

high null result Artificial intelligence and economic growth in G20 economies... other

The paper constructs estimators for the own-adoption, spillover, and total effects and an inference procedure that allows for spatial dependence.

Presentation of concrete estimators and an inference procedure in the paper; the inference approach explicitly accommodates spatial dependence (methodological contribution).

high null result Identification and Estimation of Staggered Difference-in-Dif... estimator definitions and inference procedure robustness to spatial dependence

Spillover effects are learned from never-treated units and evaluated for treated cohorts under the exposure distribution they face.

Methodological procedure in the paper: estimation of spillover effects using never-treated units as the source of variation, then applying those estimates to treated cohorts based on their observed exposure distributions.

high null result Identification and Estimation of Staggered Difference-in-Dif... spillover effect estimation strategy (learning from never-treated units)

Identification uses a prespecified summary of spillover exposure and parallel trends comparisons among units with the same exposure at the baseline and target dates.

Identification strategy articulated in the paper: assumption of a prespecified exposure summary and use of parallel trends comparisons conditional on equal exposure profiles at baseline and event dates.

high null result Identification and Estimation of Staggered Difference-in-Dif... identification of causal effects under specified exposure summaries and parallel...

For each treated cohort and event time, the framework separates the effect of own adoption, the spillover effect generated by other adopters, and the total effect under the realized rollout.

Analytical decomposition provided in the paper that defines separate estimands for (i) own-adoption effect, (ii) spillover effect from other adopters, and (iii) total realized effect for cohorts and event times.

high null result Identification and Estimation of Staggered Difference-in-Dif... decomposition of treatment effects into own adoption, spillover, and total effec...

The paper develops a difference-in-differences framework for staggered policy adoption when units can be affected by other units' adoption.

Theoretical development in the paper: presentation of a DID framework that explicitly allows units to be affected by other units' adoption (methodological derivation and formal description).

high null result Identification and Estimation of Staggered Difference-in-Dif... availability of an econometric framework for staggered adoption with spillovers

IIQ is positioned as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation.

Explicit positioning statement in paper: authors state scope and limits of IIQ as deployment/usage metric rather than capability or causal productivity estimator (conceptual/positioning).

high null result Intelligence Impact Quotient (IIQ): A Framework for Measurin... scope/limitations (not measuring model capability or causal productivity)

Sources were selected purposively through explicit inclusion and exclusion criteria tied to conceptual relevance, scholarly quality, and direct contribution to framework building; higher-order categories were retained only after iterative comparison across the four literature streams.

Author-reported sampling and analytic procedure for the integrative review.

high null result RegTech-enabled governance of sanctions-safe enterprise ecos... review source selection and analytic procedure

Methodologically, the paper uses a structured integrative review combined with interpretive theory synthesis to connect literature on RegTech, sanctions compliance, institutional voids, supply chain governance, and algorithmic accountability.

Explicit methodological description in the paper (authors' stated approach).

high null result RegTech-enabled governance of sanctions-safe enterprise ecos... methodological approach used

Existing studies on regulatory technology mainly present it as a firm-level compliance tool, giving little attention to its role in shaping coordination across wider enterprise ecosystems in post-conflict and sanctions-affected settings.

Review finding based on purposive selection and comparison of literature on RegTech and related fields (method: structured integrative review and interpretive theory synthesis).

high null result RegTech-enabled governance of sanctions-safe enterprise ecos... scope of RegTech literature (firm-level focus vs ecosystem coordination)

The study uses World Bank Enterprise Survey firm-level data from 2007 to 2024 and employs feasible generalized least squares (FGLS), robust ordinary least squares (OLS), and high-dimensional fixed effects (HDFE) linear regression techniques.

Direct methodological statement in the paper's abstract/summary. This is a descriptive factual claim about data and methods.

high null result Estimation of Firm Labour Productivity and Sales Growth from... data source and econometric methods

AI deployment has limited effects on retrial rates.

Same randomized field experiment; retrial rates (repeat customer contacts) were measured and reported as showing limited/no substantive change under AI deployment.

high null result Agentic AI and Human-in-the-Loop Interventions: Field Experi... retrial rates (repeat contact rate)

The findings are based on India-focused samples.

Paper explicitly notes the sample/context is India-focused.

high null result Enhancing Forensic Accounting Practice: A Proactive Risk Man... geographic scope of sample

PRIF was developed and validated using a mixed-method design: interviews with 30 risk advisors, case studies, and analysis of 30 forensic reports, with validation via thematic coding, risk metrics, and Delphi panel refinement.

Reported methods in the paper: mixed-method design including 30 risk advisor interviews and analysis of 30 forensic reports; validation methods named (thematic coding, risk metrics, Delphi panel).

high null result Enhancing Forensic Accounting Practice: A Proactive Risk Man... methodological validation and sample description

Five structural characteristics define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring.

Theoretical specification and definition of five characteristics grounded in social science, philosophy, and humanitarian practice; no empirical prevalence or measurement reported.

high null result Metis AI: The Overlooked Middle Zone Between AI-Native and W... defining properties of Metis tasks

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required).

Statement in paper framing prevailing discourse; conceptual observation rather than empirical test (literature critique). No sample size reported.

high null result Metis AI: The Overlooked Middle Zone Between AI-Native and W... framing of AI capability boundary

Including the 2020–2021 COVID-19 lockdowns makes it possible to use the pandemic to separate structural inequalities from transient market shocks.

Design choice: use of data spanning 2016–2021, including pandemic lockdown period, to separate persistent structural disparities from short-term shock effects.

high null result The Broken Shield of European Palliative Care: Evidence from... Ability to distinguish structural inequalities from transient shocks using pre/p...

Neither survey nor transcript-based measures of participation equity improved under LLM facilitation (an "illusion of inclusion").

Quantitative survey measures and transcript-based analyses of participation equity (e.g., measures of turn-taking, speaking/typing share) showed no improvement in equity metrics for facilitated conditions compared to controls across the experiments.

high null result Real-Time Group Dynamics with LLM Facilitation: Evidence fro... participation equity (survey and transcript-derived measures of participation ba...

Across both studies, LLM facilitation did not significantly improve group consensus.

Experimental comparison across the two studies (total N=879) measuring agreement/consensus metrics for groups randomized to LLM facilitation versus other facilitators or no facilitation; reported null effect on consensus.

high null result Real-Time Group Dynamics with LLM Facilitation: Evidence fro... group consensus (agreement level among group members)

Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline.

Study 2 comprised N=675 participants (groups of three) randomized to different LLM facilitation strategies and a no-facilitation control.

high null result Real-Time Group Dynamics with LLM Facilitation: Evidence fro... comparison of facilitation strategies vs no-facilitation

Study 1 (N=204) compares three frontier LLMs as facilitators.

Study 1 comprised N=204 participants (groups of three) randomized to facilitator conditions comparing three frontier language models.

high null result Real-Time Group Dynamics with LLM Facilitation: Evidence fro... comparison of facilitator LLM models

We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes ($7,200 USD).

Two online experiments involving real-time, text-based group deliberation. Total participants N=879 in groups of three; total monetary stakes for the charity allocation task equal $7,200 USD.

high null result Real-Time Group Dynamics with LLM Facilitation: Evidence fro... experiment setup (incentive-compatible charity allocation, total stakes $7,200 U...

The study used a qualitative interpretivist research design drawing on semistructured interviews with 28 managers and professionals from 12 organizations across technology, finance and knowledge-intensive service sectors in Europe and Asia, using thematic and interpretive analysis supported by organizational document review.

Methodology statement from the paper (explicit description of sample, sectors, regions and analytic approach).

high null result Reimagining work in the age of intelligent automation: a qua... research design and sample characteristics

AI should be conceptualized as a co-evolving organizational capability rather than a deterministic technology.

Argument developed from interpretive analysis of interview data (n=28), literature engagement and organizational document review.

high null result Reimagining work in the age of intelligent automation: a qua... conceptual framing of AI within organizations

The study develops an emergent framework of AI–human co-adaptation comprising three interrelated dimensions: technological alignment, cognitive calibration and ethical anchoring.

Framework derived from thematic/interpretive analysis of interview data (n=28) and supporting organizational documents.

high null result Reimagining work in the age of intelligent automation: a qua... dimensions of AI–human co-adaptation

The paper introduces the concept of 'augmented work agency' as a multi-level, interpretive form of human agency in algorithmically mediated environments.

Conceptual development within the paper grounded in literature review and qualitative interview data (28 participants) and organizational document review.

high null result Reimagining work in the age of intelligent automation: a qua... agency, control and coordination in algorithmic workplaces

This study used a three-wave lagged survey design with 381 valid matched employees from knowledge-intensive firms in China.

Methods statement in paper reporting study design and sample composition: three-wave lagged survey and 381 valid matched employee responses from knowledge-intensive Chinese firms.

high null result The impact of generative artificial intelligence (GenAI) usa... study sample and design (methodological description)

The overall impact of prompt design on readability remains limited.

Reported results from prompt-dimension experiments indicating that while some prompt elements influence readability, the aggregate effect size of prompt engineering on overall readability was limited.

high null result The Readability Spectrum: Patterns, Issues, and Prompt Effec... overall_effect_of_prompt_design_on_readability

Current LLMs produce code with overall readability comparable to human-written code.

Comparison of readability scores (from the paper's readability model) between LLM-generated code and human-written code across 5,869 scenarios; reported summary conclusion that overall readability is comparable.

high null result The Readability Spectrum: Patterns, Issues, and Prompt Effec... code_readability (overall/readability score)

The analysis proceeded through within-case coding and cross-case pattern matching across five dimensions: intelligence source, AI mechanism, decision domain, economic implication, and boundary condition.

Method section describing coding and analytical procedures applied to the archival corpus across the four cases.

high null result Artificial Intelligence Enabled Competitive Intelligence as ... analytic method (coding and cross-case pattern matching across specified dimensi...

The empirical corpus comprises annual reports, 10-K filings, earnings releases, and official corporate materials published mainly between 2024 and 2026, complemented by recent peer-reviewed literature.

Paper's data description listing document types and time window for archival evidence; number of documents not enumerated.

high null result Artificial Intelligence Enabled Competitive Intelligence as ... composition and timeframe of empirical corpus (document types and years)

The study adopts a qualitative comparative multiple-case design using four theoretically sampled cases: Walmart, Unilever, Sprinklr, and DoubleVerify.

Methodological statement in the paper describing case selection and study design.

high null result Artificial Intelligence Enabled Competitive Intelligence as ... study design and sample (case selection)
