Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
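
The matrix can be read as per-outcome shares. The sketch below copies five rows from the table and computes, for each, the share of claims with a positive finding and any gap between the stated total and the sum of the four direction columns. The row selection, the function name `summarize`, and the label "unlisted" are illustrative choices, not part of the source; the source does not explain why some totals exceed the column sums.

```python
# Rows copied from the Evidence Matrix: (Positive, Negative, Mixed, Null, Total).
# None marks a cell the table shows as a dash.
ROWS = {
    "Other": (749, 195, 97, 889, 1979),
    "Governance & Regulation": (815, 391, 188, 121, 1539),
    "AI Safety & Ethics": (214, 276, 65, 33, 593),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Labor Share of Income": (17, 17, 17, None, 51),
}


def summarize(row):
    """Return (positive share of total, total minus the four listed columns)."""
    *directions, total = row
    counts = [c or 0 for c in directions]  # treat a dash as zero
    positive_share = counts[0] / total
    residual = total - sum(counts)
    return positive_share, residual


for name, row in ROWS.items():
    share, residual = summarize(row)
    print(f"{name}: positive share {share:.1%}, unlisted {residual}")
```

A nonzero residual only shows that the four columns do not account for every claim in that row; it says nothing about what the remaining claims are.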

Artificial intelligence enhances analytics, automates routine tasks, personalizes interactions, and supports decision-making.

Aggregate finding reported in the abstract based on thematic synthesis of the reviewed literature (160 articles).

high positive The implementation of artificial intelligence in organizatio... organizational_capabilities (analytics, automation, personalization, decision_su...

There are convergent patterns of AI adoption in human resources, marketing and customer services, logistics, and finance.

Synthesis claim from the systematic review of the 160 included peer‑reviewed articles as reported in the abstract.

high positive The implementation of artificial intelligence in organizatio... patterns_of_adoption_across_functions

Clear specifications, explicit governance, and ongoing human-AI collaboration are critical for successful scaling of regression automation.

Conclusions and recommendations derived from the case study's lessons and mixed-method evaluation.

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... success of scaling regression automation / effectiveness of human-AI teaming

The Copilot achieves 30-50% code reuse when generating candidate test scripts.

Quantitative result reported in the paper's evaluation (stated 30-50% code reuse in the abstract/summary).

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... code reuse in generated test scripts

Mixed-method evaluation shows the AI accelerates script authoring and increases throughput.

Empirical claim based on the paper's mixed-method evaluation (qualitative and quantitative data reported in the case study); specific sample sizes not provided in the summary.

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... script authoring speed and throughput

Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations.

Introductory/position statement in the paper; general premise motivating the case study (no specific empirical test reported).

high positive Human-AI Collaboration for Scaling Agile Regression Testing:... ability to maintain rapid, high-quality delivery

AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers.

Empirical evaluation on MLE-Bench reported in the paper (benchmark ranking, metric = medal rate).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models medal rate (task success rate) on MLE-Bench

AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization; each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches.

System architecture description in the paper (methods/architecture section).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models system architecture and claimed capabilities (multistep reasoning, tool use, end...

We introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data.

Methodological contribution: system design and implementation described in the paper (introduction/methods).

high positive AIBuildAI: An AI Agent for Automatically Building AI Models ability to produce AI models from task descriptions and training data

This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.

Interpretive result from the paper's qualitative thematic analysis of survey responses (n=860), labeled by the authors as the 'bounded delegation' pattern.

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built preferred boundary of automation / delegation

Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.

Reported constraints and desiderata from the thematic analysis of survey responses (n=860).

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built desired governance/security features for AI tools (authority scoping, provenance...

Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.

Thematic findings from the paper's human-in-the-loop, multi-model council-based analysis of survey responses (n=860).

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built requested placement/timing of quality signals in developer workflow

Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.

Qualitative analysis method described in the paper applied to the survey responses (n=860); result reported as identification of 22 desired AI systems organized into five categories.

high positive To Copilot and Beyond: 22 AI Systems Developers Want Built catalog of desired AI systems and task categories

For listed firms, AI patents command a robust market-value premium in both countries.

Firm-level analysis linking AI patenting to market valuation for listed firms in both countries (regression or valuation analysis implied by statement).

high positive AI Patents in the United States and China: Measurement, Orga... market-value premium for listed firms associated with AI patents

China surpasses the United States in recent annual AI patent counts.

Time-series patent count comparison using classifier-applied corpora (paper reports that recent annual counts are higher for China than the U.S.).

high positive AI Patents in the United States and China: Measurement, Orga... annual number of AI patents (patent counts)

There is broad convergence in AI patenting intensity and subfield composition between the United States and China.

Comparative analysis of AI patenting intensity and subfield composition across the two patent corpora (US 1976-2023, China 2010-2023) reported in paper.

high positive AI Patents in the United States and China: Measurement, Orga... AI patenting intensity and distribution across AI subfields

Applying the classifier to granted U.S. patents (1976-2023) and Chinese patents (2010-2023), we document rapid growth in AI patenting in both countries.

Application of classifier to full corpora of granted U.S. patents (1976-2023) and Chinese patents (2010-2023); time-series counts of AI patents reported.

high positive AI Patents in the United States and China: Measurement, Orga... number of granted AI patents over time (patent counts)

The classifier generalizes well to Chinese patents based on citation and lexical validation.

Validation analyses described as citation-based and lexical validation applied to Chinese patents (paper states generalization to Chinese patents via these validation methods).

high positive AI Patents in the United States and China: Measurement, Orga... generalization / validity of classifier on Chinese patents

Our classifier substantially improves the existing USPTO approach, achieving 97.0% precision, 91.3% recall, and a 94.0% F1 score.

Reported classifier evaluation metrics (precision, recall, F1) presumably on held-out test data; comparison stated against the existing USPTO approach.

high positive AI Patents in the United States and China: Measurement, Orga... classification performance (precision, recall, F1)
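
The three reported metrics are linked by the standard definition of F1 as the harmonic mean of precision and recall. The sketch below recomputes F1 from the rounded figures quoted above; the function name is illustrative, and the paper's own evaluation code is not shown in the excerpt.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as quoted in the claim.
f1 = f1_score(0.970, 0.913)
print(f"F1 from rounded inputs: {f1:.4f}")

# Lowest F1 compatible with inputs that round to 97.0% and 91.3%.
f1_low = f1_score(0.9695, 0.9125)
print(f"Lower bound given input rounding: {f1_low:.4f}")
```

The rounded inputs give roughly 0.941, while unrounded inputs at the low end of their rounding intervals give roughly 0.940, so the reported 94.0% is consistent with the stated precision and recall once rounding is allowed for.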

We develop a high-precision classifier to measure artificial intelligence (AI) patents by fine-tuning PatentSBERTa on manually labeled data from the USPTO's AI Patent Dataset.

Methodological description in paper: fine-tuning PatentSBERTa on manually labeled USPTO AI Patent Dataset (manually labeled training data and model fine-tuning stated).

high positive AI Patents in the United States and China: Measurement, Orga... ability to classify patents as AI-related (classifier development)

The results demonstrate the importance of considering interacting systems of AI agents when doing both capabilities and safety research.

Authors' interpretation/generalization based on experimental findings comparing multi-agent organizations and single agents across tasks and settings.

high positive AI Organizations are More Effective but Less Aligned than In... research priorities/considerations for capabilities and safety research (implica...

BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.

Design claim in abstract describing the benchmark's automated scoring system and rubric size (100+ criteria) defined by expert bankers.

high positive BankerToolBench: Evaluating AI Agents in End-to-End Investme... number of rubric criteria for automated evaluation

Substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments.

Conclusion drawn from the authors' empirical study and the reported final-system performance; presented as a general methodological claim (supporting data referenced in paper but not detailed in excerpt).

high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... effectiveness of economic penalties as an alignment method

The final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase.

Quantitative performance result reported for the mature phase of the system in the paper's abstract; Sharpe ratio provided as a single-number metric (no sample size, number of trading periods, or statistical significance reported in the excerpt).

high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... annualized Sharpe ratio
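
For readers unfamiliar with the metric, an annualized Sharpe ratio is conventionally the mean excess return per period divided by its standard deviation, scaled by the square root of the number of periods per year. The sketch below assumes daily returns, 252 trading days per year, and a zero risk-free rate; these are assumptions for illustration, since the excerpt does not state how the paper computed its figure.

```python
import math
import statistics


def annualized_sharpe(returns, risk_free_per_period=0.0, periods_per_year=252):
    """Mean excess return over sample standard deviation, annualized."""
    excess = [r - risk_free_per_period for r in returns]
    mean = statistics.fmean(excess)
    stdev = statistics.stdev(excess)  # sample standard deviation (n - 1)
    return mean / stdev * math.sqrt(periods_per_year)


# Hypothetical daily returns, not data from the paper.
sample_returns = [0.01, -0.01, 0.02, 0.0]
print(annualized_sharpe(sample_returns))
```

Because the ratio depends on the sampling frequency and the length of the window, a single number such as 2.06 is hard to interpret without the number of periods behind it, which is the gap the note above points out.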

The MAS abandoned overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix.

Design and outcome claim in the paper: introduction of STDAW/RO-Lock and reported enforcement of a ≥95% code coverage constraint as part of the aligned architecture (qualitative description with a stated coverage threshold).

high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... code coverage (>=95%) and reduction in hallucinations / overfitting

The system evolved from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture over the course of the study.

Reported longitudinal observations from the 20-month empirical study described in the paper (qualitative system evolution claim; no numeric counts provided in excerpt).

high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... system architecture and behaviour (turnover rate, sycophancy, liquidity awarenes...

We introduce Out-of-Money Reinforcement Learning (OOM-RL): deploying agents into the non-stationary, high-friction reality of live financial markets to utilize capital depletion as an un-hackable negative gradient.

Methodological claim / novel paradigm introduced by the paper; described as implemented in the study (no numerical sample size given in excerpt).

high positive OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Al... use of financial loss (capital depletion) as negative training signal for agent ...

Established regional telcos and banks are leveraging proprietary data to develop digital loan products.

Observations and interviews from the nine-month ethnography describing practices of regional telcos and banks in Nairobi developing digital loan products using proprietary data.

high positive Risk, Data, Alignment: Making Credit Scoring Work in Kenya use of proprietary data by telcos and banks to create digital loan products (ado...

For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict

Explicit reproducibility statement and URL provided in the paper.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... data_and_code_availability

SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process?

Statement of research goals and scope in the paper introducing the SciPredict benchmark and accompanying evaluations.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... research_questions_addressed

Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment.

Reported stratified accuracy of human experts on SciPredict tasks by self-reported predictability judgments; accuracy rises from ≈5% (when judged not predictable) to ≈80% (when judged predictable).

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... calibration_of_human_confidence_vs_accuracy
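
The calibration check described here amounts to grouping predictions by the expert's own predictability rating and computing accuracy within each group. A minimal sketch follows; the rating labels and the records are hypothetical and are not taken from the SciPredict data.

```python
from collections import defaultdict


def accuracy_by_bin(records):
    """records: iterable of (predictability_rating, was_correct) pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rating, correct in records:
        totals[rating] += 1
        hits[rating] += int(correct)
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}


# Hypothetical records: rating 1 = "not predictable", 3 = "predictable".
toy_records = [
    (1, False), (1, False), (1, True), (1, False),
    (2, True), (2, False),
    (3, True), (3, True), (3, True), (3, False),
]
print(accuracy_by_bin(toy_records))
```

A well-calibrated forecaster shows accuracy rising with the rating, which is the pattern the claim reports for human experts.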

We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.

Construction of the SciPredict benchmark described in the paper; explicitly reports 405 tasks and 33 sub-fields.

high positive SciPredict: Can LLMs Predict the Outcomes of Scientific Expe... benchmark_size_and_scope

The future of Nagpur's industrial belt depends not on resisting automation, but on an aggressive reskilling strategy to bridge the gap between current workforce capabilities and future technological requirements.

Normative policy conclusion in the paper recommending reskilling as the primary response; based on the paper's analysis of task changes and projected role shifts; no program evaluation or empirical evidence of reskilling effectiveness reported in the excerpt.

high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... need for reskilling / workforce skill acquisition

There is a projected surge in demand for 'AI-collaborative' roles such as machine maintenance, data supervision, and process optimization.

Projection in the paper based on analysis of task complementarities between humans and AI, listing specific roles expected to grow; no quantitative demand estimates or sample sizes provided in the excerpt.

high positive PREDICTING THE FUTURE OF JOBS IN NAGPUR DISTRICT MIDC: THE R... projected demand for AI-collaborative roles (machine maintenance, data supervisi...

The paper documents 14 deliberate conservative assumptions — including frozen base GDP, no AI-on-AI compounding, a permanent friction floor, and conservative capture rates — all of which directionally understate the benefit.

Paper lists 14 conservative modeling assumptions and claims they bias results downward (i.e., understate potential benefits).

high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... directional bias of model assumptions relative to potential benefits

Even excluding demand expansion and robotics layers entirely, the direct productivity contribution alone reaches approximately $940 billion per year by 2036.

Model output reported in the paper when removing demand expansion and robotics layers.

high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... direct productivity contribution to annual GDP by 2036 excluding demand expansio...

In all four scenarios, cumulative net GDP exceeds cumulative AI infrastructure investment before 2036, with the base case achieving payback in 2033.

Model financial calculation comparing cumulative net GDP uplift to cumulative AI infrastructure investment across scenarios; explicit payback year reported for base case.

high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... year when cumulative net GDP exceeds cumulative AI infrastructure investment (pa...
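
The payback test described here compares two running totals and reports the first year in which cumulative net GDP uplift exceeds cumulative investment. The sketch below shows that comparison on made-up numbers; the series and the function name `payback_year` are hypothetical and do not reproduce the paper's model or its inputs.

```python
from itertools import accumulate


def payback_year(years, annual_benefit, annual_investment):
    """First year where cumulative benefit strictly exceeds cumulative investment."""
    cum_benefit = accumulate(annual_benefit)
    cum_invest = accumulate(annual_investment)
    for year, benefit, invest in zip(years, cum_benefit, cum_invest):
        if benefit > invest:
            return year
    return None  # no payback within the horizon


# Hypothetical series in arbitrary units, for illustration only.
years = [2026, 2027, 2028, 2029, 2030]
benefit = [10, 30, 60, 100, 150]
investment = [50, 50, 50, 50, 50]
print(payback_year(years, benefit, investment))
```

The result is sensitive to how the investment series is defined (for example, whether it stops or continues after the build-out), so a payback year is only comparable across scenarios that share the same investment path.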

The base-case scenario yields approximately $1,057 billion in net annual GDP uplift by 2036, equivalent to 3.6 percent of 2024 GDP; the bear case produces $796 billion, the bull case $1,368 billion, and an agentic scenario produces $2,521 billion.

Model scenario outputs presented in the paper (four scenarios differentiated by capture rate and friction assumptions).

high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... net annual GDP uplift by 2036 (US, scenario-specific)
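
The base-case figures imply a 2024 GDP baseline, which in turn lets the other three scenarios be expressed in the same units. The sketch below back-calculates that baseline from the two quoted numbers and is only as precise as the rounded 3.6 percent; the variable names are illustrative.

```python
# Figures quoted in the claim, in billions of dollars.
BASE_UPLIFT_BN = 1057
BASE_SHARE_OF_2024_GDP = 0.036  # rounded in the source
SCENARIOS_BN = {"bear": 796, "base": 1057, "bull": 1368, "agentic": 2521}

# Baseline implied by "1,057 billion is 3.6 percent of 2024 GDP".
implied_gdp_bn = BASE_UPLIFT_BN / BASE_SHARE_OF_2024_GDP

# Each scenario's uplift as a share of that implied baseline.
shares = {name: uplift / implied_gdp_bn for name, uplift in SCENARIOS_BN.items()}

print(f"Implied 2024 GDP: about ${implied_gdp_bn / 1000:.1f} trillion")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of implied 2024 GDP")
```

On these inputs the implied baseline is roughly $29.4 trillion, and the bear, bull, and agentic scenarios come to roughly 2.7, 4.7, and 8.6 percent of it.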

Sector-specific productivity gain percentages are anchored to published evidence, including a randomized controlled trial of GitHub Copilot (Kalliamvakou et al., 2023), JPMorgan CEO disclosures, and Cognizant's New Work New World 2026 research.

The paper states that productivity percentages are anchored to published evidence and specifically cites the Kalliamvakou et al. (2023) RCT, JPMorgan CEO disclosures, and Cognizant (2026).

high positive AI Capex Is Justified: A Bottom-Up Sectoral Estimate of Arti... sector-specific productivity gain percentages used in the model

A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.

Design/implementation claim in paper describing deployment approach using YAML configuration rather than engineering work.

high positive Governed Reasoning for Institutional AI deployment effort required to support a new institutional decision domain

We introduce governability — how reliably a system knows when it should not act autonomously — as a primary evaluation axis for institutional AI alongside accuracy.

Conceptual contribution/metric proposed by authors in paper; no empirical validation reported in the excerpt.

high positive Governed Reasoning for Institutional AI governability (system's ability to know when not to act autonomously)

Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.

Empirical benchmark reported in paper on the 11-case evaluation set; counts of silent errors given for Cognitive Core and baselines.

high positive Governed Reasoning for Institutional AI count of silent errors (incorrect determinations that executed without human-rev...

Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.

Empirical benchmark reported in paper on the 11-case evaluation set; accuracies explicitly stated for three systems.

high positive Governed Reasoning for Institutional AI accuracy on prior authorization appeal cases
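
With only 11 cases, each percentage corresponds to a small whole number of correct determinations. The sketch below recovers those counts, assuming accuracy here means correct cases divided by 11 and rounded to the nearest percent; that definition is an assumption, since the excerpt does not spell it out.

```python
def implied_correct(reported_pct, n_cases=11):
    """Case counts whose rounded percentage matches the reported accuracy."""
    return [k for k in range(n_cases + 1) if round(100 * k / n_cases) == reported_pct]


REPORTED = {"Cognitive Core": 91, "ReAct": 55, "Plan-and-Solve": 45}

for system, pct in REPORTED.items():
    print(f"{system}: {pct}% matches {implied_correct(pct)} correct of 11")
```

Under that assumption the figures correspond to 10, 6, and 5 correct cases, so the gap between the best system and the baselines is four or five cases.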

We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.

Design/proposal described in paper (architectural specification); no empirical evaluation reported for the architecture itself in the excerpt.

high positive Governed Reasoning for Institutional AI system governability and auditability as properties of the decision substrate
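
The tamper-evident hash-chain ledger named in this claim can be illustrated with a generic construction in which each entry's SHA-256 digest covers the previous entry's digest. The sketch below is a minimal version of that general technique, not the paper's implementation; the entry fields, the genesis value, and the function names are all assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger, payload):
    """Add an entry whose hash commits to everything before it."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev": prev_hash, "hash": entry_hash(prev_hash, payload)}
    )


def verify(ledger):
    """Recompute every link; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical audit entries; the step names reuse primitives listed in the claim.
ledger = []
append(ledger, {"step": "retrieve", "case": "A-1"})
append(ledger, {"step": "classify", "case": "A-1"})
append(ledger, {"step": "govern", "case": "A-1", "human_review": True})
print(verify(ledger))
```

Changing any earlier payload alters its digest and invalidates every later link, which is what makes the record tamper-evident rather than tamper-proof: edits are detectable, not prevented.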

Organisations should invest in customisation capabilities for AI recruitment tools, implement comprehensive change management strategies, and maintain robust post-hire evaluation procedures.

Authors' recommendations derived from thematic findings and participant perspectives across two firms (qualitative synthesis of n = 22 interviews).

high positive The augmented recruiter: examining AI integration and decisi... recommended_organisational_practices_for_AI_recruitment

AI functioned optimally as an augmentative technology rather than as a replacement for human decision-makers in recruitment.

Findings: participants across the two case firms described AI being most effective when augmenting human judgment rather than replacing it (interviews n = 22).

high positive The augmented recruiter: examining AI integration and decisi... role_of_AI (augmentation vs replacement)

AI significantly enhanced efficiency through process standardisation and automation.

Findings based on participant accounts in thematic analysis (interviews n = 22) describing process optimisation and automation benefits.

high positive The augmented recruiter: examining AI integration and decisi... efficiency (process standardisation and automation)

The Principle of Maximum Heterogeneity reveals a convergence of complex phenomena across fields onto simple underlying design principles with important predictive value for future distributed production systems.

Synthesis claim in the paper arguing cross-field convergence and predictive value based on the theoretical model and conceptual examples; no empirical validation or forecasting trials reported.

high positive The Principle of Maximum Heterogeneity Optimises Productivit... predictive value of the model/principles for future distributed production syste...

The principles derived (including the Principle of Maximum Heterogeneity) can be used as a blueprint for constructing ideal distributed production systems; demonstrated by suggesting specific redesigns for compute systems executing large-scale AI.

Paper includes suggested redesigns for compute systems as demonstrations of the blueprint; these are proposed designs/illustrative applications rather than empirically validated interventions or trials.

high positive The Principle of Maximum Heterogeneity Optimises Productivit... design-guided performance improvements in compute systems for large-scale AI (pr...

The Principle of Maximum Heterogeneity applies recursively across all layers of nested production systems.

Theoretical claim within the paper arguing recursive applicability across nested system layers (e.g., neurons, firms, ecosystems); supported by conceptual reasoning and model exposition rather than empirical multi-layer tests.

high positive The Principle of Maximum Heterogeneity Optimises Productivit... emergence/spread of heterogeneity across nested layers
