Evidence (4114 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
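The matrix can be read as direction shares per outcome. The sketch below copies three rows from the table and computes each direction's share of the direction-coded claims. In some rows the four direction columns sum to less than the listed Total; the page does not say why, so the sketch only reports that gap. The function names are illustrative, not part of the source.

```python
# Counts copied from three rows of the matrix:
# (positive, negative, mixed, null, listed total).
ROWS = {
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 92, 10, 2, 173),
    "Firm Productivity": (435, 57, 88, 20, 606),
}


def direction_shares(pos, neg, mixed, null):
    """Return each direction's share of the direction-coded claims."""
    coded = pos + neg + mixed + null
    return {
        "positive": pos / coded,
        "negative": neg / coded,
        "mixed": mixed / coded,
        "null": null / coded,
    }


def uncoded(pos, neg, mixed, null, total):
    """Claims counted in Total but in none of the four direction columns."""
    return total - (pos + neg + mixed + null)


for name, (pos, neg, mixed, null, total) in ROWS.items():
    shares = direction_shares(pos, neg, mixed, null)
    gap = uncoded(pos, neg, mixed, null, total)
    print(f"{name}: {shares['negative']:.1%} negative, "
          f"{gap} claims outside the four columns")
```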
Innovation
By placing networked IoT sensors in factories, trucks, storage sites, and upstream suppliers, real-time data were paired with machine-learning routines to schedule preventive maintenance, forecast orders, and guide blockchain tracking, routing adjustments, and automated decisions balancing green goals with everyday performance.
Paper description of system design and interventions: placement of sensors across supply chain nodes and pairing with ML routines for maintenance, forecasting, blockchain tracking, routing, and automated decisions.
TechToken improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.
Benchmarking experiments across multiple patent-centered downstream tasks comparing TechToken to state-of-the-art models, with reported outperformance (statement given in abstract). Exact tasks, metrics, and sample sizes are not specified in the abstract.
Context similarity between code embeddings, defined as a measure of linguistic convergence, accurately predicts first technological combinations.
Operationalization of 'context similarity' between IPC-code embeddings and empirical validation showing it predicts first-time combinations of technologies (as claimed in abstract); implies out-of-sample prediction or event-detection experiments on patent combination events.
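One way to picture a similarity score between code embeddings is cosine similarity, sketched below. This is an assumption for illustration: the abstract does not define how "context similarity" is computed, and the four-dimensional vectors are made up. Only the IPC subclass labels are real codes.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for three IPC subclasses (values are illustrative).
embeddings = {
    "G06N": [0.9, 0.1, 0.3, 0.0],
    "H04L": [0.8, 0.2, 0.4, 0.1],
    "A01B": [0.0, 0.9, 0.0, 0.4],
}


def rank_candidate_pairs(emb):
    """Rank all code pairs from most to least similar."""
    codes = sorted(emb)
    pairs = [
        (cosine_similarity(emb[a], emb[b]), a, b)
        for i, a in enumerate(codes)
        for b in codes[i + 1:]
    ]
    return sorted(pairs, reverse=True)


for score, a, b in rank_candidate_pairs(embeddings):
    print(f"{a} + {b}: {score:.3f}")
```

Under the claim above, pairs that rank high but have never been combined in a patent would be the candidates for a future first combination.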
We introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification (IPC) codes, as words in its vocabulary, learning the language of technologies by embedding these codes during fine-tuning.
Methodological contribution described in the paper: architecture is a transformer model with IPC codes tokenized and embedded during fine-tuning on patent data (method statement from abstract).
Forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance.
Temporal analysis of patent-language embeddings showing predictive signals preceding first occurrences of technological combinations; described as detectable 'even decades in advance' across the patent corpus (statement in paper abstract). Specific methods likely include embedding IPC codes and measuring changes in context similarity over time; sample described qualitatively as spanning many patents (abstract mentions 'thousands of patents').
For government policy, it is necessary to establish precise dynamic intervention and orderly exit mechanisms to effectively govern the computing power innovation ecosystem.
Policy implication drawn from the model's analysis of equilibria and regime transitions, and numerical experiments indicating path-dependent/regime-dependent outcomes under different regulatory strategies (method: theoretical model implications + simulation).
A leading computing power incumbent could strengthen its ecological niche and maintain its role as an industry cornerstone by opening its underlying interfaces and software stacks while remaining integrated.
Implication derived from the model's strategic equilibrium analysis and simulations regarding incumbents' strategies for preserving niche/market position (method: evolutionary game analysis + simulations).
Downstream AI firms may benefit from advancing vertical integration, achieving hardware–software co-optimization through self-developed domain-specific architectures.
Result of the theoretical model (tripartite evolutionary game) and numerical simulation experiments showing advantages to downstream innovators when pursuing vertical integration and co-optimization (method: theoretical model + simulation).
Mechanism tests reveal efficiency gains via automation are a key pathway by which AI increases productivity in constrained firms.
Mechanism analysis reported in the paper (tests linking AI adoption to automation-related efficiency improvements in constrained firm clusters).
Firms constrained by limited intangibles, outdated hardware, or weak human capital benefit most from AI adoption when AI mitigates bottlenecks (i.e., larger positive TFP effects for resource-constrained firms).
Subgroup/cluster-specific estimates from panel analysis showing larger productivity gains in clusters characterized by limited intangibles, outdated hardware, or weak human capital.
The frontier for AI-augmented science is not acceleration; it is the redesign of the certifying infrastructure around these new scarcities.
Prescriptive conclusion in the paper arguing priority of institutional redesign over mere speed gains; presented without empirical testing in the excerpt.
Competent-looking judgment, including selecting, ranking, attributing, and certifying, is now produced at scale at marginal cost approaching zero, inverting the dominant economics-of-AI reading that treats judgment as the scarce complement to cheap prediction.
Argumentative/theoretical claim in the paper; no empirical sample, experiment, or quantitative data reported in the excerpt (implicit basis: observation of scalable AI outputs).
Policy recommendations: invest in digital infrastructure, human capital development, and inclusive technology diffusion strategies to ensure more equitable distribution of AI-driven economic value.
Policy implications drawn from study findings (heterogeneous effects and mediation by structural conditions).
The magnitude of AI's growth effects varies across economic contexts: developed economies experience substantially stronger growth impacts (approximately 0.33) than emerging economies (approximately 0.15).
Heterogeneity analysis / subgroup comparisons (developed vs emerging economies) using the panel data regressions and/or quantile regressions on the 2015–2024 dataset; exact sample sizes per subgroup not reported.
AI adoption has a comparatively weaker direct effect on economic growth (direct effect β = 0.09).
Mediation/structural decomposition from the paper showing direct (non-mediated) coefficient from AI adoption to growth.
Agentic AI influences economic growth primarily through a productivity channel (mediated effect β = 0.35, p < 0.01).
Mediation analysis (panel data) estimating indirect effect of AI adoption on GDP growth via measured productivity channel; data sources: World Bank and OECD indicators, 2015–2024.
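The reported direct (β = 0.09) and mediated (β = 0.35) coefficients can be combined into a rough share-mediated figure. The sketch assumes both coefficients are on the same scale and add up to a total effect, as in a standard linear mediation decomposition; the excerpt does not confirm this, so the result is indicative only.

```python
def decompose(direct, indirect):
    """Return (total effect, share of the total that is mediated).

    Assumes an additive linear decomposition: total = direct + indirect.
    """
    total = direct + indirect
    return total, indirect / total


# Coefficients as reported in the two claims above.
total_effect, share_mediated = decompose(direct=0.09, indirect=0.35)
print(f"total = {total_effect:.2f}, share mediated = {share_mediated:.1%}")
```

On these numbers, roughly four fifths of the estimated effect would run through the productivity channel.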
AI adoption significantly improves firm-level productivity (β = 0.18, p < 0.01).
Fixed-effects panel regression using an AI Adoption Index as predictor on firm-level productivity; data drawn from World Bank (World Development Indicators and Enterprise Surveys) and OECD AI indicators for 2015–2024 (sample size not reported in text).
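A fixed-effects estimate of this kind can be sketched with the within transformation: subtract each firm's own mean from its observations, then regress demeaned productivity on the demeaned adoption index. The panel below is synthetic and built so that the true slope is 0.18. It is not the World Bank or OECD data, and the paper's actual specification (controls, time effects, standard errors) is not given in the excerpt.

```python
from collections import defaultdict


def within_estimator(rows):
    """Slope of y on x after removing firm-specific means.

    rows: iterable of (firm_id, x, y) observations.
    """
    by_firm = defaultdict(list)
    for firm, x, y in rows:
        by_firm[firm].append((x, y))

    num = den = 0.0
    for obs in by_firm.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            num += (x - mean_x) * (y - mean_y)
            den += (x - mean_x) ** 2
    return num / den


# Synthetic panel: y = firm effect + 0.18 * x, with different firm effects.
TRUE_BETA = 0.18
panel = [
    ("firm_a", x, 5.0 + TRUE_BETA * x) for x in (1.0, 2.0, 4.0)
] + [
    ("firm_b", x, 2.0 + TRUE_BETA * x) for x in (3.0, 5.0, 6.0)
]

print(f"estimated beta = {within_estimator(panel):.2f}")
```

Demeaning removes anything constant within a firm, so the differing firm effects (5.0 and 2.0 here) do not bias the slope.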
Agentic AI has strong potential to boost productivity and growth.
Statement in paper motivated by literature review and the study's empirical results linking AI adoption to productivity and growth.
The field's near-term research agenda should explicitly include collecting and using triadic data.
Normative recommendation in the paper; presented as the authors' advised research priority rather than empirically justified within the excerpt.
Triadic data is the empirical key to four open questions in agent training.
Argumentative claim in the paper asserting that triadic data is central to four open research questions, which the excerpt does not specify; no empirical demonstration is included in the excerpt.
This triadic data is capturable in 12-18 months with methods already mature in adjacent fields.
Claim in the paper based on authors' assessment of methodological maturity in adjacent fields; no empirical project timeline or pilot data is provided in the excerpt.
Any such corpus -- triadic or otherwise -- must justify its quality to a fine-tuning researcher through a four-tier evidence framework: mechanical verification, statistical corpus characterization, probe experiments, and pre-registered blind evaluation.
Methodological proposal in the paper outlining a four-tier evidence framework; presented as normative guidance rather than validated by application to a corpus in the excerpt.
The canonical instantiation of triadic data is two complementary products: long-horizon expert trajectories captured under stimulated-recall protocols, and simulated cross-functional companies -- instrumented teams of senior engineers, product managers, designers, and data scientists working through ambiguous deliverables on shared infrastructure.
Prescriptive specification in the paper proposing two concrete dataset types as canonical instantiations; presented as design/recommendation rather than empirically tested.
The substrate for the next generation of software-engineering (SWE) agents is neither larger GitHub scrapes nor more solo-agent trajectories nor -- sufficient by itself -- open human-AI dialogue logs; it is triadic data: synchronized capture of the human-human conversations where engineering context is formed, the human-AI sessions where that context is partially consumed, and the multi-week cross-functional work that surrounds both.
Argument and conceptual proposal in the paper; no empirical validation or comparative experiments are provided in the excerpt.
KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.
Theoretical claim about economic and cumulative effects of adopting KOs; no cost-benefit analysis, pilot results, or quantitative evidence reported in the paper.
We propose Knowledge Objects (KOs) — structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse.
Proposed solution described in the paper; conceptual design and intended properties presented, without reported deployments, trials, or empirical evaluation.
We release the benchmark, harness, sweep configurations, and full run corpus.
Statement of artifact release in the paper; verifiable by checking the project's repository or supplementary materials.
These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control.
Synthesis/recommendation drawn from the empirical results on AgentFloor showing where small/mid models suffice and where frontier models have advantage; prescriptive claim rather than a direct empirical measurement.
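The design principle amounts to a routing rule. The sketch below routes tasks by capability tier, assuming the benchmark's six tiers run from easiest (1) to hardest (6). The model labels and the cut-off tier are hypothetical; the source does not state where the hand-off to a frontier model should happen.

```python
from dataclasses import dataclass

SMALL_MODEL = "small-open-weight"  # hypothetical label
FRONTIER_MODEL = "frontier"        # hypothetical label
FRONTIER_FROM_TIER = 6             # assumption: the source gives no cut-off


@dataclass(frozen=True)
class Task:
    name: str
    tier: int  # 1 (routine) to 6 (long-horizon planning), assumed ordering


def route(task: Task) -> str:
    """Send routine tiers to the small model and the top tiers to the frontier model."""
    if not 1 <= task.tier <= 6:
        raise ValueError(f"tier must be in 1..6, got {task.tier}")
    return FRONTIER_MODEL if task.tier >= FRONTIER_FROM_TIER else SMALL_MODEL


for task in (Task("follow instruction", 1), Task("multi-week plan", 6)):
    print(f"{task.name} -> {route(task)}")
```

In practice the cut-off would be set from measured per-tier reliability rather than fixed in advance.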
The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability.
Performance breakdown by capability tier on AgentFloor showing frontier (GPT-5) advantage on long-horizon planning/constraint-tracking tasks; both model groups have low absolute reliability on these tasks according to reported results.
We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.
Empirical evaluation reported in the paper: 16 open-weight models spanning specified parameter sizes, inclusion of GPT-5, and a total of 16,542 scored runs (reported counts).
We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints.
Paper describes the design of the benchmark: deterministic, 30 tasks, organized into six tiers covering specified capabilities. This is a descriptive claim about the artifact introduced in the work.
The paper proposes five forms of online and offline issuance of RSDM, providing a prototype for creating a globally recognized modern honest money.
Authors' stated contribution in the paper (enumeration of five issuance forms and provision of a prototype); the excerpt explicitly refers to 'five forms'.
RSDM is an innovative version of Jiaozi (a deposit receipt for base metal coin that emerged in Sichuan, China, about a thousand years ago).
Comparative/analogical claim by the authors linking the proposed design to a historical instrument; no empirical analysis provided in the excerpt.
Redeemable Self-Decaying/Devaluing Money (RSDM) is a tokenized commodity money whose essential innovation is to cover the storage fee of metal coins through the self-devaluation of the metal weight recorded on the coins' deposit certificate (warehouse receipt).
Design/specification proposed in the paper (conceptual mechanism); no empirical evaluation or sample size reported in the excerpt.
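The self-devaluing mechanism can be pictured as a recorded weight that shrinks over time to pay for storage. The sketch assumes a constant annual fee rate applied with yearly compounding. Both the schedule and the 1% rate are hypothetical; the excerpt does not specify how the devaluation is computed.

```python
def recorded_weight(initial_grams, annual_fee_rate, years):
    """Metal weight shown on the deposit certificate after `years`.

    Assumes the storage fee is charged as a fixed fraction of the
    recorded weight each year (compounding), which is an illustration only.
    """
    if not 0 <= annual_fee_rate < 1:
        raise ValueError("annual_fee_rate must be in [0, 1)")
    if years < 0:
        raise ValueError("years must be non-negative")
    return initial_grams * (1 - annual_fee_rate) ** years


# Hypothetical certificate: 100 g of metal, 1% yearly storage fee.
for year in (0, 1, 2, 10):
    print(f"year {year}: {recorded_weight(100.0, 0.01, year):.2f} g")
```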
When AI acts as an agent for cross-border capital pools and cross-cyclical asset allocation, it needs sound money that can resist the depreciation of fiat currency and store long-term value.
Theoretical argument in the paper about functional requirements of AI agents managing cross-border capital; no empirical sample reported in the excerpt.
In the AI world, however, the medium of exchange tends to be a globally recognized currency.
Author's theoretical assertion / forward-looking claim in the paper; no empirical data or sample provided in the excerpt.
Qiushi Engine performed thousands of LLM-mediated reasoning, measurement and revision actions during its investigations (e.g., 3,242 LLM calls, 1,242 tool calls).
Operational logs and activity counts reported in the paper: 145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes, 44 scripts.
Qiushi Engine combines nonlinear research phases, Meta-Trace memory and a dual-layer architecture to maintain adaptive and stable research trajectories across long-horizon investigations.
System architecture and methods section describing nonlinear research phases, Meta-Trace memory, and dual-layer architecture; demonstrated operation across long-horizon tasks in experiments (thousands of LLM and tool calls).
The AI-discovered optical bilinear mechanism suggests a route towards high-speed, energy-efficient optical hardware for pairwise computation.
Interpretive claim based on the structural analogy between the discovered optical bilinear interaction and Transformer attention; conceptual argument provided in the paper rather than measured hardware speed or energy benchmarks.
In an open-ended study (145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes and 44 scripts), Qiushi Engine proposes and experimentally validates an optical bilinear interaction, a physical mechanism structurally analogous to a core operation in Transformer attention.
Open-ended experimental study reported in the paper with the listed activity metrics (145.9M tokens, 3,242 LLM calls, etc.); the paper presents experimental measurements that it claims validate the optical bilinear interaction, and draws a structural analogy to the pairwise operation in Transformer attention.
Qiushi Engine autonomously reproduces a published transmission-matrix experiment on a non-original platform.
Experimental reproduction reported in the paper: Qiushi Engine executes the published transmission-matrix experiment on a different (non-original) optical platform, and the measured results are compared with the published experiment.
Qiushi Discovery Engine is an LLM-based agentic system for end-to-end autonomous scientific discovery on a real optical platform.
Description and implementation of the Qiushi Engine combining LLM-based agentic control with an optical experimental platform; system design and end-to-end experiments reported in the paper (no randomized trial; system demonstration).
The paper formalizes these limitations, addresses four alternative views, and proposes a co-existence solution plus a call to action for system builders, benchmark designers, and the memory community.
Meta-claim about the paper's content: formalization, rebuttals, and recommendations stated in the abstract; no empirical sample reported in abstract.
Complementary Learning Systems (CLS) theory shows biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation.
Appeal to established neuroscience theory (CLS); the paper draws on CLS literature to justify the two-system solution in biology; no new empirical sample reported in abstract.
Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.
Authors' conclusion/argument based on the methods and preliminary experimental results presented in the paper (interpretive claim rather than a quantified empirical result).
Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.
Argumentative/theoretical scalability claim based on the abundance of personas and the scalable design of the methodology (no empirical demonstration at millions/billions scale reported).
Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.
Reported runtime and turn-count metrics from the preliminary experiments (per-run runtime >8 hours; per-run average >2,000 turns).
In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.
Reported preliminary experiment count in the paper (explicit statement: 1,000 synthetic computers were created and simulated).
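The reported per-run figures imply lower bounds on the total scale of the preliminary experiments. The sketch assumes one simulation run per synthetic computer, which the excerpt suggests but does not state outright.

```python
COMPUTERS = 1_000              # reported number of synthetic computers
MIN_HOURS_PER_RUN = 8          # "over 8 hours" of agent runtime per run
MIN_AVG_TURNS_PER_RUN = 2_000  # "more than 2,000 turns on average"


def lower_bounds(runs, hours_per_run, avg_turns_per_run):
    """Return lower bounds on (total agent-hours, total turns)."""
    return runs * hours_per_run, runs * avg_turns_per_run


# Assumption: exactly one run per synthetic computer.
hours, turns = lower_bounds(COMPUTERS, MIN_HOURS_PER_RUN, MIN_AVG_TURNS_PER_RUN)
print(f"more than {hours:,} agent-hours and {turns:,} turns in total")
```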
Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.
Description of the two-agent simulation procedure in the paper (simulation design: objective-creating agent and user-acting agent executing tasks across the synthetic computer).
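The two-agent procedure can be sketched as a loop: one function stands in for the objective-creating agent, another for the agent acting as the user, and the loop runs until every objective is done or a turn budget runs out. All names, the stub behaviour, and the turn budget are hypothetical. The real agents are LLM-driven and are not described at this level in the excerpt.

```python
def create_objectives(computer):
    """Stand-in for the objective-creating agent (hypothetical stub)."""
    role = computer["user_role"]
    return [f"{role} deliverable {i}" for i in range(1, 4)]


def work_one_turn(turn):
    """Stand-in for the user-acting agent.

    Returns True when the current objective is finished. Assumption for
    illustration only: every fifth turn completes an objective.
    """
    return turn % 5 == 0


def simulate(computer, max_turns=10_000):
    """Work through objectives in order until all are done or turns run out."""
    pending = create_objectives(computer)
    completed = []
    turn = 0
    while pending and turn < max_turns:
        turn += 1
        if work_one_turn(turn):
            completed.append(pending.pop(0))
    return completed, turn


done, turns_used = simulate({"user_role": "analyst"})
print(f"{len(done)} objectives completed in {turns_used} turns")
```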
We introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations).
Methodological description and implementation presented in the paper (design and procedures for generating synthetic computers and artifact types).