Evidence (13661 claims)

| Topic | Claims |
|---|---|
| Adoption | 8339 |
| Productivity | 7479 |
| Governance | 6715 |
| Human-AI Collaboration | 6267 |
| Org Design | 4098 |
| Innovation | 3987 |
| Labor Markets | 3488 |
| Skills & Training | 2888 |
| Inequality | 2016 |

Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 740 | 192 | 95 | 871 | 1945 |
| Governance & Regulation | 796 | 388 | 185 | 119 | 1512 |
| Organizational Efficiency | 765 | 186 | 123 | 82 | 1166 |
| Technology Adoption Rate | 610 | 227 | 121 | 95 | 1061 |
| Research Productivity | 409 | 121 | 56 | 331 | 928 |
| Output Quality | 464 | 174 | 58 | 47 | 743 |
| Decision Quality | 318 | 173 | 75 | 42 | 615 |
| Firm Productivity | 432 | 55 | 88 | 20 | 601 |
| AI Safety & Ethics | 214 | 273 | 65 | 33 | 589 |
| Market Structure | 175 | 165 | 120 | 24 | 489 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 161 | 57 | 57 | 16 | 291 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Fiscal & Macroeconomic | 130 | 69 | 43 | 26 | 275 |
| Employment Level | 104 | 50 | 105 | 13 | 274 |
| Consumer Welfare | 116 | 62 | 42 | 11 | 231 |
| Firm Revenue | 149 | 45 | 26 | 3 | 223 |
| Inequality Measures | 43 | 120 | 49 | 6 | 218 |
| Task Completion Time | 164 | 29 | 8 | 12 | 214 |
| Worker Satisfaction | 89 | 60 | 20 | 12 | 181 |
| Error Rate | 69 | 89 | 9 | 2 | 169 |
| Regulatory Compliance | 74 | 67 | 14 | 4 | 159 |
| Training Effectiveness | 91 | 19 | 13 | 19 | 144 |
| Wages & Compensation | 77 | 33 | 25 | 6 | 141 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Automation Exposure | 49 | 50 | 22 | 12 | 136 |
| Developer Productivity | 91 | 17 | 14 | 5 | 128 |
| Job Displacement | 12 | 80 | 19 | 1 | 112 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Skill Obsolescence | 5 | 43 | 6 | 1 | 55 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
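
The matrix is easier to compare across rows as shares than as raw counts. The sketch below is illustrative only: it copies four rows from the table and computes the share of each direction among the four listed direction columns. For some rows the listed Total exceeds the sum of those columns (for "Other", 1,898 against a listed 1,945), so the sketch normalises by the column sum and reports the difference separately. The function names `direction_shares` and `unlisted_claims` are hypothetical and not part of the source.

```python
# Rows copied from the evidence matrix:
# (positive, negative, mixed, null, listed_total)
ROWS = {
    "Other": (740, 192, 95, 871, 1945),
    "Firm Productivity": (432, 55, 88, 20, 601),
    "Job Displacement": (12, 80, 19, 1, 112),
    "Inequality Measures": (43, 120, 49, 6, 218),
}


def direction_shares(row):
    """Share of each direction among the four listed direction columns."""
    pos, neg, mixed, null, _listed_total = row
    counted = pos + neg + mixed + null
    return {
        "positive": pos / counted,
        "negative": neg / counted,
        "mixed": mixed / counted,
        "null": null / counted,
    }


def unlisted_claims(row):
    """Listed total minus the sum of the four direction columns."""
    return row[4] - sum(row[:4])


for name, row in ROWS.items():
    shares = direction_shares(row)
    print(
        f"{name}: positive={shares['positive']:.1%}, "
        f"negative={shares['negative']:.1%}, "
        f"unlisted={unlisted_claims(row)}"
    )
```

Normalising by the column sum rather than the listed total keeps the four shares summing to one even where the two figures disagree.
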
Measures to promote digital inclusion, equitable AI development, investment in education, and international cooperation are urgently needed to spread the benefits of AI more widely and equitably.
Normative recommendation in the paper based on its analysis of global disparities and risks; no policy evaluation or impact estimates provided in the excerpt.
High-income regions are pioneers in the implementation of AI.
Assertion in the paper based on cross-regional comparison of AI implementation (no specific metrics, methods, or sample size provided in the excerpt).
High-income regions (North America, Europe, parts of the Asia-Pacific region) have near-universal Internet access.
Statement in the paper based on a global comparative analysis of internet access across regions; the excerpt does not report specific data sources, methods, or sample size.
Qiushi Engine performed thousands of LLM-mediated reasoning, measurement and revision actions during its investigations (e.g., 3,242 LLM calls, 1,242 tool calls).
Operational logs and activity counts reported in the paper: 145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes, 44 scripts.
Qiushi Engine combines nonlinear research phases, Meta-Trace memory and a dual-layer architecture to maintain adaptive and stable research trajectories across long-horizon investigations.
System architecture and methods section describing nonlinear research phases, Meta-Trace memory, and dual-layer architecture; demonstrated operation across long-horizon tasks in experiments (thousands of LLM and tool calls).
The AI-discovered optical bilinear mechanism suggests a route towards high-speed, energy-efficient optical hardware for pairwise computation.
Interpretive claim based on the structural analogy between the discovered optical bilinear interaction and Transformer attention; conceptual argument provided in the paper rather than measured hardware speed or energy benchmarks.
In an open-ended study (145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes and 44 scripts), Qiushi Engine proposes and experimentally validates an optical bilinear interaction, a physical mechanism structurally analogous to a core operation in Transformer attention.
Open-ended experimental study reported in the paper with the listed activity metrics (145.9M tokens, 3,242 LLM calls, etc.); the paper presents experimental measurements as validation of the optical bilinear interaction and draws a structural analogy to the pairwise operation in Transformer attention.
Qiushi Engine autonomously reproduces a published transmission-matrix experiment on a non-original platform.
Experimental reproduction reported in the paper: Qiushi Engine executes the published transmission-matrix experiment on a different (non-original) optical platform, and the measured results are compared with those of the published experiment.
Qiushi Discovery Engine is an LLM-based agentic system for end-to-end autonomous scientific discovery on a real optical platform.
Description and implementation of the Qiushi Engine combining LLM-based agentic control with an optical experimental platform; system design and end-to-end experiments reported in the paper (no randomized trial; system demonstration).
The practical aim is to help strategic leaders and system designers recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them.
Stated aim/objective of the paper (normative guidance; conceptual).
The framework introduces 'co-adaptability'—the capacity of a configuration to improve as human and non-human participants adjust together—and situates it within 'heterogeneous teaming' where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation.
Conceptual/theoretical introduction of new constructs (co-adaptability and heterogeneous teaming) in the paper; definitional rather than empirical.
The five positions serve as landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision.
Normative/conceptual claim supported by the framework; no empirical validation or sample provided in the excerpt.
The spectrum focuses attention on where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows.
Conceptual argument in the paper describing the axes/criteria of the spectrum (theoretical/thematic analysis; no empirical data reported).
This paper offers a leadership-facing spectrum for viewing human–AI decision relationships, with five positions: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI.
Conceptual presentation in the paper: a theorized five-position spectrum (no empirical sample or experiment reported).
The paper formalizes these limitations, addresses four alternative views, and proposes a co-existence solution plus a call to action for system builders, benchmark designers, and the memory community.
Meta-claim about the paper's content: formalization, rebuttals, and recommendations stated in the abstract; no empirical sample reported in abstract.
Complementary Learning Systems (CLS) theory shows biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation.
Appeal to established neuroscience theory (CLS); the paper draws on CLS literature to justify the two-system solution in biology; no new empirical sample reported in abstract.
AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall.
Policy/design recommendation based on the paper's analyses of 27K annotated transcripts showing links between user fluency, engagement patterns, failure visibility, recovery, and success.
Individuals should adopt a stance of active engagement rather than passive acceptance.
Interpretive recommendation derived from observed differences in outcomes by user fluency in the 27K annotated transcript analysis (paper’s discussion/recommendation section).
Fluent users' failures are more likely to lead to partial recovery.
Analysis of conversation trajectories in the 27K annotated transcripts showing higher incidence of partial recovery (follow-up iterations leading to partial fix) after failures by fluent users.
Fluent users' failures tend to be visible (a direct consequence of their engagement).
Annotations of failure visibility within the 27K transcripts, comparing frequency of visible vs. invisible failures across fluency levels.
Fluent users take on more complex tasks than novices.
Observational analysis of a richly annotated sample of 27,000 transcripts drawn from the WildChat-4.8M dataset; transcripts were annotated for user fluency and task characteristics (as reported in the paper).
Organizations should cultivate a culture of critical engagement with AI outputs, and e-leadership development must focus on building competencies in mediating, filtering and legitimizing AI contributions within digital workflows.
Recommendations based on thematic analysis of interview data across 34 project managers; presented as implications rather than empirically tested interventions.
To achieve balanced augmentation, leaders must proactively frame AI's role, embedding validation checkpoints and human authorship clauses to maintain accountability.
Prescriptive recommendation derived from thematic findings and cross-case patterns in the 34 interviews; no experimental or longitudinal testing reported.
Proactive engagement combined with creation-oriented use generated the highest effectiveness.
Qualitative coding and cross-case comparisons in the thematic analysis of 34 interviews identified combinations of proactive e-leadership and creation-oriented AI use associated with reported high team effectiveness.
The trajectory of the curvilinear relationship is governed by e-leadership practices.
Interview data analyzed thematically showing recurring references to leadership practices as moderators of AI-use effectiveness across the 34 interviews.
Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.
Paper contribution section — authors present design recommendations derived from study findings (not an empirical claim about an evaluated intervention).
Freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition.
Reported finding from the paper's mixed-methods study (survey + semi-structured interviews with freelance knowledge workers).
We evaluated fidelity, calibration, cost, and gaming vulnerability of the proposed attribution approach across more than 400 configurations.
Empirical experimental section of the paper reporting evaluation across >400 model/configuration runs (paper text: 'more than 400 configurations').
Gradient-based attribution on gridded GFS analysis inputs is a viable candidate value signal for individual sensor contributions.
Experiments reported in the paper applying gradient attribution to gridded GFS analysis inputs; methodological evaluation described.
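As a rough illustration of gradient-based attribution on gridded inputs, the sketch below scores each grid cell by the magnitude of gradient times input. It is a minimal sketch under loud assumptions: `toy_forecast` is a hypothetical four-cell stand-in for a differentiable weather model, its weights are invented, and the gradient is taken by central finite differences rather than automatic differentiation. It is not the paper's model or scoring rule.

```python
import math


def toy_forecast(grid):
    """Hypothetical stand-in for a differentiable weather model.

    Maps a flattened 4-cell analysis grid to one scalar forecast value.
    The weights are invented for illustration only.
    """
    weights = [0.5, -1.0, 2.0, 0.0]
    return math.tanh(sum(w * x for w, x in zip(weights, grid)))


def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of scalar function f at point x."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def attribution(f, x):
    """Gradient-times-input magnitude for each grid cell."""
    return [abs(g * xi) for g, xi in zip(gradient(f, x), x)]


grid = [0.1, 0.2, 0.05, 0.3]
scores = attribution(toy_forecast, grid)
for cell, score in enumerate(scores):
    print(f"cell {cell}: attribution {score:.4f}")
```

In this toy setup the last cell carries zero weight, so its attribution is zero regardless of its input value. Turning such per-cell scores into per-sensor scores requires a mapping from sensors to grid cells, which is not sketched here.
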
Differentiable AI weather models can be utilised to fill the gap between data-quality methods and adjoint-based data valuation, providing a practical value signal.
Methodological proposal and motivation in the paper; supported by subsequent computational experiments using differentiable AI weather models.
Large-scale IoT weather sensing networks require incentive mechanisms to sustain participation.
Position/assertion in introduction and motivation section of the paper (conceptual argument; no empirical sample reported).
Models across all three families acquire interpretable mechanical reasoning strategies without fine-tuning.
Observation reported for the three open-source models used in experiments (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B) showing emergent, interpretable mechanical reasoning during the iterative design process without any model fine-tuning.
The system correctly diagnoses underconstraint failure modes 35.6% of the time.
Reported diagnostic accuracy for underconstraint failure mode in the experimental results (35.6%).
The system correctly diagnoses overconstraint failure modes 56.3% of the time.
Reported diagnostic accuracy for overconstraint failure mode in the experimental results (56.3%).
78.6% of iterative refinement trajectories show measurable improvement.
Reported aggregate statistic from the experimental evaluation of iterative refinement trajectories (share of trajectories that show measurable improvement).
The modular architecture improves structural validity by up to 134% over monolithic baselines.
Empirical results reported across six motion targets and three models comparing modular architecture to monolithic baselines; the paper reports an improvement in structural validity up to 134%.
The modular architecture reduces geometric error by up to 68% over monolithic baselines.
Empirical results reported across six engineering-relevant motion targets and three open-source models comparing the modular architecture to monolithic baselines; the paper states a maximum reduction of geometric error of 68%.
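To make the two "up to" figures concrete, they can be read as relative changes against the monolithic baseline. This reading is an assumption, since the excerpt does not define either metric, and the symbols below (V for structural validity, E for geometric error) are introduced here for illustration only.

```latex
\[
\frac{V_{\text{mod}} - V_{\text{mono}}}{V_{\text{mono}}} = 1.34
\;\Longrightarrow\; V_{\text{mod}} = 2.34\, V_{\text{mono}},
\qquad
\frac{E_{\text{mono}} - E_{\text{mod}}}{E_{\text{mono}}} = 0.68
\;\Longrightarrow\; E_{\text{mod}} = 0.32\, E_{\text{mono}}.
\]
```

Under that reading, the best case more than doubles structural validity and leaves about a third of the baseline geometric error.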
Language models can systematically improve linkage designs through symbolic representations.
Reported experiments using a modular architecture combining language-model agents and numerical optimisers across six engineering-relevant motion targets and three open-source models (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B); comparisons reported versus monolithic baselines.
The proposed framework emerged from operational work to improve clinician capability in a live value-based care deployment.
Stated as originating from operational experience in a live deployment; no details on deployment scale, sample size, or outcomes provided in the excerpt.
Training environments that combine longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics.
Normative/theoretical argument presented in the paper; no empirical tests or sample sizes reported in the excerpt.
Chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties for learning: longitudinal density, concentrated decision space, outcome labels, and natural capability variation.
Argument/claim in the paper that outcome-based contracts and chronic disease management produce favorable data characteristics; asserted as part of the framework motivation. No quantitative empirical evidence or sample sizes provided in the excerpt.
We propose a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, which prevents a failure mode we term 'suppression bias'—the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold.
Proposed algorithmic contribution and theoretical claim; suppression bias defined and a mitigation approach described. No empirical evaluation or sample sizes given in the excerpt.
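The general shape of alternating optimization can be sketched on a toy problem. The example below is not the paper's architecture: it assumes, purely for illustration, that an observed score `y[i][j]` for recommendation `i` and clinician `j` is approximately the product of a per-recommendation reward and a per-clinician capability, and fits the two sets of parameters by alternating least-squares updates. All names and the multiplicative model are hypothetical.

```python
def alternating_fit(y, steps=20):
    """Fit y[i][j] ~ reward[i] * capability[j] by alternating least squares.

    Each step holds one parameter set fixed and solves the other in
    closed form, which is the defining pattern of alternating optimization.
    """
    n_rec, n_clin = len(y), len(y[0])
    reward = [1.0] * n_rec
    capability = [1.0] * n_clin
    for _ in range(steps):
        # Step A: capability fixed, least-squares update of each reward.
        denom = sum(k * k for k in capability)
        reward = [
            sum(y[i][j] * capability[j] for j in range(n_clin)) / denom
            for i in range(n_rec)
        ]
        # Step B: reward fixed, least-squares update of each capability.
        denom = sum(r * r for r in reward)
        capability = [
            sum(y[i][j] * reward[i] for i in range(n_rec)) / denom
            for j in range(n_clin)
        ]
    return reward, capability


# Synthetic scores generated from known (invented) parameters.
true_reward = [1.0, 2.0, 3.0]
true_capability = [0.5, 1.0]
scores = [[r * k for k in true_capability] for r in true_reward]

reward, capability = alternating_fit(scores)
print([[round(r * k, 3) for k in capability] for r in reward])
```

Only the products `reward[i] * capability[j]` are identified in this toy model; the individual factors are recovered up to a common scale.
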
We formulate preferences conditioned on patient state s, organizational context c, and clinician capability κ, where κ decomposes into execution capability (κ-exec) and alignment capability (κ-align).
Presented as a formal model formulation in the paper; theoretical description without empirical sample sizes in the excerpt.
We introduce a five-category override taxonomy that maps override types to distinct model update targets.
Stated as a formal contribution of the framework; taxonomy proposed in the paper. No empirical validation or sample size reported in the excerpt.
Clinician overrides of clinical AI recommendations can be reframed as implicit preference data analogous to reinforcement learning from human feedback (RLHF), but richer because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable.
Conceptual argument presented in the paper drawing an analogy to RLHF; no empirical metrics or sample size reported in the excerpt.
Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.
Authors' conclusion/argument based on the methods and preliminary experimental results presented in the paper (interpretive claim rather than a quantified empirical result).
Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.
Argumentative/theoretical scalability claim based on the abundance of personas and the scalable design of the methodology (no empirical demonstration at millions/billions scale reported).
Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.
Reported runtime and turn-count metrics from the preliminary experiments (per-run runtime >8 hours; per-run average >2,000 turns).
In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.
Reported preliminary experiment count in the paper (explicit statement: 1,000 synthetic computers were created and simulated).
Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.
Description of the two-agent simulation procedure in the paper (simulation design: objective-creating agent and user-acting agent executing tasks across the synthetic computer).
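The control flow of the two-agent procedure can be sketched as a simple loop. Everything below is a hypothetical stand-in: `SyntheticComputer`, `create_objectives`, `act_as_user`, and the turn cap are invented for illustration, and the stub agents are deterministic placeholders rather than LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class SyntheticComputer:
    """Hypothetical container for one synthetic user world."""
    user_persona: str
    files: list = field(default_factory=list)


def create_objectives(computer):
    """Stand-in for the objective-creating agent."""
    return [f"deliverable {i} for {computer.user_persona}" for i in range(1, 4)]


def act_as_user(computer, objective):
    """Stand-in for the user-acting agent; one call is one turn of work.

    This stub always finishes the objective in a single turn.
    """
    computer.files.append(objective)
    return True


def simulate(computer, max_turns=2000):
    """Run turns until every objective is completed or the cap is reached."""
    pending = create_objectives(computer)
    turns = 0
    while pending and turns < max_turns:
        turns += 1
        if act_as_user(computer, pending[0]):
            pending.pop(0)
    return turns, pending


computer = SyntheticComputer(user_persona="accountant")
turns, pending = simulate(computer)
print(turns, pending, len(computer.files))
```

A real user agent would need many turns per objective; the stub collapses each objective into one turn so the loop structure stays visible.
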