Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
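
To read a matrix row as proportions, divide each direction count by the row total. The sketch below does this for three rows copied from the table above. The `Row` type and the helpers `positive_share` and `unlisted` are illustrative names, not part of the source. For some rows the four direction counts sum to less than the listed total; the page does not say why, so the sketch only reports the gap.

```python
from typing import NamedTuple


class Row(NamedTuple):
    outcome: str
    positive: int
    negative: int
    mixed: int
    null: int
    total: int


# Three rows copied verbatim from the matrix above.
ROWS = [
    Row("Governance & Regulation", 796, 388, 185, 119, 1512),
    Row("AI Safety & Ethics", 214, 273, 65, 33, 589),
    Row("Job Displacement", 12, 80, 19, 1, 112),
]


def positive_share(row: Row) -> float:
    """Fraction of the row's total claims that report a positive finding."""
    return row.positive / row.total


def unlisted(row: Row) -> int:
    """Claims counted in Total but not in any of the four direction columns."""
    return row.total - (row.positive + row.negative + row.mixed + row.null)


for row in ROWS:
    print(f"{row.outcome}: {positive_share(row):.1%} positive, "
          f"{unlisted(row)} claims outside the four listed directions")
```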

Measures to promote digital inclusion, equitable AI development, investment in education, and international cooperation are urgently needed to spread the benefits of AI more widely and equitably.

Normative recommendation in the paper, based on its analysis of global disparities and risks; the excerpt provides no policy evaluation or impact estimates.

high positive GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... policy interventions for digital inclusion and equitable AI distribution

High-income regions are pioneers in the implementation of AI.

Assertion in the paper based on cross‑regional comparison of AI implementation (no specific metrics, methods, or sample size provided in the excerpt).

high positive GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... AI implementation/adoption

High-income regions (North America, Europe, parts of the Asia-Pacific region) have virtually complete access to the Internet.

Statement in the paper based on a global comparative analysis of internet access across regions; the excerpt does not report specific data sources, methods, or sample size.

high positive GLOBAL DISPROPORTIONS IN THE IMPLEMENTATION AND USE OF ARTIF... Internet access rates

Qiushi Engine performed thousands of LLM-mediated reasoning, measurement and revision actions during its investigations (e.g., 3,242 LLM calls, 1,242 tool calls).

Operational logs and activity counts reported in the paper: 145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes, 44 scripts.

high positive End-to-end autonomous scientific discovery on a real optical... scale of automated research activity (counts of LLM calls, tool calls, notes, sc...

Qiushi Engine combines nonlinear research phases, Meta-Trace memory and a dual-layer architecture to maintain adaptive and stable research trajectories across long-horizon investigations.

System architecture and methods section describing nonlinear research phases, Meta-Trace memory, and dual-layer architecture; demonstrated operation across long-horizon tasks in experiments (thousands of LLM and tool calls).

high positive End-to-end autonomous scientific discovery on a real optical... ability to maintain adaptive and stable research trajectories over long-horizon ...

The AI-discovered optical bilinear mechanism suggests a route towards high-speed, energy-efficient optical hardware for pairwise computation.

Interpretive claim based on the structural analogy between the discovered optical bilinear interaction and Transformer attention; conceptual argument provided in the paper rather than measured hardware speed or energy benchmarks.

high positive End-to-end autonomous scientific discovery on a real optical... potential for high-speed, energy-efficient optical hardware (conceptual implicat...

In an open-ended study (145.9 million tokens, 3,242 LLM calls, 1,242 tool calls, 163 research notes and 44 scripts), Qiushi Engine proposes and experimentally validates an optical bilinear interaction, a physical mechanism structurally analogous to a core operation in Transformer attention.

Open-ended experimental study reported in the paper with the listed activity metrics (145.9M tokens, 3,242 LLM calls, etc.); the paper presents experimental measurements that it says validate the optical bilinear interaction, and it draws a structural analogy to the pairwise operation in Transformer attention.

high positive End-to-end autonomous scientific discovery on a real optical... experimental validation of an optical bilinear interaction mechanism

Qiushi Engine autonomously reproduces a published transmission-matrix experiment on a non-original platform.

Experimental reproduction reported in the paper, which describes running the published transmission-matrix experiment with the Qiushi Engine on a different (non-original) optical platform and compares the measured results with the published experiment.

high positive End-to-end autonomous scientific discovery on a real optical... successful reproduction of a published transmission-matrix experiment (experimen...

Qiushi Discovery Engine is an LLM-based agentic system for end-to-end autonomous scientific discovery on a real optical platform.

Description and implementation of the Qiushi Engine combining LLM-based agentic control with an optical experimental platform; system design and end-to-end experiments reported in the paper (no randomized trial; system demonstration).

high positive End-to-end autonomous scientific discovery on a real optical... existence and operation of an end-to-end autonomous LLM-driven discovery system ...

The practical aim is to help strategic leaders and system designers recognize the configuration at work, notice when it shifts, and judge whether it fits the decision before them.

Stated aim/objective of the paper (normative guidance; conceptual).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... leaders' capacity to detect configuration, detect shifts, and assess fitness of ...

The framework introduces 'co-adaptability'—the capacity of a configuration to improve as human and non-human participants adjust together—and situates it within 'heterogeneous teaming' where participants may vary by number, substrate, model architecture, capability, speed, memory, and form of participation.

Conceptual/theoretical introduction of new constructs (co-adaptability and heterogeneous teaming) in the paper; definitional rather than empirical.

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... capacity for joint improvement through adaptation between human and AI participa...

The five positions serve as landmarks that help leaders recognize configurations as they layer, drift, or change in a single decision.

Normative/conceptual claim supported by the framework; no empirical validation or sample provided in the excerpt.

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... leaders' ability to recognize shifting decision configurations

The spectrum focuses attention on where leadership work occurs: who frames the problem, who redirects the work, and who can answer for what follows.

Conceptual argument in the paper describing the axes/criteria of the spectrum (theoretical/thematic analysis; no empirical data reported).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... allocation of leadership activities (framing, redirecting, accountability) in hu...

This paper offers a leadership-facing spectrum to see human–AI decision relationships with five positions: Pure Human, Centaur (human-dominant, with AI in the loop), Co-equal, Minotaur (AI-dominant, with humans in the loop), and Pure AI.

Conceptual presentation in the paper: a theorized five-position spectrum (no empirical sample or experiment reported).

high positive Leading Across the Spectrum of Human-AI Relationships: A Con... presence of a conceptual spectrum for classifying human–AI decision configuratio...

The paper formalizes these limitations, addresses four alternative views, and proposes a co-existence solution plus a call to action for system builders, benchmark designers, and the memory community.

Meta-claim about the paper's content: formalization, rebuttals, and recommendations stated in the abstract; no empirical sample reported in abstract.

high positive Contextual Agentic Memory is a Memo, Not True Memory proposed research and design agenda (co-existence of lookup and weight-based mem...

Complementary Learning Systems (CLS) theory shows biological intelligence solved this problem by pairing fast hippocampal exemplar storage with slow neocortical weight consolidation.

Appeal to established neuroscience theory (CLS); the paper draws on CLS literature to justify the two-system solution in biology; no new empirical sample reported in abstract.

high positive Contextual Agentic Memory is a Memo, Not True Memory memory architecture in biological intelligence (hippocampus + neocortex)

AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall.

Policy/design recommendation based on the paper's analyses of 27K annotated transcripts showing links between user fluency, engagement patterns, failure visibility, recovery, and success.

high positive A paradox of AI fluency product design recommendation (encouraging deep engagement)

Individuals should adopt a stance of active engagement rather than passive acceptance.

Interpretive recommendation derived from observed differences in outcomes by user fluency in the 27K annotated transcript analysis (paper’s discussion/recommendation section).

high positive A paradox of AI fluency recommended user behavior (active engagement)

Fluent users' failures are more likely to lead to partial recovery.

Analysis of conversation trajectories in the 27K annotated transcripts showing higher incidence of partial recovery (follow-up iterations leading to partial fix) after failures by fluent users.

high positive A paradox of AI fluency partial recovery rate after failures

Fluent users' failures tend to be visible (a direct consequence of their engagement).

Annotations of failure visibility within the 27K transcripts, comparing frequency of visible vs. invisible failures across fluency levels.

high positive A paradox of AI fluency visibility of failures (visible vs. invisible failures)

Fluent users take on more complex tasks than novices.

Observational analysis of a richly annotated sample of 27,000 transcripts drawn from the WildChat-4.8M dataset; transcripts were annotated for user fluency and task characteristics (as reported in the paper).

high positive A paradox of AI fluency task complexity

Organizations should cultivate a culture of critical engagement with AI outputs, and e-leadership development must focus on building competencies in mediating, filtering and legitimizing AI contributions within digital workflows.

Recommendations based on thematic analysis of interviews with 34 project managers; presented as implications rather than empirically tested interventions.

high positive E-leadership and human-AI collaboration: socio-technical ali... organizational practices / e-leadership competencies (intended to improve team/o...

To achieve balanced augmentation, leaders must proactively frame AI's role, embedding validation checkpoints and human authorship clauses to maintain accountability.

Prescriptive recommendation derived from thematic findings and cross-case patterns in the 34 interviews; no experimental or longitudinal testing reported.

high positive E-leadership and human-AI collaboration: socio-technical ali... accountability / balanced augmentation (implied improvement in team effectivenes...

Proactive engagement combined with creation-oriented use generated the highest effectiveness.

Qualitative coding and cross-case comparisons in the thematic analysis of 34 interviews identified combinations of proactive e-leadership and creation-oriented AI use associated with reported high team effectiveness.

high positive E-leadership and human-AI collaboration: socio-technical ali... perceived team effectiveness

The trajectory of the curvilinear relationship is governed by e-leadership practices.

Thematic analysis of the 34 interviews shows recurring references to leadership practices as moderators of AI-use effectiveness.

high positive E-leadership and human-AI collaboration: socio-technical ali... perceived team effectiveness (as moderated by e-leadership)

Based on these insights, we offer design recommendations for generative AI-powered learning tools for freelancers.

Paper contribution section — authors present design recommendations derived from study findings (not an empirical claim about an evaluated intervention).

high positive Upskilling with Generative AI: Practices and Challenges for ... design guidance intended to improve generative AI learning tool suitability/effe...

Freelancers increasingly rely on generative AI to structure learning and support exploratory skill acquisition.

Reported finding from the paper's mixed-methods study (survey + semi-structured interviews with freelance knowledge workers).

high positive Upskilling with Generative AI: Practices and Challenges for ... use of generative AI tools for structuring learning and exploratory skill acquis...

We evaluated fidelity, calibration, cost, and gaming vulnerability of the proposed attribution approach across more than 400 configurations.

Empirical experimental section of the paper reporting evaluation across >400 model/configuration runs (paper text: 'more than 400 configurations').

high positive Calibrating Attribution Proxies for Reward Allocation in Par... fidelity, calibration, computational cost, and vulnerability to gaming of attrib...

Gradient-based attribution on gridded GFS analysis inputs is a viable candidate value signal for individual sensor contributions.

Experiments reported in the paper applying gradient attribution to gridded GFS analysis inputs; methodological evaluation described.

high positive Calibrating Attribution Proxies for Reward Allocation in Par... suitability of gradient-based attribution as a value signal
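
As a rough illustration of a gradient-based value signal over gridded inputs, the sketch below scores each cell of a toy grid by |gradient × input| and normalizes the scores into shares. Everything in it is an assumption made for illustration: the toy model `forecast`, its weights, the grid values, and the gradient-times-input scoring rule. The excerpt does not specify the paper's actual weather model, its GFS inputs, or its attribution method.

```python
import math


def forecast(grid, weights):
    """Toy differentiable model: weighted sum of tanh-squashed cell values."""
    return sum(
        w * math.tanh(x)
        for row_x, row_w in zip(grid, weights)
        for x, w in zip(row_x, row_w)
    )


def gradient(grid, weights):
    """Analytic derivative of forecast w.r.t. each cell: w * (1 - tanh(x)^2)."""
    return [
        [w * (1.0 - math.tanh(x) ** 2) for x, w in zip(row_x, row_w)]
        for row_x, row_w in zip(grid, weights)
    ]


def attribution_shares(grid, weights):
    """Per-cell |gradient * input|, normalized so the shares sum to 1."""
    grads = gradient(grid, weights)
    scores = [
        [abs(g * x) for g, x in zip(row_g, row_x)]
        for row_g, row_x in zip(grads, grid)
    ]
    total = sum(sum(row) for row in scores)
    if total == 0:
        return scores  # every cell has zero attribution; nothing to normalize
    return [[s / total for s in row] for row in scores]


# Hypothetical 2x2 grid of input values and hypothetical model weights.
grid = [[0.5, -1.0], [0.0, 2.0]]
weights = [[1.0, 0.5], [2.0, 0.3]]
shares = attribution_shares(grid, weights)
for row in shares:
    print([round(s, 3) for s in row])
```

Under this scoring rule a cell whose input is zero receives zero share regardless of its gradient. That is a known property of gradient-times-input scoring and one reason a value signal of this kind would need calibration before it drives rewards.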

Differentiable AI weather models can be utilised to fill the gap between data-quality methods and adjoint-based data valuation, providing a practical value signal.

Methodological proposal and motivation in the paper; supported by subsequent computational experiments using differentiable AI weather models.

high positive Calibrating Attribution Proxies for Reward Allocation in Par... feasibility of using differentiable AI models as a data valuation mechanism

Large-scale IoT weather sensing networks require incentive mechanisms to sustain participation.

Position/assertion in introduction and motivation section of the paper (conceptual argument; no empirical sample reported).

high positive Calibrating Attribution Proxies for Reward Allocation in Par... need for incentive mechanisms to sustain participation

Models across all three families acquire interpretable mechanical reasoning strategies without fine-tuning.

Observation reported for the three open-source models used in experiments (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B) showing emergent, interpretable mechanical reasoning during the iterative design process without any model fine-tuning.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... acquisition of interpretable mechanical reasoning strategies

The system correctly diagnoses underconstraint failure modes 35.6% of the time.

Reported diagnostic accuracy for underconstraint failure mode in the experimental results (35.6%).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... accuracy in diagnosing underconstraint failure mode

The system correctly diagnoses overconstraint failure modes 56.3% of the time.

Reported diagnostic accuracy for overconstraint failure mode in the experimental results (56.3%).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... accuracy in diagnosing overconstraint failure mode

78.6% of iterative refinement trajectories show measurable improvement.

Reported aggregate statistic from the experimental evaluation of iterative refinement trajectories (share of trajectories showing measurable improvement).

high positive Language Models Refine Mechanical Linkage Designs Through Sy... presence of measurable improvement across iterative refinement trajectories

The modular architecture improves structural validity by up to 134% over monolithic baselines.

Empirical results reported across six motion targets and three models comparing modular architecture to monolithic baselines; the paper reports an improvement in structural validity up to 134%.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... structural validity of linkage designs

The modular architecture reduces geometric error by up to 68% over monolithic baselines.

Empirical results reported across six engineering-relevant motion targets and three open-source models comparing the modular architecture to monolithic baselines; the paper states a maximum reduction of geometric error of 68%.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... geometric error
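
The figures "up to 134%" and "up to 68%" are relative changes against the monolithic baseline, so a short worked example may help. The baseline values below are invented for illustration; the excerpt reports only the percentages, not the underlying scores. A 134% improvement corresponds to 2.34 times the baseline value, and a 68% reduction leaves 0.32 times the baseline value.

```python
def relative_change(baseline: float, value: float) -> float:
    """Signed change of value relative to baseline (0.5 means +50%)."""
    return (value - baseline) / baseline


# Hypothetical baseline scores, chosen only to make the arithmetic concrete.
baseline_validity = 0.25
modular_validity = baseline_validity * (1 + 1.34)  # +134% relative to baseline

baseline_error = 10.0
modular_error = baseline_error * (1 - 0.68)        # -68% relative to baseline

print(f"validity: {relative_change(baseline_validity, modular_validity):+.0%}")
print(f"error:    {relative_change(baseline_error, modular_error):+.0%}")
```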

Language models can systematically improve linkage designs through symbolic representations.

Reported experiments using a modular architecture combining language-model agents and numerical optimisers across six engineering-relevant motion targets and three open-source models (Llama 3.3 70B, Qwen3 4B, Qwen3 MoE 30B-A3B); comparisons reported versus monolithic baselines.

high positive Language Models Refine Mechanical Linkage Designs Through Sy... quality of linkage designs (geometric error, structural validity)

The proposed framework emerged from operational work to improve clinician capability in a live value-based care deployment.

Stated as originating from operational experience in a live deployment; no details on deployment scale, sample size, or outcomes provided in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... improvement of clinician capability through operational application of the frame...

Training environments that combine longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics.

Normative/theoretical argument presented in the paper; no empirical tests or sample sizes reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... alignment of learned reward model to patient trajectory versus encounter-level i...

Chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties for learning: longitudinal density, concentrated decision space, outcome labels, and natural capability variation.

Argument/claim in the paper that outcome-based contracts and chronic disease management produce favorable data characteristics; asserted as part of the framework motivation. No quantitative empirical evidence or sample sizes provided in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... suitability of collected override data for training outcome-aligned reward model...

We propose a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, which prevents a failure mode we term 'suppression bias'—the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold.

Proposed algorithmic contribution and theoretical claim; suppression bias defined and a mitigation approach described. No empirical evaluation or sample sizes given in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... reduction or prevention of suppression bias in learned recommendations

We formulate preferences conditioned on patient state s, organizational context c, and clinician capability κ, where κ decomposes into execution capability (κ-exec) and alignment capability (κ-align).

Presented as a formal model formulation in the paper; theoretical description without empirical sample sizes in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... representational fidelity of preference model to contextual factors (patient, or...
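
One way to write such a conditioned preference, shown only as a sketch, is the Bradley-Terry form commonly used in RLHF. The reward function $r_\theta$, the logistic function $\sigma$, and the action symbols $a$ (the clinician's chosen action) and $a'$ (the overridden AI recommendation) are assumptions introduced here; the excerpt does not give the paper's actual functional form.

```latex
% Assumed Bradley-Terry form; not taken from the paper.
\kappa = \left(\kappa_{\text{exec}},\, \kappa_{\text{align}}\right), \qquad
P\left(a \succ a' \mid s, c, \kappa\right)
  = \sigma\!\left( r_\theta(s, c, \kappa, a) - r_\theta(s, c, \kappa, a') \right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```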

We introduce a five-category override taxonomy that maps override types to distinct model update targets.

Stated as a formal contribution of the framework; taxonomy proposed in the paper. No empirical validation or sample size reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... categorization of clinician overrides to inform model updates

Clinician overrides of clinical AI recommendations can be reframed as implicit preference data analogous to reinforcement learning from human feedback (RLHF), but richer because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable.

Conceptual argument presented in the paper drawing an analogy to RLHF; no empirical metrics or sample size reported in the excerpt.

high positive Learning from Disagreement: Clinician Overrides as Implicit ... quality of preference signal available for learning reward models from clinician...

Scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

Authors' conclusion/argument based on the methods and preliminary experimental results presented in the paper (interpretive claim rather than a quantified empirical result).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... suitability as a substrate for agent self-improvement and agentic RL

Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs.

Argumentative/theoretical scalability claim based on the abundance of personas and the scalable design of the methodology (no empirical demonstration at millions/billions scale reported).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... scalability potential (number of synthetic user worlds producible)

Each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average.

Reported runtime and turn-count metrics from the preliminary experiments (per-run runtime >8 hours; per-run average >2,000 turns).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... agent runtime per simulation run; number of turns per run

In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them.

Reported preliminary experiment count in the paper (explicit statement: 1,000 synthetic computers were created and simulated).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... number of synthetic computers created and simulated

Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer ... until these objectives are completed.

Description of the two-agent simulation procedure in the paper (simulation design: objective-creating agent and user-acting agent executing tasks across the synthetic computer).

high positive Synthetic Computers at Scale for Long-Horizon Productivity S... ability to simulate long-horizon, user-conditioned productivity workflows
