Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
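
The matrix can be read mechanically. The sketch below copies four rows from the table and computes, for each, the share of positive claims and the gap between Total and the sum of the four direction columns. In some rows the direction columns sum to less than Total (1957 versus 2007 for Other), which may mean some claims carry no direction label; that reading is an assumption, as are the helper names and the choice to treat the dash placeholder as zero.

```python
DASH = "\u2014"  # the placeholder character the matrix uses for an empty cell

# Rows copied from the matrix: (outcome, positive, negative, mixed, null, total).
ROWS = [
    ("Other", "758", "199", "100", "900", "2007"),
    ("AI Safety & Ethics", "218", "277", "65", "33", "599"),
    ("Job Displacement", "12", "80", "20", "1", "113"),
    ("Labor Share of Income", "17", "19", "17", DASH, "53"),
]


def to_int(cell: str) -> int:
    """Parse one cell, treating the dash placeholder as zero (an assumption)."""
    return 0 if cell.strip() in {DASH, ""} else int(cell)


def summarize(row):
    """Return the positive share and the count not covered by any direction column."""
    outcome, *cells = row
    positive, negative, mixed, null, total = map(to_int, cells)
    directional = positive + negative + mixed + null
    return {
        "outcome": outcome,
        "positive_share": round(positive / total, 3),
        "unlabelled": total - directional,
    }


summaries = [summarize(r) for r in ROWS]
for s in summaries:
    print(s)
```

For these rows the gap is 50 for Other, 6 for AI Safety & Ethics, and 0 for the other two.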

Filter: Human-AI Collaboration

We introduce unbounded cognitive fusion (UCF) as a new theoretical framework explaining coordination through cognitive synthesis rather than price signals or authority structures.

Theoretical proposal and framing within the paper; conceptual development rather than empirical validation.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... organizational coordination explained via cognitive synthesis

Generative artificial intelligence (GenAI) fundamentally alters [traditional organizational coordination] assumptions by augmenting human cognitive capabilities across organizational boundaries.

Position paper argumentation and conceptual reasoning presented in the abstract; no empirical data or sample reported.

high positive Beyond markets and hierarchies: How GenAI enables unbounded ... human cognitive capability augmentation

New tendencies in managerial AI research and practice include explainable AI, human–AI collaboration, knowledge management, enterprise analytics, and algorithmic management.

Descriptive finding from the paper's literature synthesis (topics emphasized in the review); no quantitative prevalence or counts provided in the abstract.

high positive Artificial intelligence, machine learning, and deep learning... emergent research and practice topics / adoption tendencies

Machine Learning and Deep Learning enhance employee productivity, business intelligence, process mining, and data-driven decision-making by enabling prediction, perception, and adaptive learning solutions.

Claim synthesized in the review from multiple studies identified via PRISMA screening; abstract does not list the number or identity of underlying empirical studies.

high positive Artificial intelligence, machine learning, and deep learning... employee productivity, effectiveness of business intelligence and process mining...

AI-based technologies can greatly enhance managerial efficiency by automating repetitive activities, improving resource allocation, enabling intelligent scheduling, and supporting predictive modelling and strategic planning.

Summary conclusion from the paper's literature review (PRISMA methodology referenced); no quantitative meta-analytic effect sizes provided in abstract.

high positive Artificial intelligence, machine learning, and deep learning... managerial efficiency, automation of repetitive tasks, resource allocation, sche...

Machine Learning, Artificial Intelligence, and Deep Learning are tools that can optimize managerial decisions, enable intelligent automation, streamline workflows, and improve organizational performance.

Synthesis claim from the paper's PRISMA-based literature review (no numeric sample size reported in the abstract).

high positive Artificial intelligence, machine learning, and deep learning... managerial decision quality, automation, workflow streamlining, organizational p...

Findings underscore the importance of robust evaluation frameworks for deploying VLMs in visually rich and safety-critical environments.

Synthesis/recommendation based on experimental results showing that visual inputs (images and colors) can influence VLM decisions and that mitigation effectiveness varies by model.

high positive The Effects of Visual Priming on Cooperative Behavior in Vis... need for/importance of robust evaluation frameworks for VLM safety and reliabili...

Policy frameworks, reskilling initiatives, and institutional adaptations are required to ensure inclusive technological progress.

Prescriptive conclusion presented in abstract based on the review and synthesis; no empirical validation or sample sizes provided in abstract.

high positive AI and the Transformation of Human Employment: Challenges, O... effectiveness of policy and reskilling to ensure inclusion

AI simultaneously generates demand for higher-order problem solving, emotional intelligence, and human-AI collaboration skills.

Explicit finding reported in abstract from the review of interdisciplinary literature; no quantified effect sizes or sample sizes provided in abstract.

high positive AI and the Transformation of Human Employment: Challenges, O... demand for higher-order skills / skill acquisition requirements

For memory workloads requiring stable facts and stateful computation, architecture matters more than retrieval scale or model strength alone.

Conclusion drawn by the authors based on comparative experimental results reported in the paper (xmemory vs retrieval/model-strength baselines); excerpt provides aggregate benchmark comparisons but not full experimental details.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... relative importance of system architecture versus retrieval/model strength for m...

On the application-level task, xmemory reaches 95.2% accuracy, outperforming specialised memory systems, code-generated Markdown harnesses, and customer-facing frontier-model application harnesses.

Empirical evaluation on an application-level task reported in the paper showing 95.2% accuracy for xmemory and claiming it outperforms several classes of alternative systems; excerpt lacks details on the task, dataset size, or baseline numeric results.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... accuracy on an application-level memory task

On the end-to-end memory benchmark, xmemory reaches 97.10% F1, compared with 80.16%-87.24% across the third-party baselines.

Empirical evaluation on the paper's end-to-end memory benchmark reporting F1 scores for xmemory and a range for third-party baselines; the excerpt does not provide dataset size or statistical significance details.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... F1 score on an end-to-end memory benchmark

On the structured extraction benchmark (judge-in-the-loop configuration) the system reaches 90.42% object-level accuracy and 62.67% output accuracy, above all tested frontier structured-output baselines.

Empirical evaluation on the paper's structured extraction benchmark in the judge-in-the-loop configuration; the excerpt reports the numeric accuracies and states they exceed tested frontier structured-output baselines. The excerpt does not specify dataset size or number of runs.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... object-level accuracy and output accuracy on a structured extraction benchmark

This iterative, schema-aware write-path design shifts interpretation from the read path to the write path: reads become constrained queries over verified records rather than repeated inference over retrieved prose.

Conceptual claim about how the proposed architecture affects system behavior; supported by the architectural description in the paper rather than explicit quantitative evidence in the excerpt.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... nature of read queries (constrained queries over verified records vs repeated in...

We present an iterative, schema-aware write path that decomposes memory ingestion into object detection, field detection, and field-value extraction, with validation gates, local retries, and stateful prompt control.

Description of the proposed method/architecture in the paper (methodological contribution); no numeric evaluation attached to the description in the excerpt.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... design/components of memory ingestion pipeline
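
The staged write path described in this entry can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: the schema, the function names, and the validation rules are all hypothetical; plain callables stand in for the model calls; and stateful prompt control is not modelled. The gate that requires a value to appear verbatim in the input is one assumed way to honour "never inferred".

```python
from typing import Callable, Optional

# Hypothetical schema: object type -> fields that may be remembered.
SCHEMA = {"Invoice": {"number", "amount"}}


def with_retries(step: Callable[[], object],
                 is_valid: Callable[[object], bool],
                 max_attempts: int = 3) -> Optional[object]:
    """Run one stage, retrying locally until its validation gate passes."""
    for _ in range(max_attempts):
        result = step()
        if is_valid(result):
            return result
    return None  # the gate never passed, so nothing is written for this stage


def ingest(text, detect_object, detect_fields, extract_value):
    """Object detection, then field detection, then per-field value extraction."""
    obj = with_retries(lambda: detect_object(text), lambda o: o in SCHEMA)
    if obj is None:
        return None
    fields = with_retries(lambda: detect_fields(text, obj),
                          lambda fs: bool(fs) and set(fs) <= SCHEMA[obj])
    if fields is None:
        return None
    record = {}
    for name in sorted(fields):
        value = with_retries(lambda: extract_value(text, obj, name),
                             lambda v: isinstance(v, str) and v in text)
        if value is not None:  # a missing value is omitted, never guessed
            record[name] = value
    return {"type": obj, "fields": record}


# Usage with stub extractors in place of model calls.
text = "Invoice INV-7 for 120 EUR"
record = ingest(
    text,
    detect_object=lambda t: "Invoice",
    detect_fields=lambda t, o: {"number", "amount"},
    extract_value=lambda t, o, f: {"number": "INV-7", "amount": "120 EUR"}[f],
)
rejected = ingest(text, lambda t: "Unknown", lambda t, o: set(), lambda t, o, f: "")
print(record, rejected)
```

Because each stage retries on its own, a failed field extraction does not force the object and field detection steps to be repeated.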

Reliable external AI memory must be schema-grounded (schemas define what must be remembered, what may be ignored, and which values must never be inferred).

Normative assertion supported by the paper's proposed design and subsequent experimental results (the paper introduces a schema-grounded approach and evaluates it against benchmarks), though the excerpt does not give full methodological details or sample sizes for this claim alone.

high positive From Unstructured Recall to Schema-Grounded Memory: Reliable... reliability/stability of external AI memory

To manage AI legibility, creators perform four recurring forms of invisible authenticity labor: epistemic verification, linguistic naturalization, narrative restructuring, and performative embodiment.

Authors identify and name four recurrent practices from coding and analysis of 16 in-depth interviews with creators on Xiaohongshu and Douyin describing specific downstream repair and performance work.

high positive AI passing and invisible authenticity labor: trust vulnerabi... types of labor performed to conceal/humanize AI outputs

Creators engage in 'AI passing': strategic efforts to conceal and humanize AI-assisted drafts so that outputs plausibly appear human-authored.

Concept introduced based on analysis of 16 in-depth interviews with creators on Xiaohongshu and Douyin describing tactics to hide AI involvement and present content as human-authored.

high positive AI passing and invisible authenticity labor: trust vulnerabi... use of concealment/humanization strategies for AI outputs

Effective governance requires coordinated action across technical, organizational, and regulatory domains (e.g., system-level audits, vendor guidelines, continuous monitoring, documentation across dependency chains) to establish meaningful accountability in distributed development environments.

Policy and technical recommendations derived from literature review, regulatory analysis, and the paper's conceptual findings (recommendation, not empirically validated).

high positive How Supply Chain Dependencies Complicate Bias Measurement an... effectiveness of governance measures in producing meaningful accountability for ...

Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

Conclusion/recommendation drawn from the benchmark design and experimental findings; conceptual claim advocating evaluation grounded in external demand signals and verifiable actions.

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... evaluation grounding (use of fresh external demand signals and verifiable agent ...

The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule.

Benchmark release statistics reported in the paper: explicit counts of tasks and evaluated models (105 tasks; 13 models).

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... benchmark scope (number of tasks) and evaluation breadth (number of models)

For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions.

Grading methodology described in the paper: instrumentation and hybrid deterministic/LLM-judging approach documented by authors (procedural description).

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... grading/verifiability pipeline (traces, logs, deterministic checks, structured L...
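
One way to picture the hybrid grading rule in this entry is sketched below. It is an illustration under stated assumptions rather than the benchmark's grader: `Dimension`, `grade`, the evidence layout, and the stub judge are hypothetical, and marking a non-semantic dimension without sufficient evidence as "ungraded" is a choice made here, not something the source specifies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Dimension:
    name: str
    semantic: bool
    # A deterministic check returns True/False, or None when evidence is insufficient.
    check: Optional[Callable[[dict], Optional[bool]]] = None


def grade(dimensions, evidence, judge):
    """Prefer deterministic checks; call the judge only for semantic dimensions."""
    results = {}
    for dim in dimensions:
        verdict = dim.check(evidence) if dim.check else None
        if verdict is not None:
            results[dim.name] = ("deterministic", verdict)
        elif dim.semantic:
            results[dim.name] = ("judge", judge(dim.name, evidence))
        else:
            results[dim.name] = ("ungraded", False)  # assumption: no fallback to the judge
    return results


# Usage with made-up evidence and a stub in place of a structured LLM judge.
evidence = {
    "service_state": {"ticket_42": "closed"},
    "workspace": {"report.md": "Summary of the fix"},
}
dims = [
    Dimension("ticket_closed", semantic=False,
              check=lambda e: e["service_state"].get("ticket_42") == "closed"),
    Dimension("report_is_clear", semantic=True),
    Dimension("audit_log_complete", semantic=False),
]
results = grade(dims, evidence, judge=lambda name, e: True)
print(results)
```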

Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders.

Description of release construction in the methods: uses public workflow-demand data and ClawHub Top-500 skills; tasks are materialized with controlled fixtures and graders (procedural detail from the paper).

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... composition of benchmark releases (source signals and materialization strategy)

We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot.

Methodological contribution described in the paper; design and architecture of the benchmark are presented by the authors (design description, no external sample needed).

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... benchmark design (refreshable signal layer vs. time-stamped snapshot)

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces.

Framing/background statement in the paper describing expected capabilities of workflow agents; no empirical sample size reported for this expectation.

high positive Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-Wor... ability to complete end-to-end units of work

The proposed framework balances AI-driven productivity with the epistemic sovereignty necessary to manage increasingly opaque software ecosystems.

Normative/architectural claim about the proposed framework; presented conceptually in the paper without reported empirical testing in the excerpt.

high positive Cognitive Atrophy and Systemic Collapse in AI-Dependent Soft... balance between productivity gains and maintenance of epistemic sovereignty (hum...

To preserve long-term resilience, engineering leaders must move beyond prompt-based development to implement rigorous human-in-the-loop pedagogical standards.

Prescriptive recommendation based on the paper's conceptual analysis; no randomized trials or empirical validation of this intervention reported in the excerpt.

high positive Cognitive Atrophy and Systemic Collapse in AI-Dependent Soft... long-term resilience of engineering organizations when using human-in-the-loop p...

The findings offer practical insights for construction firms to enhance innovation performance through effective AI integration and help engineers better leverage AI tools in design and project management workflows.

Authors' stated practical implications based on their empirical findings (survey results linking AI capability, decision-making quality, and innovation performance).

high positive AI meets engineering ingenuity: how AI capability enhances i... innovation performance

Algorithmic transparency positively moderates the relationship between AI capability and decision-making quality.

Moderation analysis reported on questionnaire data (Credamo, time-lagged) with n=435; authors state a positive moderating effect of algorithmic transparency.

high positive AI meets engineering ingenuity: how AI capability enhances i... decision-making quality

Decision-making quality mediates the relationship between AI capability and innovation performance.

Mediation analysis reported on the same survey dataset (time-lagged Credamo survey) with n=435 using established measurement scales; stated in results.

high positive AI meets engineering ingenuity: how AI capability enhances i... innovation performance

AI capability is positively associated with innovation performance.

Authors report statistical analysis of questionnaire data collected via the Credamo platform (time-lagged design) using established scales; sample size n=435; result stated in findings.

high positive AI meets engineering ingenuity: how AI capability enhances i... innovation performance
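
The three survey findings for this paper (association, mediation, moderation) are conventionally written as a set of linear regressions. The form below is a generic sketch under that assumption; the excerpt does not give the authors' exact specification, and the symbols are introduced here for illustration: $AI$ is AI capability, $DQ$ decision-making quality, $IP$ innovation performance, and $T$ algorithmic transparency.

```latex
\begin{align}
IP_i &= c_0 + c\,AI_i + \varepsilon_{1i}
  && \text{(total effect, } c > 0\text{)} \\
DQ_i &= a_0 + a\,AI_i + g\,T_i + m\,(AI_i \times T_i) + \varepsilon_{2i}
  && \text{(positive moderation, } m > 0\text{)} \\
IP_i &= b_0 + c'\,AI_i + b\,DQ_i + \varepsilon_{3i}
  && \text{(mediation through } DQ\text{)}
\end{align}
```

In this form the indirect effect of AI capability on innovation performance at a given transparency level is $(a + m\,T)\,b$, so a positive $m$ means the mediated path strengthens as transparency increases.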

Commonly reported gains include the automation of trivial and repetitive tasks.

Multiple studies in the review report that LLM-assistants automate mundane programming tasks.

high positive The Impact of LLM-Assistants on Software Developer Productiv... automation of low-complexity tasks / developer time freed

Commonly reported gains include minimized code search due to LLM assistance.

Synthesis of study findings noting reductions in developer time spent searching for code or answers.

high positive The Impact of LLM-Assistants on Software Developer Productiv... time/effort spent searching for code or information

Commonly reported gains from LLM-assistants include accelerated development (faster task completion).

Multiple included studies report faster development workflows and reduced time-to-complete tasks, as synthesized in the review.

high positive The Impact of LLM-Assistants on Software Developer Productiv... task completion time / development speed

The majority of reviewed studies report considerable benefits from LLM-assistants.

Synthesis of findings across the 39 included peer-reviewed studies as reported in the review.

high positive The Impact of LLM-Assistants on Software Developer Productiv... overall reported impact on developer productivity

Capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.

Recommendation based on the authors' empirical deployment and analysis of failure modes and mitigation effectiveness across the end-to-end pipeline.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... evaluation scope for capital-managing agents

Targeted harness changes increased capital deployment from 42.9% to 78.0% in an affected test population.

A/B or pre/post testing in an affected test population measuring percentage of capital deployed before and after harness changes.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... capital deployment rate

Targeted harness changes reduced fee-led observations from 32.5% to below 10% in an affected test population.

A/B or pre/post testing in an affected test population measuring incidence of fee-led observations before and after harness changes.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... incidence of fee-led observations

Targeted harness changes reduced fabricated sell rules from 57% to 3% in an affected test population.

A/B or pre/post testing in an affected test population measuring incidence of fabricated sell-rule observations before and after harness changes (percentage rates reported).

high positive Operating-Layer Controls for Onchain Language-Model Agents U... incidence of fabricated sell rules

Policy-valid submitted transactions settled with 99.9% success.

Settlement logs comparing policy-valid submitted transactions to successful onchain settlements.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... settlement success rate for policy-valid submissions

We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.

Paper statement announcing release of code and dataset.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... availability of codebase and annotated dataset

The recommender achieved high relevance (MRR@1=0.75).

Reported offline/online recommender evaluation in the paper using Mean Reciprocal Rank at 1 (MRR@1) metric; presumably computed over recommendations in the study (711 conversations).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... recommendation relevance (MRR@1)
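
For readers unfamiliar with the metric: Mean Reciprocal Rank at k averages, over queries, the reciprocal of the rank of the first relevant item within the top k (zero if none appears). At k = 1 it reduces to the fraction of queries whose top recommendation is relevant. The sketch below uses made-up data chosen to produce 0.75; it is not the study's data, and the function name is an assumption.

```python
def mrr_at_k(ranked_lists, relevant_sets, k):
    """Mean reciprocal rank of the first relevant item within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Four illustrative conversations: the top recommendation is relevant in three.
ranked = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"]]
relevant = [{"a"}, {"c"}, {"e"}, {"h"}]

score_at_1 = mrr_at_k(ranked, relevant, k=1)
score_at_2 = mrr_at_k(ranked, relevant, k=2)
print(score_at_1, score_at_2)  # 0.75 0.875
```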

Step-by-step guidance improved pleasantness and reduced user burden.

User-reported measures collected in the controlled study (likely subjective ratings across participants/conversations).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... pleasantness (user satisfaction) and user burden

Device-level evidence increased correct resolutions from about 50% to over 90% relative to an LLM-only baseline.

Controlled study comparing SecMate with device-level diagnostic evidence to an LLM-only baseline; reported results across 144 participants / 711 conversations.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... correct resolutions (successful troubleshooting)

Service specificity is achieved through a proactive, context-aware recommender.

System description and recommender component evaluation in the paper.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... use of a proactive, context-aware recommender for service specificity

User specificity relies on implicit proficiency inference and profile-aware troubleshooting.

System design and algorithmic description in the paper explaining user-proficiency inference and profile-aware components.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... ability to infer user proficiency and use profiles for troubleshooting

Device specificity is provided by a lightweight local diagnostic utility.

System design and implementation details reported in the paper describing the diagnostic utility component.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... presence and role of a local diagnostic utility for device specificity

We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals.

System description and architecture presented in the paper (design and implementation of SecMate).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... system capability to integrate device, user, and service specificity

The framework produces a list of testable empirical questions that we leave as open problems.

Statement in the paper that it derives testable empirical questions from the theoretical framework; no empirical tests are executed in the paper itself.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... set of testable empirical research questions derived from the framework

The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio.

Conceptual synthesis and mapping of prior qualitative literature into the new per-task leverage formalism presented in the paper; this is a theoretical linkage rather than empirical validation.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... conceptual operationalization of supervisory control/common ground/mixed-initiat...
