Evidence (5192 claims)
- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5921 claims
- Human-AI Collaboration: 5192 claims
- Org Design: 3497 claims
- Innovation: 3492 claims
- Labor Markets: 3231 claims
- Skills & Training: 2608 claims
- Inequality: 1842 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 738 | 1617 |
| Governance & Regulation | 671 | 334 | 160 | 99 | 1285 |
| Organizational Efficiency | 626 | 147 | 105 | 70 | 955 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 349 | 109 | 48 | 322 | 838 |
| Output Quality | 391 | 121 | 45 | 40 | 597 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 277 | 145 | 63 | 34 | 526 |
| AI Safety & Ethics | 189 | 244 | 59 | 30 | 526 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 106 | 40 | 6 | 188 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 79 | 8 | 1 | 152 |
| Regulatory Compliance | 69 | 66 | 14 | 3 | 152 |
| Training Effectiveness | 82 | 16 | 13 | 18 | 131 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human–AI Collaboration
Robust resilience stems from 'bounded autonomy': constraining what an AI may decide and when humans must intervene.
Normative proposal grounded in synthesis of safety standards, crisis-management practices, and conceptual arguments; specification of autonomy dimensions (authority scope, temporal limits, performance envelopes, fail-safes).
Human–AI chat logs contain more explicit strategy commitments (stated rules) than human–human chats.
Content analysis / coding of natural-language chat logs from the human–AI experiment (human–AI n = 126) and the human–human benchmark (n = 108); coding counts show higher frequency of explicit commitments/statements of rules in human–AI messages.
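The coding comparison above amounts to comparing the rate of explicit-commitment messages across the two chat corpora. A minimal sketch of such a two-proportion test, using illustrative counts (the actual coded frequencies are not given here):

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided z-test for a difference in proportions (pooled SE)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the normal CDF
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

# Illustrative counts (hypothetical, not from the paper): subjects whose
# chats were coded as containing an explicit strategy commitment.
z, p = two_proportion_ztest(k1=62, n1=126, k2=31, n2=108)
print(f"z = {z:.2f}, p = {p:.4f}")
```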
Human–human subjects converge to Tit‑for‑Tat under one condition and to unconditional cooperation under the repeated-communication condition.
Strategy-estimation and behavioral trajectory analysis from the human–human benchmark (Dvorak & Fehrler 2024; n = 108) reported in the paper, showing condition-dependent convergence to Tit‑for‑Tat and to unconditional cooperation under repeated communication.
Strategy estimation indicates human–AI subjects tend to favor Grim Trigger when allowed pre-play communication.
Strategy-estimation/classification applied to subjects' choices in the human–AI condition with pre-play chat (subset of the human–AI n = 126); inferred strategy prevalence shows elevated assignment to Grim Trigger-type rules.
Version 1.0 of iDaVIE marks its integration into operational workflows and establishes a base for future capabilities.
Authors report that v1.0 has been used in verification and mask-refinement loops for real datasets (MeerKAT, ASKAP, APERTIF); no detailed deployment metrics provided.
Immersive inspection tools like iDaVIE are complements to automated ML pipelines by helping generate higher-quality labels and curated training examples.
Paper argues conceptual complementarity and cites iDaVIE's use for mask refinement and curated subcube export; no experimental comparison of label quality or downstream ML performance provided.
iDaVIE accelerates inspection-driven parts of astronomy workflows (e.g., mask refinement, verification).
Reported use cases where iDaVIE was used to refine masks and verify sources in real datasets; no measured time-per-task or throughput statistics provided.
iDaVIE has already been integrated into real pipelines (MeerKAT, ASKAP, APERTIF) and used to improve quality control, refine detection masks, and identify new sources.
Author statement of integration and use cases citing verification of HI data cubes from MeerKAT, ASKAP and APERTIF; no quantitative deployment counts or independent validation provided in the text.
There is a need for policies supporting workforce transitions (retraining, portability of skills) and safety/regulation for embodied agents operating in public spaces.
Policy recommendation grounded in anticipated labor and safety risks; proposed but not empirically evaluated.
Benchmarks and tasks that mix observation and intervention (imitation with sparse feedback, active imitation, transfer under domain shift, continual learning streams) are required to evaluate the architecture.
Proposal for evaluation tasks and benchmarks; not empirically validated in the paper.
Embodied robotics experiments are necessary to evaluate real-world constraints such as sample efficiency, physical affordances, and motor learning.
Methodological recommendation recognizing simulation-to-real gaps; no experiments reported.
Simulated environments (procedural, nonstationary), multi-agent social domains, and open-world 3D simulators are appropriate for scalable iteration to test the proposed architecture.
Methodological recommendation and suggested experimental approaches; not tested in the paper.
Neuromodulatory systems and meta-decision circuits in animals provide analogies for implementing meta-control (M) in artificial systems.
Neuroscience analogy cited to motivate architectural choices; not empirically instantiated in the paper.
Developmental trajectories can scaffold gradual competence (from observation to exploratory action) and should be reflected in training curricula.
Argument from developmental biology and learning theory; proposed as a design principle rather than empirically tested here.
Evolution supplies inductive biases and slow structural priors that can be leveraged in artificial learners.
Biological analogy and theoretical suggestion; no empirical experiments presented to quantify effect in AI systems.
The taxonomy and measurement approach provide operational metrics to quantify empathic communication for economic analyses (productivity, customer satisfaction, retention).
Authors propose that their data-driven taxonomy and automated/coding measures can be used as metrics; the paper demonstrates derivation and use in trial outcomes but does not present direct economic outcome measurements.
LLM-generated responses frequently score as more empathic than human-written responses in blinded evaluations.
Blinded evaluations comparing LLM-generated replies to human-written replies using recipient/judge ratings of perceived empathy (as described in the paper's blinded tests). Exact blinded-test sample sizes are not specified in the summary; the comparison derives from the study's evaluation procedures.
LLMs are more likely to complement human tacit skills than to replace explicit rule‑following jobs; value accrues to workers and firms that integrate model outputs with human judgment and tacit expertise.
Labor‑economics style argument and theoretical reasoning; no empirical labor market analysis provided.
Commoditization via rule extraction is limited; firms that can harness and deploy tacit LLM capabilities will retain economic rents.
Theoretical economic argument based on non‑rule‑encodability; no empirical firm‑level data included.
The highest‑value attributes of LLMs may be inherently non‑decomposable into simple, auditable rules, which increases the value of proprietary, black‑box models and strengthens economies of scale and scope for large model providers.
Economic reasoning and theoretical implications drawn from the central thesis; no empirical market analyses provided.
Some LLM capabilities are tacit, practice‑derived, or 'insight'‑like, akin to the Chinese concept of Wu (sudden insight through practiced skill).
Philosophical framing and analogy to the concept of tacit knowledge (Wu); argumentative rather than empirical support.
The economically valuable capabilities of large language models are precisely those that cannot be fully encoded as a complete, human‑readable set of discrete rules.
Formal, conceptual argument (proof by contradiction) plus qualitative historical case analysis comparing expert systems and LLMs; no new empirical datasets or experiments reported.
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.
Release of SlideRL dataset (288 rollouts) and code repository; general statement about reproducibility benefits.
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.
Observed empirical gain of +33.1% for the tuned 7B model over its untuned base, and 91.2% relative performance versus Claude Opus 4.6; the cost-effectiveness implication (tuning few parameters rather than scaling model size) is drawn from these results.
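The paper's exact fine-tuning setup is not detailed here; one common way to make well under 1% of parameters trainable is a LoRA-style low-rank adapter. A minimal numpy sketch under assumed layer sizes (`d_in`, `d_out`, rank `r`, and `alpha` are hypothetical, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes; the paper's 7B model is not reproduced here.
d_in, d_out, r = 1024, 1024, 4  # r = adapter rank

W = rng.normal(size=(d_out, d_in))     # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, initialized to zero
alpha = 8.0                            # scaling hyperparameter

def forward(x):
    """LoRA-style forward pass: frozen W plus a scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

trainable = A.size + B.size
total = W.size + trainable
print(f"trainable fraction: {trainable / total:.4%}")
```

Because `B` starts at zero, the adapted layer initially matches the frozen base exactly; training then moves only the small `A` and `B` factors.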
The inverse-specification reward—where an LLM attempts to recover the original brief from generated slides—provides a holistic fidelity signal.
Reward design: inverse-specification component implemented and used as part of composite reward; claimed to measure fidelity via recovery accuracy.
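A minimal sketch of how such an inverse-specification reward could be computed; the `recover_brief` stub stands in for the LLM call, and token overlap stands in for the paper's recovery-accuracy measure (both are assumptions for illustration):

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def recover_brief(slides: str) -> str:
    """Hypothetical stub for the LLM call that attempts to reconstruct
    the original brief from the generated slides."""
    return "quarterly sales review for the emea region"

def inverse_spec_reward(brief: str, slides: str) -> float:
    """Score slide fidelity by how well the original brief is recovered."""
    return jaccard(brief, recover_brief(slides))

reward = inverse_spec_reward(
    brief="quarterly sales review for the emea region",
    slides="Slide 1: EMEA Q3 sales ...",
)
print(f"inverse-specification reward: {reward:.2f}")
```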
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
Cross-model comparison across six models on the 48-task benchmark, with analyses showing instruction adherence and tool-use compliance better predict agent performance than parameter count.
The proposed algorithm's performance is robust to heterogeneous populations in the synthetic experiments (i.e., it continues to find core alternatives under varying degrees of population heterogeneity).
Empirical robustness checks reported in the experiments where population heterogeneity is varied and performance (core-attainment frequency) is evaluated.
The authors compare their sampling algorithm against classical social-choice rules and LLM-based heuristics and report superior core-attainment frequency for their method.
Experimental comparisons described in the paper between the proposed algorithm and baseline methods (classical social-choice rules, LLM-based heuristics) on the synthetic dataset; results summarized in the experiments section.
On a synthetic text-preference dataset, the proposed algorithm reliably finds alternatives that lie in the proportional veto core.
Empirical experiments reported in the paper using a synthetic dataset of text preferences; evaluation metric reported as frequency (proportion) of runs where the returned alternative is in the proportional veto core.
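The evaluation metric presupposes a core-membership check. Under one common formalization of the proportional veto core (a coalition of s of n voters can veto up to ceil(s·m/n) − 1 of the m alternatives; the paper's exact definition may differ), a brute-force check for small instances looks like:

```python
from itertools import combinations
from math import ceil

def in_proportional_veto_core(x, rankings):
    """Check core membership: coalition S (|S| = s) blocks x if at least
    m - v(S) alternatives are unanimously preferred to x by S, where
    v(S) = ceil(s * m / n) - 1. Brute force over all coalitions."""
    n = len(rankings)
    m = len(rankings[0])
    alts = set(rankings[0])
    for s in range(1, n + 1):
        v = ceil(s * m / n) - 1
        for S in combinations(range(n), s):
            unanimously_preferred = {
                y for y in alts - {x}
                if all(rankings[i].index(y) < rankings[i].index(x) for i in S)
            }
            if len(unanimously_preferred) >= m - v:
                return False
    return True

# Three voters with identical rankings (best to worst): only 'a' survives.
rankings = [["a", "b", "c"]] * 3
print({x: in_proportional_veto_core(x, rankings) for x in "abc"})
```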
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.
Study methodology and rationale emphasize temporal grounding; authors recommend it as best practice based on the observed benefits in reducing retrospective contamination.
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.
Case selection rationale: the 2026 Middle East conflict was deliberately chosen because it occurred after the training cutoffs of the evaluated frontier models; dataset preserves contemporaneous queries and model outputs.
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).
Evaluation across 11 temporally defined nodes during the early 2026 Middle East conflict using 42 node-specific verifiable questions and 5 exploratory prompts; results assessed via verifiability checks and qualitative coding for strategic reasoning of outputs from contemporary frontier LLMs constrained to contemporaneous information.
Legible decision modes and recorded contest pathways improve verifiability and lower information asymmetries, aiding regulators and platforms in monitoring and reducing litigation/reputational risk.
Analytic claim in the implications section; argued conceptually and tied to proposed logging/audit tools; no empirical validation.
The pattern can reduce costly misallocations caused by LLM unpredictability by constraining policy options, improving overall allocation efficiency in expectation.
Theoretical argument in the paper tying constrained policy space to reduced variability and misallocation risk; no empirical testing or quantitative model provided.
The pattern is proposed to improve legibility, procedural legitimacy, and actionability relative to systems without these elements (framed as evaluation goals).
Evaluation agenda and proposed user-study metrics in the paper (legibility tests, perceived fairness surveys, contest effectiveness measures); no empirical results yet.
Bounded calibration with contestability avoids opaque silent defaults that mask value choices and avoids wide-open user-configurable value sliders that offload moral choice under stress.
Normative rationale and argumentation in the paper; compared qualitatively against two alternative design approaches; no empirical comparison.
Bounded calibration with contestability is a viable design pattern for LLM-enabled robots that must allocate scarce, real-time assistance among multiple people.
Conceptual/design proposal in the paper; illustrated with a concrete public-concourse robot vignette; no empirical deployment or sample data reported.
Modular strategy/execution architectures (like ESE) can materially improve the stability and efficiency of LLM-driven operational decision systems, increasing their attractiveness for deployment in retail, logistics, and supply-chain contexts.
Empirical improvements observed with ESE on RetailBench relative to monolithic baselines, coupled with analysis of deployment considerations and domain relevance discussed in the paper.
ESE improves operational stability and efficiency relative to baselines that do not separate strategy from execution.
Empirical comparisons reported in the experiments: eight contemporary LLMs evaluated on multiple RetailBench environments, with ESE compared against monolithic LLM agents and other baselines using metrics of operational stability (e.g., variance or frequency of catastrophic failures) and efficiency (e.g., cost/profit/fulfillment).
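The stability and efficiency metrics named above can be sketched as simple summaries over per-episode outcomes; the profit figures and failure threshold below are hypothetical illustrations, not RetailBench values:

```python
import statistics

def stability_metrics(episode_profits, failure_threshold=0.0):
    """Summarize operational stability for an agent's episodes: mean
    profit, profit dispersion, and rate of catastrophic episodes
    (profit below a threshold; the threshold is an assumption)."""
    return {
        "mean_profit": statistics.mean(episode_profits),
        "profit_stdev": statistics.stdev(episode_profits),
        "catastrophic_rate": sum(p < failure_threshold for p in episode_profits)
                             / len(episode_profits),
    }

# Hypothetical profits: a strategy/execution (ESE-style) agent vs a
# monolithic baseline prone to occasional catastrophic episodes.
ese = [12.0, 11.5, 12.3, 11.8, 12.1]
mono = [14.0, -3.0, 13.5, 2.0, -5.0]
print(stability_metrics(ese))
print(stability_metrics(mono))
```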
ESE enables interpretable and adaptive strategy updates intended to counteract error accumulation and environmental drift.
Design features of the strategy module (slower updates, interpretable strategy representation) and qualitative analysis in the paper linking these features to reduced error accumulation and strategy drift in experiments.
Policy implication: prioritize large-scale, targeted reskilling and lifelong learning programs to enable workforce adaptability and capture AI complementarity gains.
Policy recommendations derived from the paper's findings (association between AI adoption and skill shifts, heterogeneous sectoral impacts) and from a literature synthesis linking reskilling interventions to better labor outcomes; the recommendation is prescriptive rather than empirically tested within the study.
The paper provides empirical support for the complementarity hypothesis: AI tends to reconfigure jobs and create hybrid roles rather than eliminate employment wholesale.
Convergence of simulated sectoral employment patterns (some sectors showing net gains and hybrid-role growth), the strong correlation between AI adoption and skill shifts (r = 0.71), and corroborating studies from the literature synthesis emphasizing augmentation and hybridization mechanisms.
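The reported association is a standard Pearson correlation between sector-level adoption and skill-shift measures. A minimal sketch on hypothetical values (not the paper's data, which yielded r = 0.71):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sector-year values: AI adoption index vs skill-shift index.
adoption    = [0.10, 0.25, 0.30, 0.45, 0.60, 0.70]
skill_shift = [0.05, 0.20, 0.35, 0.30, 0.55, 0.65]
print(f"r = {pearson_r(adoption, skill_shift):.2f}")
```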
Institutional reskilling programs and governance frameworks markedly moderate labor-market outcomes: better frameworks correlate with more complementarities and lower net job loss.
Integration of literature-derived mechanisms with simulated empirical patterns; the paper reports correlation and moderation-style comparisons across simulated sector-year cases incorporating policy and institutional variables (described in the methods), supported by studies in the systematic review linking policy interventions to labor outcomes.
Healthcare and IT Services experienced net employment gains consistent with AI complementarity (augmented tasks and creation of new hybrid roles).
Simulated sectoral employment trends and net-change metrics for Healthcare and IT Services (2020–2024) presented in the paper, supported by literature synthesis examples showing human–AI complementarities in these sectors.
The largest rises in hybrid jobs occurred in IT Services and Healthcare.
Sectoral decomposition of hybrid job share trends in the simulated dataset across the seven industries (2020–2024) and supporting qualitative/quantitative findings from the literature synthesis focused on IT Services and Healthcare.
Hybrid human–AI jobs increased substantially across all seven analyzed sectors between 2020 and 2024.
Descriptive trend analysis of the simulated dataset's hybrid job share metric (fraction of roles reclassified as human–AI hybrid) for the seven industries over 2020–2024, combined with corroborating examples from the literature synthesis (selected ACM/IEEE/Springer studies 2020–2024).
Firms should pair strong-performing ensemble/deep models with explainability tools (e.g., feature-importance, SHAP) and fairness audits, and prefer pilot human-in-the-loop implementations to validate economic impacts and reduce operational risks.
Authors' practical recommendations based on empirical model performance, interpretability analyses, and noted limitations; presented as guidance rather than empirically validated interventions.
Variable-contribution analyses (feature importance / model explanation techniques) clarified which inputs drive predictions, making results actionable for HR decision-making.
The paper reports use of feature-importance and model-explanation methods to quantify variable contributions and interpretable outputs intended for HR practitioners.
Employee engagement/participation levels, learning agility (pace of acquiring new skills), tenure in current role, and perceived workload/manageability are consistently among the most important predictors of job performance in the datasets examined.
Feature-importance and model-explanation analyses (e.g., feature importance, SHAP-style approaches) applied across multiple publicly available workforce datasets produced consistently high importance scores for these variables.
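One SHAP-style way to obtain such importance scores is permutation importance: shuffle a feature's column and measure how much the model's error rises. A sketch on synthetic data (feature names, coefficients, and the linear model are hypothetical, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic workforce-style data (hypothetical, not the paper's datasets):
# columns: engagement, learning agility, tenure, perceived workload.
n = 500
X = rng.normal(size=(n, 4))
# Make engagement and learning agility the dominant drivers of performance.
y = (1.5 * X[:, 0] + 1.0 * X[:, 1] + 0.2 * X[:, 2] + 0.1 * X[:, 3]
     + rng.normal(scale=0.3, size=n))

# Fit a simple least-squares model as the stand-in predictor.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(Xm):
    return float(np.mean((Xm @ coef - y) ** 2))

baseline = mse(X)

def permutation_importance(j):
    """Importance of feature j: MSE increase when its column is shuffled."""
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return mse(Xp) - baseline

names = ["engagement", "learning_agility", "tenure", "workload"]
for j, name in enumerate(names):
    print(f"{name:16s} {permutation_importance(j):.3f}")
```

Features carrying more predictive weight show a larger error increase when permuted, which is the intuition behind the consistently high scores reported for engagement and learning agility.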
The models' superior performance hinges on their ability to capture complex, non-linear patterns in features (e.g., engagement, learning agility, tenure, workload perception).
Inference from comparative model performance: non-linear models (ensembles, DNNs) outperform linear baselines; feature engineering captured engagement dynamics and learning trends; variable-contribution analyses highlighted these feature types as influential.