Evidence (4793 claims)
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Productivity
The Flourishing–Justice–Autonomy (FJA) framework should guide alignment efforts, emphasizing (1) Flourishing (human well‑being and meaningful opportunities), (2) Justice (distributional fairness and protection of vulnerable groups), and (3) Autonomy (informed choice and user control).
Prescriptive proposal grounded in conceptual analysis and synthesis of ethical and technical literature; the paper defines and motivates the three principles as its core normative contribution.
The report issues seven policy recommendations grouped into three goals: (1) improve understanding of the emerging threat, (2) strengthen defenses, and (3) ensure responsible development and deployment.
Policy synthesis based on threat analysis and governance review (report-authored recommendations; descriptive).
The study's strengths include multimethod triangulation, a very large behavioral dataset (150 million interactions), and controlled simulation experiments informed by empirical observation.
Methods reported: mixed‑methods sequential design with (1) 6‑month lab ethnography (n = 23), (2) computational analysis of 150 million customer interactions, and (3) empirically grounded agent‑based simulation experiments.
The Algorithmic Canvas is an operational medium where segmentation, targeting, and positioning parameters co‑evolve through iterative human–AI collaboration.
Design and implementation described in the study; observation of Canvas‑mediated interactions during a 6‑month lab ethnography inside a Fortune 500 company (n = 23).
The autopoietic STP + Algorithmic Canvas approach is 44% more resilient to market shocks than traditional, process‑based STP (p < 0.01).
Agent‑based simulations and comparative analyses informed by empirical calibration; supported by large‑scale behavioral data (150 million customer interactions) and simulation experiments. Statistical test reported with p < 0.01. Exact number of simulation runs and full test details not specified in the summary.
Research priorities include empirically quantifying AI's effects on productivity, wages, inequality, and environmental costs; developing standardized sustainability and governance metrics; and evaluating regulatory impacts on innovation and welfare.
Stated research agenda based on gaps identified in the narrative review; identifies directions for future empirical work rather than presenting new empirical findings.
AI has progressed from symbolic systems to data-driven, generative architectures and large-scale computational infrastructures, becoming a foundational technology across sectors.
Narrative synthesis of historical and technical literature across AI research and innovation studies; qualitative tracing of architectural shifts (symbolic → statistical → deep learning/generative models) and increased deployment across industries. No original empirical measurement or sample size reported in this paper.
The main results are robust to inclusion of controls and a range of heterogeneity and moderation checks, supporting that findings are not driven by simple time trends or obvious confounders.
Reported robustness checks in the staggered-DID framework (control variables, alternative specifications, subgroup tests) and discussion of parallel-trends assumption.
Implementation of urban green data center pilot policies leads to measurable improvements in firms' energy utilization efficiency.
Staggered-adoption difference-in-differences (DID) using an unbalanced firm–year panel of Chinese A-share listed firms linked to prefecture-level cities (2012–2024); treatment is timing/location of urban green data center pilot designation; results reported as statistically significant and robust to controls and alternative specifications.
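A staggered-adoption DID of this kind is commonly estimated with a two-way fixed-effects specification; the sketch below is a generic form with illustrative symbols, not the paper's own notation:

```latex
\mathrm{EE}_{ict} = \beta\,\mathrm{Pilot}_{ct} + \gamma^{\top} X_{ict} + \mu_i + \lambda_t + \varepsilon_{ict}
```

Here \(\mathrm{EE}_{ict}\) is energy-utilization efficiency of firm \(i\) in city \(c\) and year \(t\), \(\mathrm{Pilot}_{ct}\) switches to 1 once city \(c\) receives pilot designation, \(X_{ict}\) are controls, and \(\mu_i\), \(\lambda_t\) are firm and year fixed effects. Note that under staggered adoption, plain TWFE can be biased when treatment effects are heterogeneous, which is one reason such papers report the alternative specifications mentioned above.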
Policy recommendations include standards on explainability, audit trails, certification for finance/tax AI systems, stronger data governance, and public–private coordination to update regulatory guidance.
Paper's policy and governance recommendations drawn from case findings and literature synthesis; prescriptive content rather than evaluated interventions.
Deployments should build governance, explainability, and auditability into systems and start with pilots on high-volume, well-structured tasks before scaling.
Paper recommendations based on case experience and analytic framing; advocated strategy rather than empirically validated at scale within the paper.
To mitigate risks and realize benefits, AI systems in finance/tax should combine AI with human-in-the-loop controls and clear escalation paths.
Prescriptive recommendation grounded in case lessons and literature on safe AI deployment; presented as a best-practice guideline rather than tested intervention.
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
Explicit technical characteristics section and case descriptions in the paper identify these components as core to implementations.
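The retrieval-augmented generation building block mentioned above can be sketched as a single retrieve-then-prompt step. This is a minimal illustration, not the deployments' actual pipeline; `embed`, `top_k`, and the prompt template are all hypothetical:

```python
def rag_prompt(query, documents, embed, top_k=2):
    """Minimal RAG step (illustrative sketch): rank knowledge-base
    documents by similarity to the query, then splice the best matches
    into the prompt passed to the LLM."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    q = embed(query)
    # Most-similar documents first; a real system would use a vector index.
    ranked = sorted(documents, key=lambda d: dot(q, embed(d)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

In the finance/tax setting, `documents` would be OCR-extracted invoice or regulation snippets and `embed` a learned text encoder.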
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples in Xiaomi and Deloitte illustrating these accounting applications; supported by literature review of technical capabilities.
Realizing these AI-driven gains in Vietnam requires legal and institutional redesign.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
An asynchronous sliding-window engine treats the GPU as a sliding compute window and overlaps GPU computation with CPU-side parameter updates and multi-tier I/O to hide data movement and synchronization overheads.
System design and implementation described in the paper: an asynchronous runtime that coordinates GPU kernels, CPU updates, and multi-tier I/O. This is a design/implementation claim rather than a measured outcome; the summary links the design to performance improvements.
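The overlap idea behind such a sliding-window runtime can be sketched with a three-stage pipeline: an I/O thread prefetches chunks, the main thread (standing in for the GPU) computes, and a CPU thread applies updates concurrently. This is a hedged sketch of the general technique; the paper's engine presumably uses CUDA streams and asynchronous I/O rather than Python threads:

```python
import threading
import queue

def sliding_window_pipeline(chunks, compute, update, depth=2):
    """Overlap compute on one chunk with prefetch and CPU-side updates
    on others, in the spirit of a sliding compute window."""
    inbox = queue.Queue(maxsize=depth)   # prefetched chunks awaiting compute
    outbox = queue.Queue(maxsize=depth)  # results awaiting CPU-side update
    results = []
    _STOP = object()

    def prefetch():  # stage 1: multi-tier I/O stand-in
        for c in chunks:
            inbox.put(c)
        inbox.put(_STOP)

    def updater():   # stage 3: CPU-side parameter update stand-in
        while True:
            r = outbox.get()
            if r is _STOP:
                break
            results.append(update(r))

    t_io = threading.Thread(target=prefetch)
    t_cpu = threading.Thread(target=updater)
    t_io.start(); t_cpu.start()
    while True:      # stage 2: "GPU" compute on the current window
        c = inbox.get()
        if c is _STOP:
            break
        outbox.put(compute(c))  # next chunk's compute overlaps the update
    outbox.put(_STOP)
    t_io.join(); t_cpu.join()
    return results
```

The bounded queues are what hide data movement: as long as `depth` chunks are in flight, compute never waits on I/O or updates except at the tails.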
The A-ToM mechanism operates by estimating a partner's likely ToM order from interaction history and using that estimate to predict the partner's next action, which then informs the agent's policy choices.
Method description and implementation details provided in the paper: estimator over ToM orders based on past interactions + conditional action prediction feeding into decision-making; validated in the reported experiments.
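The estimate-then-predict loop described above can be sketched as a Bayesian update over candidate ToM orders. This is a hypothetical illustration of the general mechanism, since the summary does not specify the paper's exact estimator; `order_models` maps each candidate order to an action-prediction model:

```python
def update_order_beliefs(beliefs, order_models, history, observed_action):
    """Bayesian update over candidate ToM orders.
    order_models[k](history) -> {action: probability} assuming the
    partner reasons at ToM order k."""
    posterior = {}
    for k, prior in beliefs.items():
        likelihood = order_models[k](history).get(observed_action, 1e-9)
        posterior[k] = prior * likelihood
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

def predict_partner_action(beliefs, order_models, history):
    """Marginalize per-order action predictions over the current belief;
    the resulting prediction then conditions the agent's own policy."""
    pred = {}
    for k, p in beliefs.items():
        for a, q in order_models[k](history).items():
            pred[a] = pred.get(a, 0.0) + p * q
    return max(pred, key=pred.get)
```

Each observed partner action sharpens the belief over orders, so repeated interaction lets the agent adapt to fixed-order partners.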
Empirical evaluation was performed across four coordination environments: a repeated matrix game, two grid navigation tasks, and an Overcooked task.
Methods section describes these four benchmark environments used for all reported comparisons between fixed-order agents and A-ToM agents; evaluation metrics were joint payoffs and task-specific success measures.
Operating as a pre-processor (rather than modifying the generator) enables modular integration with existing LLMs and provides an explicit decision point for clarification.
Novelty/architecture claim in the paper explaining that C.A.P. runs before generation and therefore can be plugged into existing LLM pipelines; described design rationale (no empirical integration study presented).
C.A.P. verifies semantic alignment between the current expanded prompt and the weighted history and triggers a structured clarification protocol when similarity is below a threshold.
Component-level description: alignment verification via semantic embeddings (cosine similarity) or learned classifiers and threshold-based decision branching to initiate clarification; described protocol templates (no empirical validation provided).
C.A.P. retrieves dialogue history using a time-weighted decay so recent context is prioritized (approximating human conversational focus).
Design description of a 'time-weighted context retrieval' component; authors propose temporal decay functions (e.g., exponential decay, half-life parameter) applied to dialogue-turn embeddings or metadata (no empirical results reported).
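The two C.A.P. components above (time-weighted retrieval and threshold-triggered clarification) compose naturally; a minimal sketch under stated assumptions follows. The exponential decay, half-life value, cosine threshold, and centroid aggregation are all illustrative choices, since the paper reports no empirical settings:

```python
import math

def decay_weights(num_turns, half_life=4.0):
    """Exponential decay by recency: the latest turn gets weight 1.0,
    a turn `half_life` turns older gets weight 0.5."""
    return [0.5 ** ((num_turns - 1 - i) / half_life) for i in range(num_turns)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def needs_clarification(prompt_vec, turn_vecs, half_life=4.0, threshold=0.5):
    """Alignment check: compare the expanded prompt to the decay-weighted
    centroid of dialogue-turn embeddings; below-threshold similarity
    triggers the structured clarification protocol."""
    w = decay_weights(len(turn_vecs), half_life)
    centroid = [sum(wi * v[d] for wi, v in zip(w, turn_vecs)) / sum(w)
                for d in range(len(prompt_vec))]
    return cosine(prompt_vec, centroid) < threshold
```

Because the check runs pre-generation, a `True` result can branch to a clarification template instead of calling the generator at all.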
C.A.P. is a pre-generation module that expands user utterances to recover omitted premises and implications.
Architecture and methods description in the paper specifying a 'semantic expansion' component; suggested implementations via knowledge-bases or small LLM prompts to generate premises, paraphrases, and implications (no empirical evaluation reported).
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from formal properties of AFs and representation; no empirical user studies but relies on known formal semantics.
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
The development artifacts are fully transparent and reproducible: the repository includes an archive of 229 human prompts and a git history with 213 commits.
Paper reports counts of prompts (229) and git commits (213) and states these archives are public; these are concrete repository metrics (n=1 development repository).
The Lean kernel provided full machine verification of all formalized statements in the development.
Paper reports 'Full verification by the Lean kernel' for the Lean 4 development; supported by availability of the Lean 4 repository and verified theorem artifacts (n=1 project).
A specialized prover (Aristotle) automatically closed 111 lemmas during the development.
Quantitative verification metric reported in the paper: 111 lemmas automatically closed by Aristotle; claim tied to the Lean development and prover logs (single project count).
The AI-assisted pipeline combined an AI reasoning model (Gemini DeepThink) to generate the proof, an agentic coding tool (Claude Code) to translate the proof to Lean, a specialized automated prover (Aristotle) that closed 111 lemmas, and the Lean kernel to fully verify the result.
Project workflow description and verification metrics in the paper; reported counts and named components (Gemini DeepThink, Claude Code, Aristotle, Lean kernel); repository and logs purportedly document toolchain usage (n=1 project; 111 lemmas closed by Aristotle reported).
A complete formalization in Lean 4 of the equilibrium characterization for the Vlasov–Maxwell–Landau (VML) system was produced through an AI-assisted pipeline.
Single-project artifact: a Lean 4 development containing formal statements, proof scripts and verified theorems reported by the paper (n=1 project); authors report full machine verification by the Lean kernel and provide the repository as public evidence.
iDaVIE's modular architecture supports extensibility (planned features include subcube loading, advanced render modes, video scripting, and collaborative VR sessions).
Paper describes modular architecture and lists planned/possible future features; this is a software design claim rather than an empirical result.
Because iDaVIE is open-source and extensible, software licensing costs are low and marginal adoption costs fall over time.
Paper states iDaVIE is open-source and designed for community-driven enhancements; economic claim based on general properties of open-source software rather than empirical cost accounting.
iDaVIE includes interaction features such as selection, cropping/subcube tools, catalogue overlays, and export back to existing pipelines.
Feature list in paper describing selection, cropping, overlays, in-VR metrics and export functionality; demonstrated integration to export edited masks/subcubes.
Streaming and downsampling pipelines implemented as Unity plug-ins make large volumes interactively viewable in VR while preserving needed detail for inspection.
Technical description of custom Unity plug-ins for streaming/downsampling and on-the-fly statistics; tested on HI cubes (telescopes listed) per the paper.
iDaVIE (v1.0) is a working VR software suite that lets astronomers import, render, inspect, and interactively edit very large 3D data cubes in real time.
Described implementation of iDaVIE v1.0 built on Unity/SteamVR with custom plug-ins for parsing/downsampling and real-time rendering; tested on large 3D spectral (HI) cubes from radio telescopes (MeerKAT, ASKAP, APERTIF) as reported in the paper.
Personalized LLM coaching produced a statistically significant increase in alignment with the normative empathic taxonomy relative to both the video-based non-personalized feedback and control arms.
Pre-registered randomized experiment with three arms; pre-registered analysis reported statistically significant differences favoring personalized coaching on the primary alignment outcome.
A brief, personalized coaching intervention delivered by a large language model significantly improves participants' alignment with normative, idiomatic empathic communication patterns.
Pre-registered randomized controlled trial with three arms (personalized LLM coaching, video-based non-personalized feedback, control). Outcome measured as alignment to a data-driven normative taxonomy via coding/automated measures. Overall corpus and sample context: 968 participants, 2,904 conversations, 33,938 messages used in the study.
HindSight reveals a large, real difference between systems that is missed by LLM-based judging (i.e., HindSight detects the retrieval-augmentation advantage while LLM-judged metrics do not).
Combined empirical results: HindSight shows a 2.5× advantage (p < 0.001) for retrieval augmentation while LLM-as-Judge reports no significant difference (p = 0.584).
Experiments in the paper cover 10 AI/ML research topics and use a 30-month forward evaluation window.
Experimental setup reported in the paper: scope explicitly stated as 10 AI/ML topics and a 30-month forward window after cutoff T.
Generated ideas can be algorithmically compared to future publications and matched items can be assigned scores reflecting downstream impact (citation counts and venue acceptance).
Method section: description of algorithmic matching procedure and scoring rules that use citation counts and venue acceptance as impact proxies.
A retrieval-augmented idea generator produces 2.5× higher-scoring ideas than a vanilla generator according to HindSight (p < 0.001).
Empirical comparison reported in the paper across the specified experiments (10 AI/ML topics, time-split at T, 30-month forward window); statistical test reporting a 2.5× difference with p < 0.001.
HindSight is a time-split, retrospective evaluation that (1) restricts idea generation to pre-cutoff literature (time T), (2) compares generated ideas to papers published in the following 30 months, and (3) scores matches by downstream impact (citation counts and venue acceptance).
Method described in paper: time-split protocol with a temporal cutoff T, a 30-month forward window, algorithmic matching of generated ideas to later publications, and scoring based on downstream impact metrics (citations and venue acceptance).
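The match-and-score step of such a time-split protocol can be sketched as follows. The matching rule, threshold, log-citation scoring, and venue bonus are illustrative stand-ins, not the paper's exact procedure:

```python
import math

def hindsight_score(idea_vecs, papers, similarity, match_threshold=0.8,
                    venue_bonus=1.0):
    """Retrospective scoring sketch: match each generated idea to its
    most similar post-cutoff paper; matched ideas earn log-citations
    plus a bonus if the paper was accepted at a venue."""
    scores = []
    for vec in idea_vecs:
        best = max(papers, key=lambda p: similarity(vec, p["vec"]))
        if similarity(vec, best["vec"]) < match_threshold:
            scores.append(0.0)  # no sufficiently similar future paper
            continue
        s = math.log1p(best["citations"])  # dampen citation-count skew
        if best["venue_accepted"]:
            s += venue_bonus
        scores.append(s)
    return scores
```

Restricting `papers` to those published within 30 months after the cutoff T, and the generator's inputs to pre-T literature, gives the leakage-free forward evaluation the method describes.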
The paper introduces a Multi-Object Decoder (MOD) that extends SAM 3D to jointly reconstruct multiple objects from a single image, targeting physically plausible, non-penetrating object configurations and realistic contacts.
Method section: MOD is described as an extension of the single-object SAM 3D architecture to jointly decode multiple object shapes and poses from a monocular image; the method explicitly aims to reduce inter-object penetration and model contacts.
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
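Pass@k is usually computed with the standard unbiased combinatorial estimator (the paper's exact computation is not stated in the summary, so this is an assumption): for a task with n sampled attempts of which c pass, pass@k = 1 − C(n−c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimator for one task: probability that at least
    one of k attempts drawn without replacement from n samples
    (c of which passed) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over tasks gives the Pass@1 and Pass@128 numbers compared above.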
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same base 7B model without RL fine-tuning.
Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.