Evidence (5539 claims)
Adoption (5539 claims)
Productivity (4793 claims)
Governance (4333 claims)
Human-AI Collaboration (3326 claims)
Labor Markets (2657 claims)
Innovation (2510 claims)
Org Design (2469 claims)
Skills & Training (2017 claims)
Inequality (1378 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Adoption
Technical building blocks leveraged in these deployments include large language models (LLMs), OCR plus structured information extraction, retrieval-augmented generation (RAG) and knowledge bases, and process automation/RPA.
The paper's technical-characteristics section and case descriptions explicitly identify these components as core to the implementations.
Generative AI is used for risk control and audit functions, including real-time monitoring, fraud detection, KYC/AML screening, and automated exception reporting.
Reported use-cases in the two case organizations and corroborating industry reports discussed in the literature review portion of the paper.
For tax declaration, generative AI enables extraction of tax-relevant facts from invoices and contracts, drafting of tax returns, compliance checks, and scenario simulations.
Case examples and literature synthesis describing OCR + information extraction and LLM-assisted drafting workflows used in practice.
Generative AI is applied to fund management tasks such as cashflow forecasting, anomaly detection, and automated workflows for payments and collections.
Case descriptions and technical mapping in the paper showing implementations at the sharing center and professional services firm level.
Accounting automation use-cases include automated bookkeeping, reconciliations, journal entry suggestion, and error detection using LLMs and document understanding.
Detailed scope mapping and case examples at Xiaomi and Deloitte illustrating these accounting applications; supported by a literature review of technical capabilities.
Realizing AI-driven gains in Vietnam requires legal and institutional redesign.
Close reading of Vietnam's constitutional provisions, administrative statutes, procedural rules and judicial doctrine (doctrinal legal analysis) combined with comparative lessons from other jurisdictions; no quantitative data.
A supplemental theological differentiator probe achieved perfect rank-order agreement between the two ceiling judges (Spearman rs = 1.00), supporting judge reliability for the ceiling probe.
Reported Spearman rank correlation rs = 1.00 between Gemini Pro and Copilot Pro on the theological differentiator probe used as a reliability check.
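As a reminder of what a Spearman coefficient of 1.00 means, the sketch below computes it as the Pearson correlation of average ranks. The judge scores are hypothetical placeholders, not data from the paper; any two score lists that order the items identically give rs = 1.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores from two judges on four items (illustration only).
judge_a = [7.0, 5.5, 9.0, 3.0]
judge_b = [6.0, 4.0, 8.5, 2.0]
rho_same = spearman(judge_a, judge_b)            # identical ordering
rho_reversed = spearman(judge_a, judge_b[::-1])  # a different ordering
print(rho_same, rho_reversed)
```

Perfect rank agreement says the judges order the items identically; it does not say their absolute scores match.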
Rigorous research priorities include randomized controlled trials with long-run follow-ups, cost-effectiveness studies, structural adoption models, and validated metrics for feedback quality and learning durability.
Actionable research recommendations produced by the 50-scholar interdisciplinary meeting; prescriptive synthesis rather than empirical results.
CABP (Context-Aware Broker Protocol) extends JSON-RPC with identity-scoped request routing via a six-stage broker pipeline to ensure correct identity and policy propagation.
Design and protocol specification included in the paper; formal description and broker-pipeline semantics documented as a deliverable.
Observations span multiple agent platforms (Moltbook, The Colony, 4claw) with more than 167,000 agents interacting as peers.
Author-reported coverage from naturalistic observations across the named platforms during the one-month observation window; count reported as ≈167k agents.
An asynchronous sliding-window engine treats the GPU as a sliding compute window and overlaps GPU computation with CPU-side parameter updates and multi-tier I/O to hide data movement and synchronization overheads.
System design and implementation described in the paper: an asynchronous runtime that coordinates GPU kernels, CPU updates, and multi-tier I/O. This is a design/implementation claim rather than a measured outcome; the summary links the design to performance improvements.
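The overlap idea can be pictured as a three-stage software pipeline: while one layer is being computed, the next is prefetched and the previous layer's update is applied. The sketch below is a schematic in plain Python threads, not the paper's runtime; `load_layer`, `compute_grad`, and `cpu_update` are hypothetical stand-ins for multi-tier I/O, a GPU kernel, and the CPU-side optimizer step.

```python
from concurrent.futures import ThreadPoolExecutor


def load_layer(store, i):
    """Stand-in for multi-tier I/O that brings layer i into the compute window."""
    return list(store[i])


def compute_grad(params):
    """Stand-in for a GPU kernel: gradient of sum(p**2) is 2*p."""
    return [2.0 * p for p in params]


def cpu_update(store, i, grads, lr=0.1):
    """Stand-in for the CPU-side parameter update of layer i."""
    store[i] = [p - lr * g for p, g in zip(store[i], grads)]


def sliding_window_step(store):
    """Process layers in order, overlapping prefetch, compute, and update."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        next_load = pool.submit(load_layer, store, 0)
        pending_update = None
        for i in range(len(store)):
            params = next_load.result()
            if i + 1 < len(store):
                # Prefetch the next layer while this one is being computed.
                next_load = pool.submit(load_layer, store, i + 1)
            grads = compute_grad(params)
            if pending_update is not None:
                pending_update.result()  # previous update ran during compute
            pending_update = pool.submit(cpu_update, store, i, grads)
        if pending_update is not None:
            pending_update.result()


layers = [[1.0, 2.0], [3.0]]  # hypothetical two-layer parameter store
sliding_window_step(layers)
print(layers)
```

Python threads do not give real compute parallelism here; the point is only the ordering of submit and wait calls that lets data movement hide behind computation.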
The A-ToM mechanism estimates a partner's likely ToM order from interaction history and uses that estimate to predict the partner's next action, which then informs the agent's policy choices.
Method description and implementation details provided in the paper: estimator over ToM orders based on past interactions + conditional action prediction feeding into decision-making; validated in the reported experiments.
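A minimal sketch of this estimate-then-predict loop, under stated assumptions: the two partner models, the noise level `eps`, and the matching-action coordination game are hypothetical illustrations, not the paper's implementation. The agent keeps a belief over candidate ToM orders, updates it by Bayes' rule after each observed partner action, and acts on the prediction of the most probable order.

```python
def predict_order0(my_hist, partner_hist):
    """Hypothetical zero-order partner: repeats its own previous action."""
    return partner_hist[-1]


def predict_order1(my_hist, partner_hist):
    """Hypothetical first-order partner: matches the agent's previous action."""
    return my_hist[-1]


MODELS = {0: predict_order0, 1: predict_order1}


def update_belief(belief, my_hist, partner_hist, observed, eps=0.1):
    """Bayes update: orders whose model predicted the observed action gain weight."""
    posterior = {}
    for order, model in MODELS.items():
        hit = model(my_hist, partner_hist) == observed
        posterior[order] = belief[order] * ((1 - eps) if hit else eps)
    total = sum(posterior.values())
    return {order: w / total for order, w in posterior.items()}


def choose_action(belief, my_hist, partner_hist):
    """Predict the partner's action under the most likely order and match it."""
    best_order = max(belief, key=belief.get)
    return MODELS[best_order](my_hist, partner_hist)


# Simulate against a partner that is truly first-order.
belief = {0: 0.5, 1: 0.5}
my_hist, partner_hist = [0], [1]
for _ in range(5):
    my_action = choose_action(belief, my_hist, partner_hist)
    partner_action = my_hist[-1]  # ground-truth first-order behaviour
    belief = update_belief(belief, my_hist, partner_hist, partner_action)
    my_hist.append(my_action)
    partner_hist.append(partner_action)
print(belief)
```

Rounds in which both models predict the same action leave the belief unchanged, so the estimate only moves when the candidate orders disagree.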
Empirical evaluation was performed across four coordination environments: a repeated matrix game, two grid navigation tasks, and an Overcooked task.
Methods section describes these four benchmark environments used for all reported comparisons between fixed-order agents and A-ToM agents; evaluation metrics were joint payoffs and task-specific success measures.
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.
Argument from formal properties of AFs and representation; no empirical user studies but relies on known formal semantics.
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).
Established literature on formal argumentation (e.g., Dung-style AFs) and the paper's conceptual description; no new empirical data reported.
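To illustrate what "machine-checkable" means for a Dung-style framework, the sketch below computes the grounded extension by iterating the characteristic function from the empty set. The three-argument example is a hypothetical illustration, not taken from the paper.

```python
def grounded_extension(arguments, attacks):
    """Least fixed point of F(S) = {a : every attacker of a is attacked by S}."""
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    extension = set()
    while True:
        defended = {
            a for a in arguments
            if all(any((d, b) in attacks for d in extension) for b in attackers[a])
        }
        if defended == extension:
            return extension
        extension = defended


# Hypothetical framework: a attacks b, b attacks c.
args = {"a", "b", "c"}
atts = {("a", "b"), ("b", "c")}
accepted = grounded_extension(args, atts)
print(sorted(accepted))
```

Here `a` is unattacked, so it is accepted; `a` defeats `b`, which reinstates `c`. Each acceptance can be traced back to a specific attack relation, which is the inspectability property the claim refers to.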
The development artifacts are fully transparent and reproducible: the repository includes an archive of 229 human prompts and a git history with 213 commits.
Paper reports counts of prompts (229) and git commits (213) and states these archives are public; these are concrete repository metrics (n=1 development repository).
The Lean kernel provided full machine verification of all formalized statements in the development.
Paper reports 'Full verification by the Lean kernel' for the Lean 4 development; supported by availability of the Lean 4 repository and verified theorem artifacts (n=1 project).
A specialized prover (Aristotle) automatically closed 111 lemmas during the development.
Quantitative verification metric reported in the paper: 111 lemmas automatically closed by Aristotle; claim tied to the Lean development and prover logs (single project count).
The AI-assisted pipeline combined an AI reasoning model (Gemini DeepThink) to generate the proof, an agentic coding tool (Claude Code) to translate the proof to Lean, a specialized automated prover (Aristotle) that closed 111 lemmas, and the Lean kernel to fully verify the result.
Project workflow description and verification metrics in the paper; reported counts and named components (Gemini DeepThink, Claude Code, Aristotle, Lean kernel); repository and logs purportedly document toolchain usage (n=1 project; 111 lemmas closed by Aristotle reported).
A complete formalization in Lean 4 of the equilibrium characterization for the Vlasov–Maxwell–Landau (VML) system was produced through an AI-assisted pipeline.
Single-project artifact: a Lean 4 development containing formal statements, proof scripts and verified theorems reported by the paper (n=1 project); authors report full machine verification by the Lean kernel and provide the repository as public evidence.
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.
Paper's evaluation section lists metrics used: win-rate, completion time, strategic robustness; describes how they are computed and used to compare agents.
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.
Paper describes and releases an open-source harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools meant for reproducibility.
Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.
Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.
Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.
Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).
Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.
The Battling Track dataset contains more than 20 million recorded battle trajectories.
Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.
Paper describes the design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; the benchmark implementation is built on a Pokemon simulator, and task specifications are described.
iDaVIE's modular architecture supports extensibility (planned features include subcube loading, advanced render modes, video scripting, and collaborative VR sessions).
Paper describes modular architecture and lists planned/possible future features; this is a software design claim rather than an empirical result.
Because iDaVIE is open-source and extensible, software licensing costs are low and marginal adoption costs fall over time.
Paper states iDaVIE is open-source and designed for community-driven enhancements; economic claim based on general properties of open-source software rather than empirical cost accounting.
iDaVIE includes interaction features such as selection, cropping/subcube tools, catalogue overlays, and export back to existing pipelines.
Feature list in paper describing selection, cropping, overlays, in-VR metrics and export functionality; demonstrated integration to export edited masks/subcubes.
Streaming and downsampling pipelines implemented as Unity plug-ins make large volumes interactively viewable in VR while preserving needed detail for inspection.
Technical description of custom Unity plug-ins for streaming/downsampling and on-the-fly statistics; tested on HI cubes (telescopes listed) per the paper.
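The summary does not say which downsampling scheme the plug-ins use. As one plausible illustration, the sketch below block-averages a 3D cube by an integer factor; the function name and the averaging choice are assumptions (a maximum over each block would be an equally reasonable choice for preserving faint peaks), and the real implementation is a native Unity plug-in rather than Python.

```python
def downsample_cube(cube, factor):
    """Block-average a cube indexed as cube[z][y][x] by an integer factor."""
    nz, ny, nx = len(cube), len(cube[0]), len(cube[0][0])
    out = []
    for z0 in range(0, nz, factor):
        plane = []
        for y0 in range(0, ny, factor):
            row = []
            for x0 in range(0, nx, factor):
                # Edge blocks may be smaller than factor**3 voxels.
                block = [
                    cube[z][y][x]
                    for z in range(z0, min(z0 + factor, nz))
                    for y in range(y0, min(y0 + factor, ny))
                    for x in range(x0, min(x0 + factor, nx))
                ]
                row.append(sum(block) / len(block))
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical 2x2x2 cube holding the values 0..7.
small_cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
coarse = downsample_cube(small_cube, 2)
print(coarse)
```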
iDaVIE (v1.0) is a working VR software suite that lets astronomers import, render, inspect, and interactively edit very large 3D data cubes in real time.
Described implementation of iDaVIE v1.0 built on Unity/SteamVR with custom plug-ins for parsing/downsampling and real-time rendering; tested on large 3D spectral (HI) cubes from radio telescopes (MeerKAT, ASKAP, APERTIF) as reported in the paper.
Personalized LLM coaching produced a statistically significant increase in alignment with the normative empathic taxonomy relative to both the video-based non-personalized feedback and control arms.
Pre-registered randomized experiment with three arms; pre-registered analysis reported statistically significant differences favoring personalized coaching on the primary alignment outcome.
A brief, personalized coaching intervention delivered by a large language model significantly improves participants' alignment with normative, idiomatic empathic communication patterns.
Pre-registered randomized controlled trial with three arms (personalized LLM coaching, video-based non-personalized feedback, control). Outcome measured as alignment to a data-driven normative taxonomy via coding/automated measures. Overall corpus and sample context: 968 participants, 2,904 conversations, 33,938 messages used in the study.
HindSight reveals a large, real difference between systems that is missed by LLM-based judging (i.e., HindSight detects the retrieval-augmentation advantage while LLM-judged metrics do not).
Combined empirical results: HindSight shows a 2.5× advantage (p < 0.001) for retrieval augmentation while LLM-as-Judge reports no significant difference (p = 0.584).
Experiments in the paper cover 10 AI/ML research topics and use a 30-month forward evaluation window.
Experimental setup reported in the paper: scope explicitly stated as 10 AI/ML topics and a 30-month forward window after cutoff T.
Generated ideas can be algorithmically compared to future publications and matched items can be assigned scores reflecting downstream impact (citation counts and venue acceptance).
Method section: description of algorithmic matching procedure and scoring rules that use citation counts and venue acceptance as impact proxies.
A retrieval-augmented idea generator produces 2.5× higher-scoring ideas than a vanilla generator according to HindSight (p < 0.001).
Empirical comparison reported in the paper across the specified experiments (10 AI/ML topics, time-split at T, 30-month forward window); statistical test reporting a 2.5× difference with p < 0.001.
HindSight is a time-split, retrospective evaluation that (1) restricts idea generation to pre-cutoff literature (time T), (2) compares generated ideas to papers published in the following 30 months, and (3) scores matches by downstream impact (citation counts and venue acceptance).
Method described in paper: time-split protocol with a temporal cutoff T, a 30-month forward window, algorithmic matching of generated ideas to later publications, and scoring based on downstream impact metrics (citations and venue acceptance).
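The three steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the token-overlap matcher (`jaccard`), the match threshold, and the impact formula (log citations plus an acceptance bonus) are hypothetical stand-ins, since the summary does not specify the matching algorithm or scoring weights. The window check is month-granular.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Paper:
    title: str
    published: date
    citations: int
    accepted: bool  # accepted at a peer-reviewed venue


def in_forward_window(published, cutoff, window_months=30):
    """True if the paper appeared after the cutoff T and within the window."""
    months_after = (published.year - cutoff.year) * 12 + (published.month - cutoff.month)
    return published > cutoff and months_after <= window_months


def jaccard(a, b):
    """Hypothetical matcher: token-set overlap between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0


def hindsight_score(idea, papers, cutoff, threshold=0.5):
    """Score an idea by the best-impact later paper it matches (0.0 if none)."""
    best = 0.0
    for paper in papers:
        if not in_forward_window(paper.published, cutoff):
            continue
        if jaccard(idea, paper.title) >= threshold:
            impact = math.log1p(paper.citations) + (1.0 if paper.accepted else 0.0)
            best = max(best, impact)
    return best


# Hypothetical corpus and ideas (illustration only).
cutoff_T = date(2022, 1, 1)
corpus = [
    Paper("sparse attention for long documents", date(2023, 3, 1), 40, True),
    Paper("sparse attention for long documents", date(2021, 6, 1), 900, True),
    Paper("quantum error correction codes", date(2024, 8, 1), 10, False),
]
matched = hindsight_score("sparse attention for long documents", corpus, cutoff_T)
too_late = hindsight_score("quantum error correction codes", corpus, cutoff_T)
print(matched, too_late)
```

The pre-cutoff paper is ignored even though it matches, and the third paper falls outside the 30-month window, so only genuinely forward-looking matches earn credit.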
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
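For readers unfamiliar with Pass@k, the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts are sampled per task and c of them succeed. Whether the paper uses this exact estimator is an assumption; the example numbers are hypothetical. An "absolute" improvement of 14% means a difference of 14 percentage points between two such scores, not a 14% relative change.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n attempts succeeds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 attempts, 3 successes.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
p10 = pass_at_k(10, 3, 10)
print(p1, p5, p10)

# Absolute versus relative improvement between two hypothetical scores.
baseline, improved = 0.60, 0.74
absolute_gain = improved - baseline               # 14 percentage points
relative_gain = (improved - baseline) / baseline  # about 23% relative
```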
The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same base 7B model without RL fine-tuning.
Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.
Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality.
Empirical evaluation on 48 diverse business briefs comparing six models; reported relative quality score of tuned Qwen2.5-Coder-7B = 91.2% of Claude Opus 4.6.
Managing captures, traces, and replay sessions from a unified single design database ensures consistency across replay targets and sessions.
Method description emphasizes a single design database coordinating captures and replays across simulation and emulation for the demonstrator system. (Operational claim demonstrated in the implementation; no metrics on error reduction provided.)
The captured traces can be deterministically replayed across different execution targets (software/hardware simulation and hardware emulation), reducing cross-platform setup complexity and discrepancies.
The same captured waveforms/traces were replayed on both simulation and emulation environments for the ODIN demonstrator; cross-target replay was part of the described method. (Demonstrated on the single reported system; no broad cross-toolchain study provided.)
Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.
Study design enforced node-specific contemporaneous evidence constraints for each of the 11 nodes; methodological rationale and comparison to unconstrained settings described as reducing retrospective information contamination.
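In code, the constraint amounts to filtering the evidence pool by each node's date before anything reaches the model. The sketch below is a minimal illustration; the record fields and sample documents are hypothetical, not the study's data.

```python
from datetime import date


def contemporaneous(evidence, node_date):
    """Keep only items that were already public on the node's date."""
    return [item for item in evidence if item["published"] <= node_date]


# Hypothetical evidence pool (illustration only).
evidence_pool = [
    {"id": "press-release", "published": date(2020, 3, 1)},
    {"id": "regulator-filing", "published": date(2020, 6, 15)},
    {"id": "retrospective-analysis", "published": date(2022, 1, 10)},
]

node_date = date(2020, 7, 1)
visible = contemporaneous(evidence_pool, node_date)
print([item["id"] for item in visible])
```

The retrospective analysis is excluded because it postdates the node, which is exactly the kind of hindsight information the design aims to keep out.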
BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization.
Comparative experiments against rotation-based PTQ techniques and other existing PTQ baselines on the described multimodal and language tasks; improvements shown in benchmark metrics and recovery percentages in the paper's experimental section.
BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs.
Empirical evaluation reported in the paper: experiments on multiple multimodal large language models (MLLMs) and standard LLMs using an aggressive W4A4KV16 quantization setup; performance reported as percentage of full-precision performance recovered (specific models, benchmark names, and exact sample sizes not enumerated in the summary).
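For context on what the format involves, the sketch below applies plain round-to-nearest MXFP4 quantization to one block, following the layout in the OCP Microscaling specification: 32 elements share one power-of-two scale, and each element is a 4-bit E2M1 value. This is the naive baseline that PTQ methods try to improve on, not BATQuant itself. The reading of W4A4KV16 as 4-bit weights, 4-bit activations, and a 16-bit KV cache follows common notation and is an assumption about the paper. Tie-breaking here picks the smaller magnitude, a simplification of hardware rounding.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1.
FP4_E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32
E2M1_EMAX = 2  # largest element exponent: 6.0 = 1.5 * 2**2


def quantize_block(block):
    """Return (shared_scale, dequantized_values) for one block of floats."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest value lands near the top
    # of the E2M1 range.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    dequantized = []
    for v in block:
        magnitude = min(abs(v) / scale, FP4_E2M1_VALUES[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_VALUES, key=lambda q: abs(q - magnitude))
        dequantized.append(math.copysign(nearest * scale, v))
    return scale, dequantized


# Hypothetical block: four non-zero values padded with zeros to 32 elements.
example_block = [1.0, -3.0, 6.0, 0.3] + [0.0] * (BLOCK_SIZE - 4)
shared_scale, restored = quantize_block(example_block)
print(shared_scale, restored[:4])
```

Because every element in a block shares one scale, a single outlier forces small values toward zero or the coarsest grid steps, which is why methods that reshape the distribution before quantization can recover accuracy.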