Evidence (7448 claims)
| Topic | Claims |
|---|---|
| Adoption | 5267 |
| Productivity | 4560 |
| Governance | 4137 |
| Human-AI Collaboration | 3103 |
| Labor Markets | 2506 |
| Innovation | 2354 |
| Org Design | 2340 |
| Skills & Training | 1945 |
| Inequality | 1322 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
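The direction counts can be converted into per-outcome shares. The sketch below uses three rows copied from the matrix. Two choices are assumptions of this sketch rather than anything the matrix states: a dash is read as zero, and any gap between the four direction columns and the reported Total is kept as an "unclassified" share (for Firm Productivity the four columns sum to 389 against a Total of 394).

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
# Assumption: a dash in the matrix is read as 0.
ROWS = {
    "Firm Productivity": (277, 34, 68, 10, 394),
    "Job Displacement": (5, 31, 12, 0, 48),
    "Developer Productivity": (29, 3, 3, 1, 36),
}

def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the reported total."""
    counted = positive + negative + mixed + null
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        # Assumption: claims not covered by the four columns stay visible.
        "unclassified": (total - counted) / total,
    }

for outcome, counts in ROWS.items():
    shares = direction_shares(*counts)
    print(outcome, {name: round(value, 3) for name, value in shares.items()})
```

Dividing by the reported Total rather than by the sum of the four columns keeps the shares comparable with the Total column as published.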
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).
Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.
Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.
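The three LEAFE steps (summarize feedback, backtrack and re-explore, distill) can be illustrated on a toy problem. Everything in this sketch is hypothetical: the counter environment, the "moves away from the target" rule standing in for causal attribution, and the function names. It is not the paper's implementation, and the final supervised fine-tuning step is represented only by the list of corrected (state, action) pairs.

```python
def run(actions, start=0):
    """Toy environment (hypothetical): 'inc' adds 1 to the state, 'dec' subtracts 1."""
    state, steps = start, []
    for action in actions:
        steps.append((state, action))
        state += 1 if action == "inc" else -1
    return steps, state

def first_faulty_step(steps, target):
    """Stand-in for causal attribution: first step that moves away from the target."""
    for i, (state, action) in enumerate(steps):
        nxt = state + (1 if action == "inc" else -1)
        if abs(target - nxt) > abs(target - state):
            return i
    return None

def repair(actions, target, max_rounds=10):
    """Backtrack to the faulty step, try the other action, and re-run."""
    actions, experience = list(actions), []
    for _ in range(max_rounds):
        steps, final = run(actions)
        if final == target:
            return steps, experience
        i = first_faulty_step(steps, target)
        if i is None:
            break
        # Compact "experience item" summarizing the environment feedback.
        experience.append(f"step {i}: '{actions[i]}' moved away from {target}")
        actions[i] = "dec" if actions[i] == "inc" else "inc"
    return None, experience

corrected, notes = repair(["inc", "dec", "inc"], target=3)
# Corrected (state, action) pairs would serve as supervised fine-tuning examples.
sft_pairs = corrected or []
print(notes, sft_pairs)
```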
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
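One simple way to combine such components is a weighted average. The sketch below assumes each component is already normalized to [0, 1]; the weights and key names are hypothetical, since the summary lists the components but not how they are weighted or aggregated.

```python
# Hypothetical weights; the source names the components but not their weighting.
WEIGHTS = {
    "structural_validation": 0.2,
    "render_quality": 0.2,
    "aesthetic_score": 0.2,
    "content_quality": 0.3,
    "inverse_specification": 0.1,
}

def composite_reward(scores, weights=WEIGHTS):
    """Weighted average of component scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight

example = {
    "structural_validation": 1.0,
    "render_quality": 0.5,
    "aesthetic_score": 0.5,
    "content_quality": 1.0,
    "inverse_specification": 0.0,
}
print(round(composite_reward(example), 3))
```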
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
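The 288 rollouts reported for the SlideRL dataset are consistent with these evaluation counts if each model produced exactly one rollout per brief. That one-rollout-per-pair reading is an assumption; the summary does not state it.

```python
briefs = 48
models = 6
rollouts_per_pair = 1  # assumption: one rollout per (brief, model) pair

expected_rollouts = briefs * models * rollouts_per_pair
print(expected_rollouts)
```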
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
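For a sense of scale, the trainable parameter count can be estimated from the reported fraction. The sketch takes the nominal "7B" in the model name at face value, which is an assumption; the summary does not give the exact parameter count, so the result is only an order-of-magnitude figure.

```python
# Assumption: nominal size taken from the "7B" in the model name.
nominal_params = 7_000_000_000

# 0.5% expressed with integer arithmetic (5 per 1000) to avoid float rounding.
trainable_params = nominal_params * 5 // 1000
frozen_params = nominal_params - trainable_params

print(f"trainable: about {trainable_params:,}")
print(f"frozen:    about {frozen_params:,}")
```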
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
Summary states that the measured benefits were qualitative or process-level; no detailed quantitative throughput or coverage numbers are provided. (Meta-claim about the evidence reported.)
Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.
Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.
Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.
Method summary explicitly lists these objectives as part of the TraceR1 training procedure.
TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.
Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.
During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).
Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.
Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.
Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.
Stage 1 of TraceR1's two-stage training procedure applies trajectory-level RL to predicted short-horizon trajectories, with rewards that enforce global consistency.
Method description in summary: Stage 1 — Trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
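The calibration problem can be made concrete as a threshold table that maps an estimated violation probability to an enforcement action. The thresholds and action names below are hypothetical placeholders; choosing them well is exactly the open problem the paper describes.

```python
from bisect import bisect_right

# Hypothetical cut-points and actions; not taken from the paper.
THRESHOLDS = [0.05, 0.30, 0.70]
ACTIONS = ["allow", "log", "escalate", "block"]

def enforcement_action(p_violation):
    """Map an estimated violation probability to an enforcement action."""
    if not 0.0 <= p_violation <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # bisect_right counts how many thresholds are <= p_violation.
    return ACTIONS[bisect_right(THRESHOLDS, p_violation)]

for p in (0.01, 0.10, 0.50, 0.90):
    print(p, enforcement_action(p))
```

Lowering a threshold blocks more actions (more false positives and lost task completion); raising it lets more violations through (more false negatives).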
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
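The two reported figures imply a large average log per user. The summary gives no per-user breakdown, so the mean below may hide substantial skew across the 19 users.

```python
total_messages = 391_562  # user + chatbot messages, as reported
users = 19

mean_messages_per_user = total_messages / users
print(round(mean_messages_per_user))
```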
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; the paper proposes contest metrics and an evaluation agenda but reports no empirical data.
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
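The three design elements (an approved menu of modes, legibility at the contact point, and an outcome-specific contest channel) can be sketched as a small data structure. The mode names, the contents of the approved menu, and the function names are all hypothetical; the paper specifies the pattern, not this code.

```python
from enum import Enum

class Mode(Enum):
    # Hypothetical prioritization modes for illustration only.
    FIRST_COME = "first come, first served"
    ACCESSIBILITY_PRIORITY = "priority for accessibility needs"
    EMERGENCY = "emergency response first"

# Hypothetical governance-approved menu: only these modes may be activated.
APPROVED_MENU = frozenset({Mode.FIRST_COME, Mode.ACCESSIBILITY_PRIORITY})

contests = []  # outcome-specific contest records

def activate(mode, reason):
    """Activate an approved mode and return the message shown to the user."""
    if mode not in APPROVED_MENU:
        raise PermissionError(f"{mode.name} is not on the approved menu")
    return f"Active mode: {mode.value}. Reason: {reason}."

def contest(decision_id, note):
    """Record a contest against one decision without changing the menu."""
    contests.append({"decision_id": decision_id, "note": note})
    return len(contests)

print(activate(Mode.ACCESSIBILITY_PRIORITY, "wheelchair user requested help"))
print(contest("decision-17", "I was waiting longer"))
```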
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
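Metrics of this kind could be computed along the following lines. The exact definitions used with RetailBench are not given in the summary, so the formulas, the failure threshold, and the sample numbers here are assumptions chosen for illustration.

```python
from statistics import pvariance

def stability(profits, failure_threshold):
    """Variance of per-episode profit and share of episodes below a threshold."""
    failures = sum(p < failure_threshold for p in profits)
    return {
        "variance": pvariance(profits),
        "failure_rate": failures / len(profits),
    }

def degradation(scores_by_level):
    """Relative drop in score from the easiest to the hardest environment."""
    first, last = scores_by_level[0], scores_by_level[-1]
    return (first - last) / first

# Made-up numbers, not results from the paper.
print(stability([10, 10, 10, -5], failure_threshold=0))
print(degradation([100, 80, 50]))
```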
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
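The separation of update cadences can be illustrated with a toy inventory loop in which a slow tier revises a stock target every few steps and a fast tier places an order at every step. The inventory rule, the 1.5x safety factor, and all names are hypothetical; in the paper both tiers are LLM-driven, which this sketch does not attempt to reproduce.

```python
def run_episode(demands, strategy_period=3):
    """Two-tier loop: slow strategy updates, fast per-step execution."""
    strategy = {"target_stock": 10}  # hypothetical initial strategy
    stock, recent, log = 10, [], []
    for t, demand in enumerate(demands):
        # Slow tier: revise the strategy only every `strategy_period` steps.
        if t > 0 and t % strategy_period == 0:
            strategy["target_stock"] = round(1.5 * sum(recent) / len(recent))
            recent = []
        # Fast tier: pick this step's action under the current strategy.
        order = max(0, strategy["target_stock"] - stock)
        stock += order
        sold = min(stock, demand)
        stock -= sold
        recent.append(demand)
        log.append((t, strategy["target_stock"], order, sold))
    return log

for entry in run_episode([4, 4, 4, 8, 8, 8]):
    print(entry)
```

Because the strategy is an explicit value that changes only at known steps, the log shows directly when and how the high-level plan was revised.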
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
Roughly 25% of the training corpus is Italian-language data.
Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.
The model was trained on approximately 2.5 trillion tokens of data.
Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.
Approximately 3 billion parameters are active per forward pass (sparse activation).
Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.
EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.
Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.
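The reported EngGPT2-16B-A3B figures imply two derived quantities: the share of parameters active per forward pass and the approximate size of the Italian portion of the corpus. All inputs are the rounded values quoted in the summary, so the outputs are approximations.

```python
total_params = 16_000_000_000   # reported total (about 16B)
active_params = 3_000_000_000   # reported active per forward pass (about 3B)
training_tokens = 2_500_000_000_000  # reported corpus size (about 2.5T)

active_fraction = active_params / total_params
italian_tokens = training_tokens // 4  # "roughly 25%" of the corpus

print(f"active fraction: {active_fraction:.4f}")
print(f"Italian tokens:  about {italian_tokens:,}")
```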
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
The Fanar 2.0 training corpus is a curated set totalling approximately 120 billion high-quality tokens organized into three data 'recipes' emphasizing Arabic and cross-lingual relevance.
Paper reports a curated corpus of ~120B high-quality tokens split across three data recipes; emphasis on relevance and quality for Arabic and cross-lingual performance.
Training and operations for Fanar 2.0 were performed on-premises using 256 NVIDIA H100 GPUs at QCRI.
Paper states compute and infrastructure: training and operations performed on 256 NVIDIA H100 GPUs, fully on-premises at QCRI (HBKU).
Experiments were conducted on three benchmarks and across multiple LLM families to evaluate generation, scoring, calibration, robustness, and efficiency dimensions.
Data & Methods section summary in the paper stating systematic evaluation across three benchmarks and a variety of LLMs and verifiers.
Complete provenance of training data is often unavailable, so contamination detection is imperfect and some leakage may be undetectable (or overestimated in some categories).
Authors' stated limitation about unavailable/partial training-data provenance and methodological caveats for the lexical-matching pipeline and behavioral probes.
Results are specific to MMLU; contamination levels and effects may differ on other benchmarks or newer models.
Authors' limitations: experiments were conducted only on the MMLU dataset (513 questions) and on the listed six models; generalizability is therefore uncertain.
A three-layer evaluation framework was applied systematically: Layer 1 = syntactic validity; Layer 2 = semantic correctness; Layer 3 = hardware executability (with sublayer 3b = end-to-end evaluation on quantum hardware).
Methods section describes application of a three-layer evaluation framework to each reviewed system, including the explicit sublayer 3b definition.
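A layered evaluation of this kind can be sketched as an ordered list of checks. The sketch assumes the layers are gated, so a system is only tested at a layer after passing the previous one; the summary does not say whether the paper applies them that way. The check functions and the dictionary describing a generated program are hypothetical stand-ins.

```python
# Hypothetical checks; real ones would parse, simulate, and run on hardware.
LAYERS = [
    ("Layer 1: syntactic validity", lambda p: p["parses"]),
    ("Layer 2: semantic correctness", lambda p: p["matches_reference"]),
    ("Layer 3: hardware executability", lambda p: p["fits_device"]),
    ("Layer 3b: end-to-end on quantum hardware", lambda p: p["ran_on_hardware"]),
]

def evaluate(program, layers=LAYERS):
    """Return the names of the layers passed, stopping at the first failure."""
    passed = []
    for name, check in layers:
        if not check(program):
            break
        passed.append(name)
    return passed

candidate = {
    "parses": True,
    "matches_reference": True,
    "fits_device": False,
    "ran_on_hardware": False,
}
print(evaluate(candidate))
```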