Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
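
In several rows the Total exceeds the sum of the four direction columns (for example, Other: 749 + 196 + 98 + 892 = 1935 against a Total of 1984). The page does not say what accounts for the difference. A minimal Python sketch for checking any row, assuming a dash stands for zero; the function name is illustrative and not part of the source:

```python
def unclassified(positive, negative, mixed, null, total):
    """Return how many claims in a row are not covered by the four direction columns."""
    def as_int(cell):
        # Assumption: a dash in the table stands for zero claims.
        return 0 if cell in ("—", "-", "") else int(cell)
    return as_int(total) - sum(as_int(c) for c in (positive, negative, mixed, null))

# Rows copied from the table above.
print(unclassified(749, 196, 98, 892, 1984))   # Other -> 49
print(unclassified(817, 394, 188, 121, 1544))  # Governance & Regulation -> 24
print(unclassified(17, 17, 17, "—", 51))       # Labor Share of Income -> 0
```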

New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.

Methodological recommendation in the paper listing specific metric categories for future empirical work.

high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed metrics for assessing tacit LLM capabilities

Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.

Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.

high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... types of empirical tests recommended for validating the thesis

The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.

Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.

high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... extent of empirical/quantitative evidence presented

The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.

Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.

high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... presence/absence of empirical experiments in the paper

LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).

Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.

high null result HindSight: Evaluating LLM-Generated Research Ideas via Futur... LLM-judge evaluation metric (e.g., LLM-assigned quality/novelty scores for gener...

MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic/physically-rich than prior datasets).

Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.

high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset characteristics: levels of occlusion, object variety, and annotated obje...

MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).

Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)

high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset contents: object 3D shapes, object poses, object contact/interaction ann...

The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.

Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.

high null result Internalizing Agency from Reflective Experience N/A (algorithmic procedure description rather than an outcome)
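
The three steps described for LEAFE (summarize feedback, backtrack and re-explore, distill) can be illustrated with a toy Python sketch. Everything here is hypothetical: the `Step` type, the function names, and the simplification that the failure is caused at the step where it surfaces, whereas the paper backtracks to causally linked earlier decision points. The final step only formats supervised examples; it does not fine-tune a model.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str
    action: str
    feedback: str  # raw environment feedback; "ok" means success

def summarize(trajectory):
    """Compress feedback into compact experience items (here: failing steps only)."""
    return [(i, s.feedback) for i, s in enumerate(trajectory) if s.feedback != "ok"]

def backtrack_and_repair(trajectory, alternatives, env_step):
    """Re-explore alternative actions at the first failing step.

    Returns a corrected trajectory, or None if no alternative succeeds.
    """
    experience = summarize(trajectory)
    if not experience:
        return list(trajectory)
    fail_idx, _ = experience[0]  # toy assumption: the decision point is the failing step
    prefix = list(trajectory[:fail_idx])
    state = trajectory[fail_idx].state
    for action in alternatives.get(state, []):
        feedback = env_step(state, action)
        if feedback == "ok":
            return prefix + [Step(state, action, feedback)]
    return None

def to_sft_examples(trajectory):
    """Turn a corrected trajectory into (prompt, completion) pairs for supervised fine-tuning."""
    return [{"prompt": s.state, "completion": s.action} for s in trajectory]

# Toy environment: only "use_key" works at the door.
def env_step(state, action):
    return "ok" if action == "use_key" else "door locked"

failed = [Step("hall", "walk", "ok"), Step("door", "push", "door locked")]
repaired = backtrack_and_repair(failed, {"door": ["kick", "use_key"]}, env_step)
examples = to_sft_examples(repaired)
```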

Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.

Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.

high null result Learning to Present: Inverse Specification Rewards for Agent... Human-quality proxy scores and comparative model rankings

The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.

Reward design section enumerating each component and how they contribute to the composite reward used in RL training.

high null result Learning to Present: Inverse Specification Rewards for Agent... Components of the reward signal used for RL training
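
One common way to combine components like these is a normalized weighted sum. The sketch below assumes that form; the summary does not state how the paper aggregates its components, and the component keys and weights are hypothetical.

```python
def composite_reward(scores, weights):
    """Weighted average of per-component scores, each assumed to lie in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight

# Hypothetical weights; the paper's actual values are not given in this summary.
WEIGHTS = {
    "structural_validation": 1.0,
    "render_quality": 1.0,
    "aesthetic_score": 1.0,
    "content_quality": 2.0,
    "inverse_specification": 1.0,
}

scores = {name: 1.0 for name in WEIGHTS}
print(composite_reward(scores, WEIGHTS))  # 1.0 when every component is perfect
```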

The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.

Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.

high null result Learning to Present: Inverse Specification Rewards for Agent... Environment capabilities: OpenEnv compatibility and tool-use support

Code for the environment and experiments is released at the specified GitHub repository.

Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.

high null result Learning to Present: Inverse Specification Rewards for Agent... Availability of experiment code (GitHub repo)

The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.

Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.

high null result Learning to Present: Inverse Specification Rewards for Agent... Number of rollout trajectories in dataset (288) and coverage across models (6)

Evaluation was conducted on 48 diverse business briefs across six models.

Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.

high null result Learning to Present: Inverse Specification Rewards for Agent... Number of evaluation tasks (48 briefs) and number of models compared (6)
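
The reported counts are consistent with one rollout per model per brief, although the summary does not state that pairing explicitly. A quick check in Python:

```python
briefs = 48
models = 6
reported_rollouts = 288

# Under the one-rollout-per-pair assumption this ratio should be exactly 1.
rollouts_per_pair = reported_rollouts / (briefs * models)
print(rollouts_per_pair)  # 1.0
```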

Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.

Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.

high null result Learning to Present: Inverse Specification Rewards for Agent... Source of demonstration prompts (Claude Opus 4.6)

Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.

Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.

high null result Learning to Present: Inverse Specification Rewards for Agent... Proportion of model parameters updated during training (0.5%)

Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.

Summary states measured benefits were qualitative and process metrics; no detailed quantitative throughput/coverage numbers provided. (Meta-claim about the evidence reported.)

high null result ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... absence of detailed quantitative validation metrics in the reported results

Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.

Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.

high null result Anticipatory Planning for Multimodal AI Agents benchmark task performance (task success, generalization)

Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.

Method summary explicitly lists these objectives as part of the TraceR1 training procedure.

high null result Anticipatory Planning for Multimodal AI Agents global plan consistency and stepwise execution outcomes

TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.

Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.

high null result Anticipatory Planning for Multimodal AI Agents forecast horizon (short-horizon) / tractability of predictions

During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).

Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.

high null result Anticipatory Planning for Multimodal AI Agents policy adaptation to tool execution feedback / tool-compatibility of executed ac...

Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.

Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.

high null result Anticipatory Planning for Multimodal AI Agents step-level execution accuracy and executability

TraceR1 uses a two-stage training procedure. Stage 1 applies trajectory-level RL to predicted short-horizon trajectories, with rewards that enforce global consistency.

Method description in summary: Stage 1 — Trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.

high null result Anticipatory Planning for Multimodal AI Agents trajectory-level plan coherence / global consistency

Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.

Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.

high null result When AI Navigates the Fog of War evaluation components (verifiability checks, qualitative coding, longitudinal an...

The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.

Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.

high null result When AI Navigates the Fog of War number of temporal nodes

The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.

Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.

high null result When AI Navigates the Fog of War number and type of questions/prompts used

Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.

Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.

high null result Runtime Governance for AI Agents: Policies on Paths existence of empirical research gaps (identified/not identified)

No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.

Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.

high null result Runtime Governance for AI Agents: Policies on Paths use of empirical data (presence/absence of large-scale empirical evaluation)

Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.

Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).

high null result Runtime Governance for AI Agents: Policies on Paths existence of calibrated thresholds and procedures (presence/absence)
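
To make the calibration problem concrete, a threshold table that maps an estimated violation probability to an enforcement action might look like the sketch below. The thresholds and action names are hypothetical; choosing them well is exactly the open problem the paper identifies.

```python
from bisect import bisect_right

# Hypothetical cut points and actions, ordered from least to most restrictive.
THRESHOLDS = [0.1, 0.5, 0.9]
ACTIONS = ["allow", "log", "require_approval", "block"]

def enforcement_action(violation_probability):
    """Map a violation probability in [0, 1] to an enforcement action."""
    if not 0.0 <= violation_probability <= 1.0:
        raise ValueError("probability must be within [0, 1]")
    # bisect_right counts how many thresholds are <= the probability.
    return ACTIONS[bisect_right(THRESHOLDS, violation_probability)]

print(enforcement_action(0.05))  # allow
print(enforcement_action(0.95))  # block
```

Lower thresholds block more actions (more false positives and higher blocking cost); higher thresholds let more violations through.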

Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.

Limitations section stating recruitment sources, small N, and bias toward severe cases.

high null result Characterizing Delusional Spirals through Human-LLM Chat Log... representativeness and generalizability of the sample

The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).

Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.

high null result Characterizing Delusional Spirals through Human-LLM Chat Log... size of dataset (number of users and total messages)

Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.

Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.

high null result Deep Learning-Driven Black-Box Doherty Power Amplifier with ... number of fabricated prototypes and test frequency (2.75 GHz)

Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.

List of proposed metrics in the paper's evaluation agenda.

high null result Designing for Disagreement: Front-End Guardrails for Assista... the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...

The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).

Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.

high null result Designing for Disagreement: Front-End Guardrails for Assista... presence/absence of empirical data in the paper (binary)

The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.

Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.

high null result Designing for Disagreement: Front-End Guardrails for Assista... availability and specificity of contest channels (system functionality)

The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.

Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.

high null result Designing for Disagreement: Front-End Guardrails for Assista... legibility of active mode (user understanding at time of deferral)

The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.

Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.

high null result Designing for Disagreement: Front-End Guardrails for Assista... existence of governance-approved admissible modes (system design property)
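
The admissible-mode constraint and the legibility requirement can be sketched together in Python: the robot may only activate a mode from a fixed, pre-approved menu, and each mode carries the explanation shown to the user. The mode names and wording are hypothetical and do not come from the paper.

```python
# Hypothetical governance-approved menu: mode name -> user-facing explanation.
APPROVED_MODES = {
    "first_come_first_served": "I am helping people in the order they asked.",
    "urgency_first": "I am helping the most urgent request first.",
}

def activate_mode(requested):
    """Return (mode, explanation) if the mode is approved; otherwise refuse."""
    if requested not in APPROVED_MODES:
        raise ValueError(f"mode {requested!r} is not on the approved menu")
    return requested, APPROVED_MODES[requested]

mode, explanation = activate_mode("urgency_first")
print(f"Active mode: {mode}. {explanation}")
```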

Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.

Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... operational stability, efficiency, and robustness/degradation metrics

Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.

Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... baseline agent architectures used for comparison

Eight state-of-the-art LLMs were evaluated in the study.

Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... number of LLMs evaluated (n = 8)

The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).

Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... agent architectural modularity (temporal decomposition into strategy vs executio...
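
The strategy/execution split with different update cadences can be illustrated with a toy Python agent that re-plans its strategy every few steps and selects an action at every step. The class, the retail rule, and the update period are hypothetical and are not taken from the paper.

```python
class TwoTierAgent:
    """Toy two-tier agent: slow strategy updates, fast action selection."""

    def __init__(self, plan_strategy, choose_action, strategy_period):
        if strategy_period < 1:
            raise ValueError("strategy_period must be at least 1")
        self.plan_strategy = plan_strategy
        self.choose_action = choose_action
        self.strategy_period = strategy_period
        self.strategy = None
        self.step_count = 0

    def act(self, observation):
        # The strategy is refreshed only every `strategy_period` steps.
        if self.step_count % self.strategy_period == 0:
            self.strategy = self.plan_strategy(observation)
        self.step_count += 1
        # An action is chosen at every step under the current strategy.
        return self.choose_action(self.strategy, observation)

def plan_strategy(obs):
    return "restock" if obs["stock"] < 10 else "hold"

def choose_action(strategy, obs):
    return 20 if strategy == "restock" else 0  # units to order

agent = TwoTierAgent(plan_strategy, choose_action, strategy_period=5)
orders = [agent.act({"stock": s}) for s in (5, 50, 50, 50, 50, 50)]
print(orders)  # [20, 20, 20, 20, 20, 0]
```

The agent keeps ordering for four steps after stock recovers because the strategy is only revisited at step 5, which shows the cost of a slow strategic cadence as well as its stability.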

RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).

Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... environment difficulty gradient (complexity/stochasticity/non-stationarity level...

The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).

Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.

high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... benchmark realism and coverage of non-stationarity for long-horizon decision-mak...

Roughly 25% of the training corpus is Italian-language data.

Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.

high null result EngGPT2: Sovereign, Efficient and Open Intelligence percentage share of Italian-language tokens in the training corpus

The model was trained on approximately 2.5 trillion tokens of data.

Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.

high null result EngGPT2: Sovereign, Efficient and Open Intelligence total number of training tokens

Approximately 3 billion parameters are active per inference (sparse activation).

Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.

high null result EngGPT2: Sovereign, Efficient and Open Intelligence active parameters used per inference

EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.

Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.

high null result EngGPT2: Sovereign, Efficient and Open Intelligence model architecture and total parameter count
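
The reported EngGPT2 figures imply two derived quantities: the fraction of parameters active per inference and the approximate number of Italian-language tokens. A short Python calculation, treating the approximate values in the summary as exact inputs:

```python
total_params = 16e9      # reported total parameter count
active_params = 3e9      # reported active parameters per inference (approximate)
total_tokens = 2.5e12    # reported training tokens (approximate)
italian_share = 0.25     # reported Italian-language share (approximate)

active_fraction = active_params / total_params
italian_tokens = total_tokens * italian_share

print(f"{active_fraction:.2%} of parameters active")   # 18.75% of parameters active
print(f"{italian_tokens:.3e} Italian-language tokens")  # 6.250e+11 Italian-language tokens
```

Because the inputs are rounded in the source, both results are rough estimates: roughly 19% of parameters active and on the order of 0.6 trillion Italian tokens.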

The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).

Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.

high null result Fanar 2.0: Arabic Generative AI Stack existence and intended domain of specialized models

FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.

Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.

high null result Fanar 2.0: Arabic Generative AI Stack model existence, size (4B), and intended function (bilingual moderation)

Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.

Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.

high null result Fanar 2.0: Arabic Generative AI Stack model lineage/architecture (Fanar-27B ← Gemma-3-27B)
