The Commonplace

Evidence (4793 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
Active filter: Productivity
This study implements a Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm to control AVs and trains it using the NGSIM highway dataset to enable realistic interaction with human-driven vehicles.
Methodological description in the paper: implementation of TD3 and training on the NGSIM dataset. Dataset referenced but no numeric sample size reported in the claim text.
high positive Macroscopic Characteristics of Mixed Traffic Flow with Deep ... method used for AV control (TD3 trained on NGSIM)
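The core TD3 update this method relies on can be sketched in a few lines. The snippet below is a minimal numpy illustration of TD3's clipped double-Q target with target-policy smoothing, not the paper's implementation; `actor_target`, `q1_target`, and `q2_target` are hypothetical stand-ins for the trained target networks.

```python
import numpy as np

def td3_target(reward, next_state, gamma, q1_target, q2_target,
               actor_target, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic target: r + gamma * min(Q1', Q2') at a smoothed target action."""
    # Target policy smoothing: perturb the target actor's action with clipped noise.
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = np.clip(actor_target(next_state) + noise, -act_limit, act_limit)
    # Clipped double-Q: take the minimum of the twin target critics.
    q_min = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))
    return reward + gamma * q_min
```

Both twin critics regress toward this single target; taking the minimum is what counters the overestimation bias of vanilla DDPG.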
The result is evidence-based triggers that replace calendar schedules and make governance auditable.
Claimed outcome of applying the decision-theoretic framework in the paper (argumentative; no empirical deployment or case-study evidence reported in the summary).
high positive Retraining as Approximate Bayesian Inference retraining trigger design and governance auditability
The paper provides a decision-theoretic framework for retraining policies.
Explicit claim about the paper's contribution; the article presents a framework (conceptual/methodological exposition).
high positive Retraining as Approximate Bayesian Inference existence of a prescriptive framework for retraining policies
The retraining decision is a cost minimization problem with a threshold that falls out of the loss function.
Decision-theoretic derivation presented in the paper (analytical/theoretical reasoning; no empirical validation reported).
high positive Retraining as Approximate Bayesian Inference formalization of retraining decision rule (cost-minimization/threshold)
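A minimal sketch of such a cost-minimization rule, assuming a one-off retraining cost and a fixed serving horizon; the function and parameter names are illustrative, not the paper's notation.

```python
def should_retrain(current_risk, post_retrain_risk, retrain_cost, horizon):
    """Retrain iff the expected loss avoided over the horizon exceeds the cost.

    current_risk / post_retrain_risk: expected per-period loss without/with retraining.
    retrain_cost: one-off cost of retraining, in the same units as the loss.
    horizon: number of periods the retrained model will be served.
    """
    expected_saving = (current_risk - post_retrain_risk) * horizon
    return expected_saving > retrain_cost
```

Rearranged, the decision threshold on the risk gap is `retrain_cost / horizon`: retrain exactly when the per-period risk reduction exceeds the amortized cost, which is the sense in which the threshold "falls out of the loss function."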
Retraining can be better understood as approximate Bayesian inference under computational constraints.
Theoretical argument and decision-theoretic framing presented in the paper (conceptual/mathematical derivation rather than empirical testing).
high positive Retraining as Approximate Bayesian Inference conceptual framing of retraining
The analysis was pre-registered and code and data are publicly available.
Authors' statement in the abstract/paper declaring pre-registration and public release of code and data.
high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... research transparency (pre-registration and public code/data)
The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement — a distinction with direct implications for model selection, deployment, and human-AI collaboration.
Interpretation and implications drawn from empirical results showing dissociations between calibration metrics and metacognitive measures (meta-d', M-ratio, criterion shifts); argument that this distinction informs practical decisions about model use.
high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... distinction between true metacognitive capacity and apparent calibration driven ...
We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials.
Experimental methods reported in the paper listing the four model variants and total trial count (224,000 factual QA trials).
high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... empirical evaluation of models' Type-1 and Type-2 metrics across factual QA tria...
We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.
Methodological contribution described in the paper: specification of a Type-2 SDT framework and use of meta-d' and M-ratio as measurement constructs.
high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... decomposition of Type-1 vs Type-2 capacities using meta-d' and M-ratio
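The Type-1 quantities underlying this framework are straightforward to compute; full meta-d' estimation additionally requires fitting the Type-2 ROC, which is omitted here. A minimal sketch, assuming the standard equal-variance Gaussian SDT model:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # probit transform

def type1_sdt(hit_rate, fa_rate):
    """Type-1 signal detection: sensitivity d' and decision criterion c."""
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))
    return d_prime, criterion

def m_ratio(meta_d_prime, d_prime):
    """Metacognitive efficiency: meta-d' expressed in units of Type-1 d'."""
    return meta_d_prime / d_prime
```

An M-ratio near 1 means confidence reports use all the information in the Type-1 decision; well below 1 means the model "appears calibrated" without tracking its own accuracy, which is the dissociation the card describes.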
The best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search.
Analysis comparing origins of the best final designs vs. their ILP ranking, reported across the 12-kernel benchmark set.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... origin/ranking of best designs relative to ILP candidates
Larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×.
Per-benchmark empirical results reported for streamcluster and kmeans in the evaluation.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup for specific benchmarks
Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline.
Empirical evaluation across the reported benchmark set comparing performance with 1 agent versus 10 agents; mean speedup stated in the results.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup relative to baseline
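Mean speedups over a benchmark set are conventionally computed as a geometric mean of per-benchmark ratios; the excerpt does not say which mean the paper uses, so the sketch below shows the geometric version as an illustration.

```python
import math

def geomean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-benchmark speedup ratios (baseline / optimized)."""
    ratios = [b / o for b, o in zip(baseline_times, optimized_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

The geometric mean is the usual choice for ratios because it is symmetric under inversion: a 2x gain and a 2x loss average to 1x rather than 1.25x.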
We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS.
Experimental setup described in the paper reporting evaluation on 12 kernels drawn from HLS-Eval and Rodinia-HLS, using Claude Code (Opus 4.5/4.6) and AMD Vitis HLS.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... evaluation dataset and toolchain used
In Stage 2, the pipeline launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.
Method section describing Stage 2 which runs multiple expert agents exploring cross-function optimizations on top ILP solutions.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 2 expert-agent exploration of cross-function optimizations
In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint.
Method section describing Stage 1 decomposition, per-sub-kernel optimization and ILP assembly under an area constraint.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 1 decomposition and ILP-based assembly
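The assembly step can be pictured as a constrained selection problem: choose one optimized configuration per sub-kernel to minimize total latency under an area budget. The exhaustive sketch below is a stand-in for the paper's ILP formulation; the `(latency, area)` candidate tuples and single-objective latency model are simplifying assumptions.

```python
from itertools import product

def assemble(candidates, area_budget):
    """Pick one configuration per sub-kernel, minimizing total latency
    subject to a total area constraint (brute-force stand-in for an ILP).

    candidates: one list per sub-kernel of (latency, area) options.
    Returns (total_latency, chosen_configs), or None if infeasible.
    """
    best = None
    for combo in product(*candidates):
        area = sum(a for _, a in combo)
        if area > area_budget:
            continue  # violates the area constraint
        latency = sum(l for l, _ in combo)
        if best is None or latency < best[0]:
            best = (latency, combo)
    return best
```

An actual ILP solver replaces the exponential `product` enumeration with binary selection variables and the same budget constraint, which is what makes the approach scale past a handful of sub-kernels.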
We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.
Method description in the paper describing the design and implementation of the two-stage 'agent factory' pipeline.
high positive Agent Factories for High Level Synthesis: How Far Can Genera... existence and design of the two-stage agent factory pipeline
Deployment validation across 43 classrooms demonstrated an 18x efficiency gain in the assessment workflow.
Field deployment described in the paper: system was validated across 43 classrooms and an efficiency gain of 18x in the assessment workflow is reported.
high positive When AI Meets Early Childhood Education: Large Language Mode... efficiency of the assessment workflow (time/resources per assessment)
Interaction2Eval achieves up to 88% agreement with human expert judgments.
Reported evaluation results comparing Interaction2Eval outputs to human expert annotations (rubric-based judgments) on the dataset.
high positive When AI Meets Early Childhood Education: Large Language Mode... agreement between AI-generated assessments and human expert judgments
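Agreement with expert judgments is typically reported as raw percent agreement; whether the 88% figure is chance-corrected is not stated in the excerpt. A minimal sketch of both raw agreement and Cohen's kappa:

```python
def agreement_rate(model_labels, expert_labels):
    """Fraction of items where the model's judgment matches the expert's."""
    matches = sum(m == e for m, e in zip(model_labels, expert_labels))
    return matches / len(model_labels)

def cohens_kappa(model_labels, expert_labels):
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(model_labels)
    po = agreement_rate(model_labels, expert_labels)
    labels = set(model_labels) | set(expert_labels)
    pe = sum((model_labels.count(l) / n) * (expert_labels.count(l) / n)
             for l in labels)
    return (po - pe) / (1 - pe)
```

On rubric scales with few levels, raw agreement can be inflated by chance matches, so kappa is often reported alongside it.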
Interaction2Eval, an LLM-based framework, addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, rubric-based reasoning).
Methodological description in the paper: a specialized LLM-based pipeline designed to handle listed domain challenges; presented as the approach used to extract structured quality indicators.
high positive When AI Meets Early Childhood Education: Large Language Mode... capability to handle domain-specific technical challenges in automated assessmen...
TEPE-TCI-370h is the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations.
Authors' dataset construction and description: 370 hours of recorded interactions from 105 classrooms, annotated with ECQRS-EC and SSTEW rubrics as reported in the paper.
high positive When AI Meets Early Childhood Education: Large Language Mode... availability of a large-scale annotated dataset for preschool teacher-child inte...
All data and models are publicly released.
Statement in abstract asserting public release of datasets and models.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... public availability of data and models
CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.
Authors' claim about potential use-cases and research enabled by the dataset; forward-looking/qualitative statement.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... support for various research directions (capability to enable research)
CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
Dataset/benchmark description in paper: UI-Vision benchmark and GroundCUA counts (56,000 screenshots, >3,600,000 UI element annotations).
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and scope of GroundCUA (annotated screenshots and UI element annotations) a...
Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).
Argument made in paper contrasting continuous video to sparse screenshots/final click coordinates; conceptual/logical claim about information content and transformability.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... information content and transformability of continuous video vs. sparse data
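The superset argument can be illustrated by projecting a dense per-frame cursor trace onto the sparse click-event format used by existing agent frameworks; the trace layout below is a hypothetical schema, not CUA-Suite's actual format.

```python
def clicks_from_trace(trace):
    """Project a dense cursor trace onto sparse click events.

    trace: list of (t, x, y, button_down) samples, e.g. one per video frame.
    Returns the (t, x, y) of each press transition. Because the dense trace
    contains everything, this projection is lossless in the direction that
    matters: click-based formats can always be derived, never the reverse.
    """
    clicks, prev_down = [], False
    for t, x, y, down in trace:
        if down and not prev_down:  # rising edge of the button = a click
            clicks.append((t, x, y))
        prev_down = down
    return clicks
```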
VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
Dataset description and counts reported in paper: ~10,000 tasks, 87 applications, 30 fps, ~55 hours, ~6,000,000 frames, plus annotation modalities.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and modality coverage of the VideoCUA dataset (tasks, hours, frames, annota...
Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.
Cites/references recent literature (stated in abstract) asserting the importance of continuous video over sparse screenshots.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... importance of continuous video vs. sparse screenshots for scaling CUAs
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.
Statement in paper's introduction/abstract; conceptual claim based on prior literature and motivation for the work.
high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... promise/ability to automate complex desktop workflows
Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities.
Author statement referencing extensive offline evaluations showing these capabilities; no metrics, datasets, or sample sizes provided in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query recognition and user profiling performance
OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback.
Methodological description of an optimization/feedback component in the paper; no empirical quantification of mitigation or user-feedback effects provided in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... mitigation of reward hacking from single-metric optimization and alignment with ...
OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' latent but precise e-commerce intentions, going beyond log-fitting through implicit in-context learning.
Methodological description of the training pipeline in the paper; no direct quantitative evidence or ablation results given in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... ability to infer latent user intent beyond behavior logs
OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference.
Methodological description of the proposed module in the paper; no standalone evaluation numbers for this module provided in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query understanding capability (depth of understanding vs. shallow semantic matc...
OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
Author claim in the paper stating mitigation of these issues and no added inference/latency costs; no quantitative measures, benchmarks, or latency numbers provided in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... information bubbles and long-tail sparsity (and inference/serving latency)
Manual evaluation confirms gains in query-item relevance, with a +1.37% improvement.
Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.
Manual evaluation confirms gains in search experience quality, with +1.65% in page good rate.
Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.
OneSearch-V2 increases order volume by +2.11% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
OneSearch-V2 increases buyer conversion rate by +3.05% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
OneSearch-V2 increases item CTR by +3.98% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
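Lifts like these are conventionally the relative difference between treatment and control rates; the paper's exact metric definitions are not given in the excerpt, so the one-liner below is only the standard convention.

```python
def relative_lift(control_rate, treatment_rate):
    """Relative lift of treatment over control, as usually reported in A/B tests."""
    return (treatment_rate - control_rate) / control_rate
```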
OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits.
Author assertion describing OneSearch as industrial-scale and commercially/operationally beneficial; no supporting numerical evidence or sample size reported in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... commercial and operational benefits
Generative Retrieval (GR) offers advantages over multi-stage cascaded architectures such as end-to-end joint optimization and high computational efficiency.
Statement in paper positioning GR as a promising paradigm and listing these advantages; no quantitative study or sample size reported in the excerpt.
high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... computational efficiency and ability to perform end-to-end joint optimization
Automation in Japanese manufacturing increased even during periods of slow productivity growth.
Empirical finding from applying the framework to industry-level data in Japanese manufacturing; comparison of inferred automation trends with observed productivity growth periods (exact sample/time not provided in the summary).
high positive The macroeconomics of automation trend in automation versus productivity growth (automation increased despite slo...
Applying the framework to Japanese manufacturing industries shows that automation increased through capital deepening.
Empirical application of the theoretical framework to Japanese manufacturing industries (industry-level analysis); estimation/inference using industry macro observables. (Paper states result; exact sample size/time span not provided in the summary.)
high positive The macroeconomics of automation increase in automation (share of tasks by capital) attributable to capital deepe...
The model provides a transparent mapping from standard macroeconomic observables (capital-labor ratio, output per worker, elasticity of substitution) into the degree of automation, allowing automation to be measured without relying on technology-specific indicators.
Theoretical mapping derived from the CES structure that links observable macro variables to the endogenous degree of automation; methodological claim about inference procedure.
high positive The macroeconomics of automation degree of automation inferred from macro observables
Aggregating task-level decisions generates a CES production function in which the economy-wide degree of automation emerges endogenously.
Analytical derivation in the paper: aggregation of task-level adoption decisions yields a CES aggregate production function with endogenous automation parameter.
high positive The macroeconomics of automation form of aggregate production function / emergence of economy-wide automation par...
The degree of automation is defined as the share of tasks performed by capital rather than labor.
Explicit model definition provided in the paper (conceptual/theoretical definition).
high positive The macroeconomics of automation share of tasks performed by capital
The degree of automation in the aggregate economy emerges endogenously as an equilibrium outcome and can be inferred from standard macroeconomic data.
Theoretical development in a task-based production framework with endogenous technology adoption; mapping from model to observable macro variables (capital-labor ratio, output per worker, elasticity of substitution).
high positive The macroeconomics of automation degree of automation (economy-wide share of tasks performed by capital)
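One task-based CES form makes the mapping concrete. Assume per-worker output y = [m^(1/sigma) * k^((sigma-1)/sigma) + (1-m)^(1/sigma)]^(sigma/(sigma-1)) with capital per worker k, elasticity sigma > 1, and automation degree m; this exact functional form is an illustration, not necessarily the paper's. Under sigma > 1 output is single-peaked in m, so m can be recovered numerically on the branch where more automation raises output:

```python
def implied_y(m, k, sigma):
    """Output per worker under a task-based CES aggregate with automation
    degree m (share of tasks done by capital), capital per worker k,
    and elasticity of substitution sigma."""
    rho = (sigma - 1.0) / sigma
    return (m ** (1.0 / sigma) * k ** rho
            + (1.0 - m) ** (1.0 / sigma)) ** (1.0 / rho)

def infer_automation(y, k, sigma, tol=1e-12):
    """Invert the mapping: recover m from observed (y, k, sigma).

    Assumes sigma > 1, so implied_y is single-peaked in m, and that the
    economy sits on the branch where automation raises output (m below
    the output-maximizing level)."""
    lo, hi = 1e-9, 1.0 - 1e-9
    # Ternary search for the output-maximizing m.
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if implied_y(m1, k, sigma) < implied_y(m2, k, sigma):
            lo = m1
        else:
            hi = m2
    m_peak = 0.5 * (lo + hi)
    # Bisect on the increasing branch [0, m_peak].
    lo, hi = 1e-9, m_peak
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if implied_y(mid, k, sigma) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

This is the sense in which the degree of automation "can be inferred from standard macroeconomic data": given the capital-labor ratio, output per worker, and the elasticity of substitution, m is pinned down without any technology-specific indicator.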
These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
Conclusion drawn by the authors based on their implementation, token reduction, and reported accuracy/latency-related claims; generalization to large-scale production is asserted but not supported by detailed production deployment metrics in the excerpt.
high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... feasibility of production-grade text-to-SQL (precision and latency)
The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy).
Reported empirical evaluation comparing the authors' system to a prompt-engineered baseline (Gemini Flash 2.0) with explicit performance percentages for execution success and semantic accuracy; no sample size, test set composition, statistical significance, or evaluation protocol provided in the excerpt.
high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... execution success rate; semantic accuracy
The approach replaces costly external API calls with efficient local inference.
System design claim: the model is self-hosted and performs local inference instead of using external API-based LLM calls; no cost accounting or latency benchmarks provided in the excerpt.
high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... use of external API calls vs local inference (cost/efficiency implication)
This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100.
Reported measurement comparing input token counts before and after applying their approach (explicit numerical baseline and resulting counts provided); no sample size or distribution of token counts reported.
A novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts.
Methodological description (two-phase supervised fine-tuning) and claim that this internalization removes reliance on long-context prompts; no detailed experimental protocol or sample size provided in the excerpt.
high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... need for long-context prompts / model internalization of schema