The Commonplace

Evidence (6491 claims)

Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 - 53
Worker Turnover 11 12 3 - 26
Industry 1 1
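As a reading aid, the rows of the matrix can be converted into direction shares. The sketch below copies four rows from the table and divides each count by the sum of the four listed directions rather than by the Total column, since the two do not always agree. The name `direction_shares` and the choice of rows are illustrative assumptions, not part of the site.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20),
    "AI Safety & Ethics": (218, 277, 65, 33),
    "Job Displacement": (12, 80, 20, 1),
    "Skill Obsolescence": (5, 46, 6, 1),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the listed counts for one outcome."""
    total = sum(counts)
    return {name: count / total for name, count in zip(DIRECTIONS, counts)}


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    dominant = max(shares, key=shares.get)  # direction with the largest share
    print(f"{outcome}: {dominant} {shares[dominant]:.1%}")
```

Dividing by the sum of the listed directions keeps the shares adding up to one even in rows where the Total column is larger than that sum.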
There is consistent evidence of productivity improvements from generative AI in workplace settings, driven by task automation, decision support, and knowledge augmentation.
Synthesis of findings across the 40 included empirical and conceptual studies (review-level conclusion summarising multiple studies reporting productivity effects).
high positive Generative AI in the Workplace: A Systematic Review of Produ... productivity improvements (via task automation, decision support, knowledge augm...
Under the concurrent AI-assisted decision-making paradigm, the explanatory interface of the AI system significantly improves immediate task performance.
Randomized controlled experiment (N=120 total) comparing concurrent and sequential paradigms, with and without an explanatory interface; the paper reports a statistically significant improvement in immediate task performance under the concurrent paradigm.
high positive How AI-Assisted Decision-Making Paradigms and Explainability... immediate task performance (task execution stage)
Effective AI governance requires stronger policy capacity, clearer allocation of responsibility, and governance mechanisms that remain robust across divergent technological futures.
Conclusion of the article based on its analysis of uncertainty, adoption dynamics, and framework proposals; grounded in cited policy and scholarly sources.
high positive Governing frontier general-purpose AI in the public sector: ... requirements for effective AI governance (policy capacity, responsibility alloca...
The article proposes an adaptive governance framework for public institutions that integrates capability monitoring, risk tiering, conditional controls, institutional learning, and standards-based interoperability.
Normative framework proposed in the article, derived from the paper's synthesis of foresight reports and governance scholarship.
high positive Governing frontier general-purpose AI in the public sector: ... components and design of an adaptive governance framework for AI
The article reconstructs the conceptual foundations of the 'evidence dilemma', differentiated AI risk categories, and the limits of prediction.
Declared analytic activity in the article, based on synthesis of the International AI Safety Report 2026, OECD foresight, and recent scholarship.
high positive Governing frontier general-purpose AI in the public sector: ... conceptual framing of evidence gaps, AI risk typology, and prediction limits
Public governance for frontier AI should be based on adaptive risk management, scenario-aware regulation, and sociotechnical transformation rather than static compliance models.
Normative recommendation made by the article, supported by conceptual analysis and references to adaptive governance literature and policy documents.
high positive Governing frontier general-purpose AI in the public sector: ... preferred governance approach for frontier AI
Recent evidence indicates that AI capabilities are advancing rapidly, though unevenly.
Statement in article referencing recent empirical/foresight sources, e.g. International AI Safety Report 2026 and OECD foresight documents (sources cited in the paper).
high positive Governing frontier general-purpose AI in the public sector: ... rate and distribution of AI capability advancement
The governance of frontier general-purpose artificial intelligence has become a public-sector problem of institutional design, not merely a technical issue of model performance.
Conceptual argument presented in the article, drawing on synthesis of policy reports (International AI Safety Report 2026, OECD foresight) and scholarship in digital government.
high positive Governing frontier general-purpose AI in the public sector: ... public-sector institutional design requirements for frontier AI governance
We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance.
System description and implementation presented in the paper: an assistant combining egocentric video and gaze overlays to detect potential user difficulties and provide retrospective help.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... system capability (gaze-grounded multimodal assistance)
Gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.
Authors' synthesis and interpretation of controlled-study results (n=36) showing improved recall, perceived accuracy/personalization, and more efficient interactions under the gaze-aware condition.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... cognitive outcomes (e.g., recall) and reasoning about cognitive needs
Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions.
Behavioral measure recorded during the controlled study (n=36): word count of user speech in gaze-aware vs text-only conditions; authors report a statistically significant reduction in words spoken in the gaze-aware condition.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... number of words spoken by users (conversational length/effort)
The gaze-aware assistant significantly improved people's ability to recall information.
Controlled study (n=36) comparing recall performance between gaze-aware and text-only assistant conditions; authors report a statistically significant improvement in recall for the gaze-aware condition.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... information recall (memory performance)
Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more personalized in its assessments of users' reading behavior.
Between-subjects controlled study (n=36) using user ratings of personalization for the gaze-aware vs text-only assistant; authors report a statistically significant increase in perceived personalization for the gaze-aware condition.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... perceived personalization of assistant assessments
Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate in its assessments of users' reading behavior.
Between-subjects controlled study (n=36) comparing user ratings of the gaze-aware assistant vs a text-only LLM; authors report a statistically significant difference in perceived accuracy of assessments.
high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... perceived accuracy of assistant assessments of reading behavior
By extending traditional technology acceptance models (TAM) with AI-specific dimensions—namely transparency, data quality, and trust—this study contributes to the literature on decision-making in complex systems and offers practical insights for organizations seeking to improve decision effectiveness through AI-based support.
Authors' stated contribution in abstract/introduction; conceptual model extension and empirical tests reported in the paper (survey N = 324 and PLS-SEM results).
high positive Decision-Making in Complex Systems Using AI-Based Decision S... conceptual/methodological contribution and practical insights
Intention to adopt AI-DSS demonstrates a strong association with decision-making efficiency (β = 0.544, p < 0.001).
PLS-SEM path coefficient reported in results (β = 0.544, p < 0.001) linking intention to adopt and decision-making efficiency, estimated from survey data (N = 324).
high positive Decision-Making in Complex Systems Using AI-Based Decision S... decision-making efficiency
Perceived usefulness (β = 0.352, p < 0.001), trust (β = 0.311, p < 0.001), and perceived ease of use (β = 0.135, p < 0.05) exert significant positive effects on the intention to adopt AI-DSS.
PLS-SEM path coefficients and significance levels reported for predictors of intention to adopt, based on the questionnaire sample (N = 324).
high positive Decision-Making in Complex Systems Using AI-Based Decision S... intention to adopt AI-DSS
Perceived ease of use significantly affects perceived usefulness (β = 0.597, p < 0.001).
PLS-SEM estimate reported in paper (β = 0.597, p < 0.001) from the survey of 324 respondents.
high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived usefulness of AI-DSS
Trust positively influences perceived ease of use of AI-DSS (β = 0.482, p < 0.001).
PLS-SEM path coefficient reported in results (β = 0.482, p < 0.001) based on the questionnaire sample (N = 324).
high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived ease of use of AI-DSS
Trust positively influences perceived usefulness of AI-DSS (β = 0.229, p < 0.01).
PLS-SEM path coefficient reported in results (β = 0.229, p < 0.01) from the survey data (N = 324).
high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived usefulness of AI-DSS
Data transparency and quality strongly enhance trust in AI-based decision support systems (AI-DSS) (β = 0.784, p < 0.001).
PLS-SEM estimate reported in results (standardized path coefficient β = 0.784, p < 0.001) based on the survey of 324 respondents.
high positive Decision-Making in Complex Systems Using AI-Based Decision S... trust in AI-based decision support systems
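The path coefficients listed in the claims above can be chained to approximate indirect and total effects. The sketch below applies textbook path tracing (multiply coefficients along each directed path, then sum over paths). It assumes the listed paths form the complete recursive structural model, which the excerpts do not confirm. The construct names are shorthand chosen here, and the resulting totals are not figures reported by the paper.

```python
# Standardized path coefficients as listed in the claims above.
PATHS = {
    ("data_quality", "trust"): 0.784,
    ("trust", "ease_of_use"): 0.482,
    ("trust", "usefulness"): 0.229,
    ("ease_of_use", "usefulness"): 0.597,
    ("usefulness", "intention"): 0.352,
    ("trust", "intention"): 0.311,
    ("ease_of_use", "intention"): 0.135,
    ("intention", "efficiency"): 0.544,
}


def total_effect(source, target):
    """Sum the products of coefficients over every directed path source -> target."""
    if source == target:
        return 1.0
    return sum(
        beta * total_effect(mid, target)
        for (start, mid), beta in PATHS.items()
        if start == source
    )


trust_to_intention = total_effect("trust", "intention")
indirect = trust_to_intention - PATHS[("trust", "intention")]
print(f"total={trust_to_intention:.3f}, indirect={indirect:.3f}")
```

Under these assumptions, the indirect routes through perceived ease of use and perceived usefulness add to the direct trust coefficient of 0.311. The recursion terminates only because the listed paths contain no cycles.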
Evidence-based frameworks for structural redesign that prioritize network density, decision proximity to information sources, and cross-boundary coordination mechanisms are foundational prerequisites for organizational agility.
Concluding synthesis of reviewed literature and empirical cases leading to proposed frameworks. The provided text labels the frameworks 'evidence-based' but does not present quantitative validation or implementation trial results in the excerpt.
The article draws on empirical cases from manufacturing, technology platforms, and healthcare delivery across North America, Europe, and East Asia to support its arguments.
Statement in the article that empirical cases from those sectors and regions were analyzed. The provided text does not specify the number of cases, selection criteria, or methodologies for the case analyses.
high positive People Don't Follow Strategy—They Follow Structure: Why Orga... breadth of empirical support (cross-sector, cross-region cases)
Structural reconfiguration enables adaptive behaviors that resist cultivation under traditional pyramid architectures, regardless of cultural interventions.
Claim derived from comparative analysis and empirical case studies referenced in the article; presented as an observation across cases from multiple industries and regions. No explicit statistical tests or counts reported in the provided text.
high positive People Don't Follow Strategy—They Follow Structure: Why Orga... adaptive behaviors / organizational adaptability
Flattening hierarchies and redistributing authority to operational edges fundamentally rewires information flow, decision velocity, and collaborative patterns.
Argument based on synthesis of research on organizational modularity and structural determinants of behavior; described as supported by empirical cases across sectors (manufacturing, technology platforms, healthcare). No numerical sample sizes or formal experimental details provided.
high positive People Don't Follow Strategy—They Follow Structure: Why Orga... information flow, decision velocity, collaborative patterns
Formal structure—specifically hierarchical configuration and decision-making architecture—exerts greater influence on employee behavior than culture change initiatives or compensation redesign.
Synthesis of organizational behavior, network science, and comparative institutional research cited in the article; stated comparison between structural determinants and culture/incentive interventions. No sample size or statistical details reported in the text provided.
Experimental evidence confirms that AI tools raise worker productivity.
Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).
A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.
Paper describes an interception layer in the evaluation infrastructure that prevents actual final submissions on production sites.
high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? evaluation_safety (prevention of real-world side effects)
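The interception idea can be illustrated with a small sketch: every request passes through, except the one that matches the final submission, which is recorded and blocked. This is a hypothetical illustration, not ClawBench's implementation. The class `SubmissionInterceptor`, the method-and-URL-prefix matching rule, and the `shop.example` URLs are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    method: str
    url: str


class SubmissionInterceptor:
    """Lets traffic through but records and blocks the final submission request."""

    def __init__(self, final_method, final_url_prefix):
        self.final_method = final_method.upper()
        self.final_url_prefix = final_url_prefix
        self.captured = []  # blocked requests, kept for later scoring

    def handle(self, request):
        """Return True if the request may proceed, False if it was blocked."""
        is_final = (
            request.method.upper() == self.final_method
            and request.url.startswith(self.final_url_prefix)
        )
        if is_final:
            self.captured.append(request)
            return False
        return True


interceptor = SubmissionInterceptor("POST", "https://shop.example/checkout/confirm")
steps = [
    Request("GET", "https://shop.example/cart"),
    Request("POST", "https://shop.example/cart/add"),
    Request("POST", "https://shop.example/checkout/confirm"),
]
allowed = [interceptor.handle(step) for step in steps]
print(allowed, len(interceptor.captured))
```

Keeping the blocked request in `captured` means an evaluator could still inspect what the agent tried to submit, without the submission reaching the live site.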
Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction.
Methodological description in the paper: evaluation occurs on live (production) websites rather than offline static sandboxes; supported by reported coverage of 144 live platforms.
high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? evaluation_realism / fidelity to real-world interactions
The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly.
Paper description of task types and the capabilities they require; based on the design and composition of the 153 tasks.
high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? task_complexity / capability_requirements
ClawBench spans 144 live platforms across 15 categories.
Paper explicitly reports coverage across 144 production websites and 15 task categories (dataset description).
high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? benchmark_scope (platforms and categories)
ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work.
Paper states the benchmark comprises 153 tasks (dataset description).
high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? benchmark_scope (number of tasks)
When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.
Normative conclusion in the abstract based on the paper's proposed framework and discussion; presented as an overall benefit but not supported by empirical outcomes or quantified gains in the excerpt.
high positive Large Language Models: An Applied Econometric Framework expansion of empirical economics research capabilities
For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates.
Methodological claim in the abstract advocating use of a small validation sample together with LLM outputs to achieve consistent/precise estimates; no empirical demonstration or sample-size specification provided in the excerpt.
high positive Large Language Models: An Applied Econometric Framework consistency and precision of downstream estimates derived from LLM-measured vari...
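One simple way to combine LLM outputs with a small validation sample is a bias-corrected mean: take the LLM-based estimate on the full sample, then add the average LLM error measured where human labels exist. The sketch below illustrates that general idea and is not necessarily the estimator the paper proposes. The function name and the toy labels are assumptions.

```python
from statistics import mean


def corrected_mean(llm_all, llm_val, human_val):
    """Plug-in LLM mean plus the average LLM error measured on the validation sample."""
    if len(llm_val) != len(human_val):
        raise ValueError("validation labels must be paired")
    plug_in = mean(llm_all)
    correction = mean(h - l for h, l in zip(human_val, llm_val))
    return plug_in + correction


# Toy data: 1 = document expresses the concept, 0 = it does not.
llm_all = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # LLM labels, full sample
llm_val = [1, 1, 0, 1]                     # LLM labels, validation subset
human_val = [1, 0, 0, 1]                   # human labels, same subset

print(mean(llm_all), corrected_mean(llm_all, llm_val, human_val))
```

In this toy case the LLM over-labels one of the four validation documents, so the correction pulls the plug-in estimate downward. With so few validation items the correction is itself noisy, which is the trade-off a real application would need to quantify.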
The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems.
Claim of contribution in the abstract describing a methodological framework (the excerpt reports the existence of the framework but does not detail empirical validation or sample sizes).
high positive Large Language Models: An Applied Econometric Framework methodological framework for empirical use of LLMs
Researchers can now revisit old questions and tackle novel ones with rich data using LLMs.
Asserted in the paper's abstract as a consequence of LLM-enabled large-scale text analysis; no empirical demonstration or quantified case described in the excerpt.
high positive Large Language Models: An Applied Econometric Framework ability to (re)address research questions using textual data
Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost.
Stated as an assertion in the paper's abstract/summary; based on the authors' framing of LLM capabilities (no empirical sample, experiment, or quantified result provided in the excerpt).
high positive Large Language Models: An Applied Econometric Framework ability to analyze text at scale and at low cost
All data, code, and model responses are open-sourced.
Statement in the paper asserting that data, code, and model outputs are publicly released.
high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... availability of study materials (data, code, responses)
78.7% of observed AI interactions are augmentation, not automation.
Empirical classification of AI interactions (from cross-referenced Anthropic Economic Index interactions/tasks) reported as a percentage in the paper.
high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... share of AI interactions classified as augmentation vs automation
The study cross-references the SAFI benchmark with real-world AI adoption data from the Anthropic Economic Index covering 756 occupations and 17,998 tasks.
Data linkage described in the paper: use of Anthropic Economic Index as real-world AI adoption dataset (numbers reported in text).
high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... occupations and tasks coverage in cross-reference dataset
The benchmark covers 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy.
Reported dataset construction in the paper: 263 tasks mapped to 35 O*NET skills.
high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... coverage of O*NET skills by benchmark tasks
We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash) across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate).
Empirical benchmark executed by the authors: 263 text-based tasks mapped to 35 O*NET skills, 4 LLMs, 1,052 total model calls reported, and reported 0% failure rate.
high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... benchmark coverage and execution success (model calls and failure rate)
The paper argues for a fundamental decoupling of semantic intent from human-readable representation.
Conceptual/design claim made by the authors as a recommended shift in representation strategy for agentic consumers; presented as argumentation rather than empirically tested in abstract.
high positive Beyond Human-Readable: Rethinking Software Engineering Conve... alignment between semantic intent encoding and human-readable formats
We extend the semantic density principle to propose rehabilitation of classical anti-patterns and introduce the program skeleton concept for agentic code navigation.
Design/position claims and proposed constructs presented in the paper (program skeleton concept and re-evaluation of anti-patterns) without empirical validation reported in abstract.
high positive Beyond Human-Readable: Rethinking Software Engineering Conve... suitability of classical anti-patterns and program skeletons for agentic navigat...
Aggressive compression reduced input tokens by 17%.
Reported numeric result from the controlled experiment comparing compressed logs to other conditions; sample size not specified in abstract.
We propose a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value.
Proposal/design principle presented in the paper; theoretical justification provided and, according to the paper, subsequently validated by experiment.
high positive Beyond Human-Readable: Rethinking Software Engineering Conve... information/content efficiency of token representations for agentic consumers
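A rough illustration of the semantic density idea follows: remove log content that carries no new information (blank lines, leading timestamps, consecutive repeats) and keep the rest. This is a hypothetical sketch, not the paper's compression method. The timestamp pattern, the `(xN)` repeat marker, and the whitespace-based token count are assumptions, and the example does not attempt to reproduce the reported 17% reduction.

```python
import re

# Assumed log format: an ISO-like timestamp at the start of each line.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*\s+")


def compress_log(lines):
    """Drop blank lines, strip leading timestamps, collapse consecutive repeats."""
    compressed = []
    previous, repeats = None, 0
    for raw in lines:
        line = TIMESTAMP.sub("", raw.strip())
        if not line:
            continue
        if line == previous:
            repeats += 1
            continue
        if previous is not None:
            compressed.append(previous if repeats == 1 else f"{previous} (x{repeats})")
        previous, repeats = line, 1
    if previous is not None:
        compressed.append(previous if repeats == 1 else f"{previous} (x{repeats})")
    return compressed


def token_count(lines):
    """Crude proxy: whitespace-separated tokens, not a real LLM tokenizer."""
    return sum(len(line.split()) for line in lines)


log = [
    "2024-01-01 10:00:00 INFO retrying connection",
    "2024-01-01 10:00:01 INFO retrying connection",
    "",
    "2024-01-01 10:00:02 ERROR connection refused",
]
small = compress_log(log)
print(small, token_count(log), token_count(small))
```

Whether a timestamp really carries zero information depends on the task, so a real pipeline would need to decide that per field rather than strip it unconditionally.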
ImplicitMemBench reframes evaluation from 'what agents recall' to 'what they automatically enact'.
Paper framing statement positioning the benchmark's conceptual contribution as shifting evaluation focus to implicit, automatic behavior rather than explicit recall.
high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... evaluation framing / measurement focus
Top performers were DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%).
Paper lists top model names with reported overall percentage scores from the benchmark evaluation.
high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... overall accuracy on the implicit memory benchmark
The benchmark's 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.
Paper states the suite size (300 items) and describes a unified Learning/Priming-Interfere-Test protocol and that scoring is done on first attempts.
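First-attempt scoring can be sketched as follows: after the learning (or priming) and interference phases, only the first response at the test phase counts, and later corrections earn no credit. The `Item` fields, the exact-match rule, and the toy responses are assumptions for illustration, not the benchmark's actual schema or grader.

```python
from dataclasses import dataclass


@dataclass
class Item:
    construct: str  # hypothetical label for the construct the item targets
    expected: str   # behavior the test phase should elicit
    attempts: list  # model responses at the test phase, in order


def first_attempt_score(items):
    """Fraction of items whose first test-phase response matches the expected behavior."""
    if not items:
        return 0.0
    hits = sum(
        1 for item in items if item.attempts and item.attempts[0] == item.expected
    )
    return hits / len(items)


items = [
    Item("procedural", "use_tool_b", ["use_tool_b"]),
    Item("priming", "ocean", ["forest", "ocean"]),  # later correction earns no credit
    Item("conditioning", "avoid", ["avoid"]),
    Item("procedural", "use_tool_b", []),            # no response counts as a miss
]
print(first_attempt_score(items))
```

Scoring only the first attempt targets what the model does automatically, rather than what it can reach after retries.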
ImplicitMemBench operationalizes three cognitively grounded constructs from cognitive science: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (CS–US associations shaping first decisions).
Paper description of benchmark design explicitly listing the three constructs and brief operational definitions for each.