Evidence (6491 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
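The Total column does not always equal the sum of the four direction columns shown. The sketch below recomputes the row sums for a few rows copied from the table and reports the difference. The source does not explain the gap, so reading it as claims with some other or unrecorded direction is only an assumption. The names `ROWS` and `direction_gap` are illustrative.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
# A dash cell in the table is treated as 0 here (an assumption).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Labor Share of Income": (17, 19, 17, 0, 53),
}

def direction_gap(row):
    """Return total minus the sum of the four listed direction counts."""
    positive, negative, mixed, null, total = row
    return total - (positive + negative + mixed + null)

gaps = {name: direction_gap(row) for name, row in ROWS.items()}
for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
```

For these rows the gap ranges from 0 (Job Displacement, Labor Share of Income) to 50 (Other), so the Total column is best read as the authoritative count rather than a derived sum.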
Human-AI Collaboration
There is consistent evidence of productivity improvements from generative AI in workplace settings, driven by task automation, decision support, and knowledge augmentation.
Synthesis of findings across the 40 included empirical and conceptual studies (review-level conclusion summarising multiple studies reporting productivity effects).
Under the concurrent AI-assisted decision-making paradigm, the explanatory interface of the AI system significantly improves immediate task performance.
Randomized controlled experiment (N=120 total) comparing concurrent vs sequential paradigms, with and without an explanatory interface; the authors report a statistically significant improvement in immediate task performance under the concurrent paradigm.
Effective AI governance requires stronger policy capacity, clearer allocation of responsibility, and governance mechanisms that remain robust across divergent technological futures.
Conclusion of the article based on its analysis of uncertainty, adoption dynamics, and framework proposals; grounded in cited policy and scholarly sources.
The article proposes an adaptive governance framework for public institutions that integrates capability monitoring, risk tiering, conditional controls, institutional learning, and standards-based interoperability.
Normative framework proposed in the article, derived from the paper's synthesis of foresight reports and governance scholarship.
The article reconstructs the conceptual foundations of the 'evidence dilemma', differentiated AI risk categories, and the limits of prediction.
Declared analytic activity in the article, based on synthesis of the International AI Safety Report 2026, OECD foresight, and recent scholarship.
Public governance for frontier AI should be based on adaptive risk management, scenario-aware regulation, and sociotechnical transformation rather than static compliance models.
Normative recommendation made by the article, supported by conceptual analysis and references to adaptive governance literature and policy documents.
Recent evidence indicates that AI capabilities are advancing rapidly, though unevenly.
Statement in article referencing recent empirical/foresight sources, e.g. International AI Safety Report 2026 and OECD foresight documents (sources cited in the paper).
The governance of frontier general-purpose artificial intelligence has become a public-sector problem of institutional design, not merely a technical issue of model performance.
Conceptual argument presented in the article, drawing on synthesis of policy reports (International AI Safety Report 2026, OECD foresight) and scholarship in digital government.
We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance.
System description and implementation presented in the paper: an assistant combining egocentric video and gaze overlays to detect potential user difficulties and provide retrospective help.
Gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.
Authors' synthesis and interpretation of controlled-study results (n=36) showing improved recall, perceived accuracy/personalization, and more efficient interactions under the gaze-aware condition.
Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions.
Behavioral measure recorded during the controlled study (n=36): word count of user speech in gaze-aware vs text-only conditions; authors report a statistically significant reduction in words spoken in the gaze-aware condition.
The gaze-aware assistant significantly improved people's ability to recall information.
Controlled study (n=36) comparing recall performance between gaze-aware and text-only assistant conditions; authors report a statistically significant improvement in recall for the gaze-aware condition.
Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more personalized in its assessments of users' reading behavior.
Between-subjects controlled study (n=36) using user ratings of personalization for the gaze-aware vs text-only assistant; authors report a statistically significant increase in perceived personalization for the gaze-aware condition.
Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate in its assessments of users' reading behavior.
Between-subjects controlled study (n=36) comparing user ratings of the gaze-aware assistant vs a text-only LLM; authors report a statistically significant difference in perceived accuracy of assessments.
By extending traditional technology acceptance models (TAM) with AI-specific dimensions—namely transparency, data quality, and trust—this study contributes to the literature on decision-making in complex systems and offers practical insights for organizations seeking to improve decision effectiveness through AI-based support.
Authors' stated contribution in abstract/introduction; conceptual model extension and empirical tests reported in the paper (survey N = 324 and PLS-SEM results).
Intention to adopt AI-DSS demonstrates a strong association with decision-making efficiency (β = 0.544, p < 0.001).
PLS-SEM path coefficient reported in results (β = 0.544, p < 0.001) linking intention to adopt and decision-making efficiency, estimated from survey data (N = 324).
Perceived usefulness (β = 0.352, p < 0.001), trust (β = 0.311, p < 0.001), and perceived ease of use (β = 0.135, p < 0.05) exert significant positive effects on the intention to adopt AI-DSS.
PLS-SEM path coefficients and significance levels reported for predictors of intention to adopt, based on the questionnaire sample (N = 324).
Perceived ease of use significantly affects perceived usefulness (β = 0.597, p < 0.001).
PLS-SEM estimate reported in paper (β = 0.597, p < 0.001) from the survey of 324 respondents.
Trust positively influences perceived ease of use of AI-DSS (β = 0.482, p < 0.001).
PLS-SEM path coefficient reported in results (β = 0.482, p < 0.001) based on the questionnaire sample (N = 324).
Trust positively influences perceived usefulness of AI-DSS (β = 0.229, p < 0.01).
PLS-SEM path coefficient reported in results (β = 0.229, p < 0.01) from the survey data (N = 324).
Data transparency and quality strongly enhance trust in AI-based decision support systems (AI-DSS) (β = 0.784, p < 0.001).
PLS-SEM estimate reported in results (standardized path coefficient β = 0.784, p < 0.001) based on the survey of 324 respondents.
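The reported paths can be chained to gauge indirect effects. The sketch below applies the standard product-of-coefficients rule from path analysis to the coefficients listed in the claims above. The excerpt does not report indirect or total effects, so the derived numbers are illustrative and carry no significance test. The dictionary layout and the name `indirect_effect` are assumptions.

```python
# Standardized path coefficients as reported in the claims above.
PATHS = {
    ("transparency_quality", "trust"): 0.784,
    ("trust", "ease_of_use"): 0.482,
    ("trust", "usefulness"): 0.229,
    ("ease_of_use", "usefulness"): 0.597,
    ("usefulness", "intention"): 0.352,
    ("trust", "intention"): 0.311,
    ("ease_of_use", "intention"): 0.135,
    ("intention", "efficiency"): 0.544,
}

def indirect_effect(*nodes):
    """Multiply the coefficients along a chain of constructs."""
    effect = 1.0
    for source, target in zip(nodes, nodes[1:]):
        effect *= PATHS[(source, target)]
    return effect

# Trust -> ease of use -> usefulness, plus the direct trust -> usefulness path.
trust_to_usefulness_indirect = indirect_effect("trust", "ease_of_use", "usefulness")
trust_to_usefulness_total = PATHS[("trust", "usefulness")] + trust_to_usefulness_indirect

print(round(trust_to_usefulness_indirect, 3))  # 0.288
print(round(trust_to_usefulness_total, 3))     # 0.517
```

On these numbers, the route through perceived ease of use contributes slightly more to trust's association with perceived usefulness than the direct path does.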
Evidence-based frameworks for structural redesign that prioritize network density, decision proximity to information sources, and cross-boundary coordination mechanisms are foundational prerequisites for organizational agility.
Concluding synthesis of reviewed literature and empirical cases leading to proposed frameworks. The provided text labels the frameworks 'evidence-based' but does not present quantitative validation or implementation trial results in the excerpt.
The article draws on empirical cases from manufacturing, technology platforms, and healthcare delivery across North America, Europe, and East Asia to support its arguments.
Statement in the article that empirical cases from those sectors and regions were analyzed. The provided text does not specify the number of cases, selection criteria, or methodologies for the case analyses.
Structural reconfiguration enables adaptive behaviors that resist cultivation under traditional pyramid architectures, regardless of cultural interventions.
Claim derived from comparative analysis and empirical case studies referenced in the article; presented as an observation across cases from multiple industries and regions. No explicit statistical tests or counts reported in the provided text.
Flattening hierarchies and redistributing authority to operational edges fundamentally rewires information flow, decision velocity, and collaborative patterns.
Argument based on synthesis of research on organizational modularity and structural determinants of behavior; described as supported by empirical cases across sectors (manufacturing, technology platforms, healthcare). No numerical sample sizes or formal experimental details provided.
Formal structure—specifically hierarchical configuration and decision-making architecture—exerts greater influence on employee behavior than culture change initiatives or compensation redesign.
Synthesis of organizational behavior, network science, and comparative institutional research cited in the article; stated comparison between structural determinants and culture/incentive interventions. No sample size or statistical details reported in the text provided.
Experimental evidence confirms that AI tools raise worker productivity.
Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).
A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.
Paper describes an interception layer in the evaluation infrastructure that prevents actual final submissions on production sites.
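One way to picture such a layer is a request filter that lets all traffic through except the single request that would commit the task. The sketch below is a hypothetical illustration, not ClawBench's implementation: the `Request` type, the `FINAL_SUBMIT_RULES` patterns, and the example URLs are all invented for the example.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    method: str
    url: str

# Hypothetical rules: (HTTP method, URL pattern) pairs that mark the final submission.
FINAL_SUBMIT_RULES = [
    ("POST", re.compile(r"/checkout/confirm$")),
    ("POST", re.compile(r"/applications/submit$")),
]

def intercept(request, captured):
    """Return True to forward the request, False to block it.

    Blocked requests are appended to `captured` so they can be evaluated later.
    """
    for method, pattern in FINAL_SUBMIT_RULES:
        if request.method == method and pattern.search(request.url):
            captured.append(request)
            return False
    return True

captured = []
browse = Request("GET", "https://shop.example/cart")
draft = Request("POST", "https://shop.example/cart/items")
final = Request("POST", "https://shop.example/checkout/confirm")
decisions = [intercept(r, captured) for r in (browse, draft, final)]
print(decisions)  # [True, True, False]
```

Because intermediate reads and writes are forwarded, the agent still faces the live site's real behavior; only the captured final request is withheld and inspected.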
Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction.
Methodological description in the paper: evaluation occurs on live (production) websites rather than offline static sandboxes; supported by reported coverage of 144 live platforms.
The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly.
Paper description of task types and the capabilities they require; based on the design and composition of the 153 tasks.
ClawBench spans 144 live platforms across 15 categories.
Paper explicitly reports coverage across 144 production websites and 15 task categories (dataset description).
ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work.
Paper states the benchmark comprises 153 tasks (dataset description).
When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.
Normative conclusion in the abstract based on the paper's proposed framework and discussion; presented as an overall benefit but not supported by empirical outcomes or quantified gains in the excerpt.
For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates.
Methodological claim in the abstract advocating use of a small validation sample together with LLM outputs to achieve consistent/precise estimates; no empirical demonstration or sample-size specification provided in the excerpt.
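A minimal sketch of the idea, assuming a simple bias-correction estimator for a mean: average the LLM labels over the full corpus, then add the average gap between human and LLM labels on the validation subset. This is a generic correction chosen for illustration and is not claimed to be the paper's estimator; all data and names are made up.

```python
from statistics import mean

def corrected_mean(llm_all, llm_validation, human_validation):
    """LLM-based mean over all documents, shifted by the measured LLM error.

    llm_all:           LLM labels for every document.
    llm_validation:    LLM labels for the validation subset.
    human_validation:  human labels for the same subset, in the same order.
    """
    if len(llm_validation) != len(human_validation):
        raise ValueError("validation label lists must align")
    bias = mean(h - l for h, l in zip(human_validation, llm_validation))
    return mean(llm_all) + bias

# Toy data: the LLM over-labels the positive class.
llm_all = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
llm_validation = [1, 1, 0, 1]
human_validation = [1, 0, 0, 1]

estimate = corrected_mean(llm_all, llm_validation, human_validation)
print(estimate)
```

In the toy data the raw LLM mean is 0.7 and the validation sample shows the LLM running 0.25 too high, so the corrected estimate lands near 0.45. The validation sample is what ties the cheap LLM labels back to the human measurement.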
The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems.
Claim of contribution in the abstract describing a methodological framework (the excerpt reports the existence of the framework but does not detail empirical validation or sample sizes).
Researchers can now revisit old questions and tackle novel ones with rich data using LLMs.
Asserted in the paper's abstract as a consequence of LLM-enabled large-scale text analysis; no empirical demonstration or quantified case described in the excerpt.
Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost.
Stated as an assertion in the paper's abstract/summary; based on the authors' framing of LLM capabilities (no empirical sample, experiment, or quantified result provided in the excerpt).
All data, code, and model responses are open-sourced.
Statement in the paper asserting that data, code, and model outputs are publicly released.
78.7% of observed AI interactions are augmentation, not automation.
Empirical classification of AI interactions (from cross-referenced Anthropic Economic Index interactions/tasks) reported as a percentage in the paper.
The study cross-references the SAFI benchmark with real-world AI adoption data from the Anthropic Economic Index covering 756 occupations and 17,998 tasks.
Data linkage described in the paper: use of Anthropic Economic Index as real-world AI adoption dataset (numbers reported in text).
The benchmark covers 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy.
Reported dataset construction in the paper: 263 tasks mapped to 35 O*NET skills.
We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash) across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate).
Empirical benchmark executed by the authors: 263 text-based tasks mapped to 35 O*NET skills, 4 LLMs, 1,052 total model calls reported, and reported 0% failure rate.
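The call count follows from the design: each of the four models is run on each of the 263 tasks. A quick check, assuming exactly one call per model-task pair (which the reported total implies):

```python
# Model names as listed in the claim; one call per (model, task) pair is assumed.
models = ["LLaMA 3.3 70B", "Mistral Large", "Qwen 2.5 72B", "Gemini 2.5 Flash"]
num_tasks = 263

total_calls = len(models) * num_tasks
print(total_calls)  # 1052
```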
The paper argues for a fundamental decoupling of semantic intent from human-readable representation.
Conceptual/design claim made by the authors as a recommended shift in representation strategy for agentic consumers; presented as argumentation rather than empirically tested in abstract.
We extend the semantic density principle to propose rehabilitation of classical anti-patterns and introduce the program skeleton concept for agentic code navigation.
Design/position claims and proposed constructs presented in the paper (program skeleton concept and re-evaluation of anti-patterns) without empirical validation reported in abstract.
Aggressive compression reduced input tokens by 17%.
Reported numeric result from the controlled experiment comparing compressed logs to other conditions; sample size not specified in abstract.
We propose a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value.
Proposal/design principle presented in the paper; theoretical justification provided and (per paper) subsequently validated by experiment.
ImplicitMemBench reframes evaluation from 'what agents recall' to 'what they automatically enact'.
Paper framing statement positioning the benchmark's conceptual contribution as shifting evaluation focus to implicit, automatic behavior rather than explicit recall.
Top performers were DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%).
Paper lists top model names with reported overall percentage scores from the benchmark evaluation.
The benchmark's 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.
Paper states the suite size (300 items) and describes a unified Learning/Priming-Interfere-Test protocol and that scoring is done on first attempts.
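First-attempt scoring means only the first response to each test item counts, even if a later retry succeeds. A minimal sketch, assuming a hypothetical log format of `(item_id, attempt_number, is_correct)` tuples; the benchmark's actual data format is not described in the excerpt.

```python
def first_attempt_accuracy(attempts):
    """Score each item by its lowest-numbered attempt only."""
    first = {}
    for item_id, attempt_number, is_correct in attempts:
        if item_id not in first or attempt_number < first[item_id][0]:
            first[item_id] = (attempt_number, is_correct)
    if not first:
        raise ValueError("no attempts to score")
    return sum(correct for _, correct in first.values()) / len(first)

log = [
    ("item-1", 1, True),
    ("item-2", 1, False),
    ("item-2", 2, True),   # the retry succeeds but is ignored
    ("item-3", 1, True),
    ("item-4", 1, False),
]
accuracy = first_attempt_accuracy(log)
print(accuracy)  # 0.5
```

Scoring this way rewards behavior the agent enacts without a corrective prompt, which matches the benchmark's stated focus on automatic rather than recalled behavior.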
ImplicitMemBench operationalizes three cognitively grounded constructs from cognitive science: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (CS-US associations shaping first decisions).
Paper description of benchmark design explicitly listing the three constructs and brief operational definitions for each.