Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
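
As a worked example of reading the matrix, the sketch below converts counts into direction shares for three rows copied from the table. The name `direction_shares` is illustrative and not part of the source. The four direction columns do not always sum to the stated Total (Firm Productivity: 435 + 57 + 88 + 20 = 600 against a Total of 606), so the sketch divides by the stated Total. Why the remainder exists is not stated on the page.

```python
# Counts copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Error Rate": (69, 92, 10, 2, 173),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(counts):
    """Return each direction's share of the stated total, in percent."""
    positive, negative, mixed, null, total = counts
    labels = ("positive", "negative", "mixed", "null")
    values = (positive, negative, mixed, null)
    return {label: round(100 * value / total, 1) for label, value in zip(labels, values)}


for outcome, counts in ROWS.items():
    print(outcome, direction_shares(counts))
```

Read this way, roughly 71.8% of Firm Productivity claims are positive, while roughly 70.8% of Job Displacement claims are negative.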

Filter: Human-AI Collaboration

There is consistent evidence of productivity improvements from generative AI in workplace settings, driven by task automation, decision support, and knowledge augmentation.

Synthesis of findings across the 40 included empirical and conceptual studies (review-level conclusion summarising multiple studies reporting productivity effects).

high positive Generative AI in the Workplace: A Systematic Review of Produ... productivity improvements (via task automation, decision support, knowledge augm...

Under the concurrent AI-assisted decision-making paradigm, the explanatory interface of the AI system significantly improves immediate task performance.

Randomized controlled experiment comparing concurrent vs sequential paradigms and presence/absence of explanatory interface; statistical test reported as 'significantly improves' immediate task performance under concurrent paradigm (N=120 total).

high positive How AI-Assisted Decision-Making Paradigms and Explainability... immediate task performance (task execution stage)

Effective AI governance requires stronger policy capacity, clearer allocation of responsibility, and governance mechanisms that remain robust across divergent technological futures.

Conclusion of the article based on its analysis of uncertainty, adoption dynamics, and framework proposals; grounded in cited policy and scholarly sources.

high positive Governing frontier general-purpose AI in the public sector: ... requirements for effective AI governance (policy capacity, responsibility alloca...

The article proposes an adaptive governance framework for public institutions that integrates capability monitoring, risk tiering, conditional controls, institutional learning, and standards-based interoperability.

Normative framework proposed in the article, derived from the paper's synthesis of foresight reports and governance scholarship.

high positive Governing frontier general-purpose AI in the public sector: ... components and design of an adaptive governance framework for AI

The article reconstructs the conceptual foundations of the 'evidence dilemma', differentiated AI risk categories, and the limits of prediction.

Declared analytic activity in the article, based on synthesis of the International AI Safety Report 2026, OECD foresight, and recent scholarship.

high positive Governing frontier general-purpose AI in the public sector: ... conceptual framing of evidence gaps, AI risk typology, and prediction limits

Public governance for frontier AI should be based on adaptive risk management, scenario-aware regulation, and sociotechnical transformation rather than static compliance models.

Normative recommendation made by the article, supported by conceptual analysis and references to adaptive governance literature and policy documents.

high positive Governing frontier general-purpose AI in the public sector: ... preferred governance approach for frontier AI

Recent evidence indicates that AI capabilities are advancing rapidly, though unevenly.

Statement in article referencing recent empirical/foresight sources, e.g. International AI Safety Report 2026 and OECD foresight documents (sources cited in the paper).

high positive Governing frontier general-purpose AI in the public sector: ... rate and distribution of AI capability advancement

The governance of frontier general-purpose artificial intelligence has become a public-sector problem of institutional design, not merely a technical issue of model performance.

Conceptual argument presented in the article, drawing on synthesis of policy reports (International AI Safety Report 2026, OECD foresight) and scholarship in digital government.

high positive Governing frontier general-purpose AI in the public sector: ... public-sector institutional design requirements for frontier AI governance

We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and target follow-up retrospective assistance.

System description and implementation presented in the paper: an assistant combining egocentric video and gaze overlays to detect potential user difficulties and provide retrospective help.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... system capability (gaze-grounded multimodal assistance)

Gaze-aware LLM assistants can reason about cognitive needs to improve cognitive outcomes of users.

Authors' synthesis and interpretation of controlled-study results (n=36) showing improved recall, perceived accuracy/personalization, and more efficient interactions under the gaze-aware condition.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... cognitive outcomes (e.g., recall) and reasoning about cognitive needs

Users spoke significantly fewer words with the gaze-aware assistant, indicating more efficient interactions.

Behavioral measure recorded during the controlled study (n=36): word count of user speech in gaze-aware vs text-only conditions; authors report a statistically significant reduction in words spoken in the gaze-aware condition.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... number of words spoken by users (conversational length/effort)

The gaze-aware assistant significantly improved people's ability to recall information.

Controlled study (n=36) comparing recall performance between gaze-aware and text-only assistant conditions; authors report a statistically significant improvement in recall for the gaze-aware condition.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... information recall (memory performance)

Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more personalized in its assessments of users' reading behavior.

Between-subjects controlled study (n=36) using user ratings of personalization for the gaze-aware vs text-only assistant; authors report a statistically significant increase in perceived personalization for the gaze-aware condition.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... perceived personalization of assistant assessments

Compared to a conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate in its assessments of users' reading behavior.

Between-subjects controlled study (n=36) comparing user ratings of the gaze-aware assistant vs a text-only LLM; authors report a statistically significant difference in perceived accuracy of assessments.

high positive From Gaze to Guidance: Interpreting and Adapting to Users' C... perceived accuracy of assistant assessments of reading behavior

By extending traditional technology acceptance models (TAM) with AI-specific dimensions—namely transparency, data quality, and trust—this study contributes to the literature on decision-making in complex systems and offers practical insights for organizations seeking to improve decision effectiveness through AI-based support.

Authors' stated contribution in abstract/introduction; conceptual model extension and empirical tests reported in the paper (survey N = 324 and PLS-SEM results).

high positive Decision-Making in Complex Systems Using AI-Based Decision S... conceptual/methodological contribution and practical insights

Intention to adopt AI-DSS demonstrates a strong association with decision-making efficiency (β = 0.544, p < 0.001).

PLS-SEM path coefficient reported in results (β = 0.544, p < 0.001) linking intention to adopt and decision-making efficiency, estimated from survey data (N = 324).

high positive Decision-Making in Complex Systems Using AI-Based Decision S... decision-making efficiency

Perceived usefulness (β = 0.352, p < 0.001), trust (β = 0.311, p < 0.001), and perceived ease of use (β = 0.135, p < 0.05) exert significant positive effects on the intention to adopt AI-DSS.

PLS-SEM path coefficients and significance levels reported for predictors of intention to adopt, based on the questionnaire sample (N = 324).

high positive Decision-Making in Complex Systems Using AI-Based Decision S... intention to adopt AI-DSS

Perceived ease of use significantly affects perceived usefulness (β = 0.597, p < 0.001).

PLS-SEM estimate reported in paper (β = 0.597, p < 0.001) from the survey of 324 respondents.

high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived usefulness of AI-DSS

Trust positively influences perceived ease of use of AI-DSS (β = 0.482, p < 0.001).

PLS-SEM path coefficient reported in results (β = 0.482, p < 0.001) based on the questionnaire sample (N = 324).

high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived ease of use of AI-DSS

Trust positively influences perceived usefulness of AI-DSS (β = 0.229, p < 0.01).

PLS-SEM path coefficient reported in results (β = 0.229, p < 0.01) from the survey data (N = 324).

high positive Decision-Making in Complex Systems Using AI-Based Decision S... perceived usefulness of AI-DSS

Data transparency and quality strongly enhance trust in AI-based decision support systems (AI-DSS) (β = 0.784, p < 0.001).

PLS-SEM estimate reported in results (standardized path coefficient β = 0.784, p < 0.001) based on the survey of 324 respondents.

high positive Decision-Making in Complex Systems Using AI-Based Decision S... trust in AI-based decision support systems
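
The path coefficients quoted in the AI-DSS claims above can be chained to approximate indirect effects using the usual product-of-coefficients rule from path analysis. The sketch below is illustrative only: the dictionary keys are shorthand labels chosen here, and the indirect and total effects it computes are not figures reported in the claims.

```python
# Standardized path coefficients as quoted in the claims above.
# Keys are shorthand labels chosen for this sketch.
PATHS = {
    ("transparency_quality", "trust"): 0.784,
    ("trust", "ease_of_use"): 0.482,
    ("trust", "usefulness"): 0.229,
    ("ease_of_use", "usefulness"): 0.597,
    ("usefulness", "intention"): 0.352,
    ("trust", "intention"): 0.311,
    ("ease_of_use", "intention"): 0.135,
    ("intention", "efficiency"): 0.544,
}


def chain_effect(*nodes):
    """Multiply the coefficients along one path, e.g. A -> B -> C."""
    effect = 1.0
    for src, dst in zip(nodes, nodes[1:]):
        effect *= PATHS[(src, dst)]
    return effect


# Trust reaches perceived usefulness directly and via perceived ease of use.
indirect = chain_effect("trust", "ease_of_use", "usefulness")
total = PATHS[("trust", "usefulness")] + indirect
print(f"indirect={indirect:.3f} total={total:.3f}")
```

Under this rule the indirect route (0.482 × 0.597 ≈ 0.288) is larger than the direct path (0.229), which suggests that much of trust's association with perceived usefulness runs through ease of use. A formal mediation test would need the underlying data.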

Evidence-based frameworks for structural redesign that prioritize network density, decision proximity to information sources, and cross-boundary coordination mechanisms are foundational prerequisites for organizational agility.

Concluding synthesis of reviewed literature and empirical cases leading to proposed frameworks. The provided text labels the frameworks 'evidence-based' but does not present quantitative validation or implementation trial results in the excerpt.

high positive People Don't Follow Strategy—They Follow Structure: Why Orga... organizational agility

The article draws on empirical cases from manufacturing, technology platforms, and healthcare delivery across North America, Europe, and East Asia to support its arguments.

Statement in the article that empirical cases from those sectors and regions were analyzed. The provided text does not specify the number of cases, selection criteria, or methodologies for the case analyses.

high positive People Don't Follow Strategy—They Follow Structure: Why Orga... breadth of empirical support (cross-sector, cross-region cases)

Structural reconfiguration enables adaptive behaviors that resist cultivation under traditional pyramid architectures, regardless of cultural interventions.

Claim derived from comparative analysis and empirical case studies referenced in the article; presented as an observation across cases from multiple industries and regions. No explicit statistical tests or counts reported in the provided text.

high positive People Don't Follow Strategy—They Follow Structure: Why Orga... adaptive behaviors / organizational adaptability

Flattening hierarchies and redistributing authority to operational edges fundamentally rewires information flow, decision velocity, and collaborative patterns.

Argument based on synthesis of research on organizational modularity and structural determinants of behavior; described as supported by empirical cases across sectors (manufacturing, technology platforms, healthcare). No numerical sample sizes or formal experimental details provided.

high positive People Don't Follow Strategy—They Follow Structure: Why Orga... information flow, decision velocity, collaborative patterns

Formal structure—specifically hierarchical configuration and decision-making architecture—exerts greater influence on employee behavior than culture change initiatives or compensation redesign.

Synthesis of organizational behavior, network science, and comparative institutional research cited in the article; stated comparison between structural determinants and culture/incentive interventions. No sample size or statistical details reported in the text provided.

high positive People Don't Follow Strategy—They Follow Structure: Why Orga... employee behavior

Experimental evidence confirms that AI tools raise worker productivity.

Statement in paper referencing experimental studies (no specific study, method, or sample size reported in the excerpt).

high positive The Augmentation Trap: AI Productivity and the Cost of Cogni... worker productivity

A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.

Paper describes an interception layer in the evaluation infrastructure that prevents actual final submissions on production sites.

high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? evaluation_safety (prevention of real-world side effects)
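
To make the idea of blocking only the final submission concrete, here is a minimal sketch. Everything in it is hypothetical: the `Request` and `SubmissionInterceptor` classes, the matching rule (a POST to one known URL), and the example URLs are assumptions for illustration and do not describe ClawBench's actual infrastructure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    method: str
    url: str
    body: str = ""


@dataclass
class SubmissionInterceptor:
    """Forwards all traffic except the one request that would commit the task."""

    final_submit_url: str  # hypothetical: the endpoint that finalizes the task
    captured: list = field(default_factory=list)

    def handle(self, request: Request) -> str:
        is_final = request.method == "POST" and request.url == self.final_submit_url
        if is_final:
            # Keep the payload for scoring, but never send it to the live site.
            self.captured.append(request)
            return "blocked"
        return "forwarded"


layer = SubmissionInterceptor(final_submit_url="https://shop.example.com/checkout/confirm")
outcomes = [
    layer.handle(Request("GET", "https://shop.example.com/cart")),
    layer.handle(Request("POST", "https://shop.example.com/cart/add", "sku=123")),
    layer.handle(Request("POST", "https://shop.example.com/checkout/confirm", "order=1")),
]
print(outcomes, len(layer.captured))
```

The design point is that intermediate reads and writes still hit the live site, so the agent sees realistic behavior, while the captured final payload can be inspected without any real-world effect.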

Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction.

Methodological description in the paper: evaluation occurs on live (production) websites rather than offline static sandboxes; supported by reported coverage of 144 live platforms.

high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? evaluation_realism / fidelity to real-world interactions

The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly.

Paper description of task types and the capabilities they require; based on the design and composition of the 153 tasks.

high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? task_complexity / capability_requirements

ClawBench spans 144 live platforms across 15 categories.

Paper explicitly reports coverage across 144 production websites and 15 task categories (dataset description).

high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? benchmark_scope (platforms and categories)

ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work.

Paper states the benchmark comprises 153 tasks (dataset description).

high positive ClawBench: Can AI Agents Complete Everyday Online Tasks? benchmark_scope (number of tasks)

When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.

Normative conclusion in the abstract based on the paper's proposed framework and discussion; presented as an overall benefit but not supported by empirical outcomes or quantified gains in the excerpt.

high positive Large Language Models: An Applied Econometric Framework expansion of empirical economics research capabilities

For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates.

Methodological claim in the abstract advocating use of a small validation sample together with LLM outputs to achieve consistent/precise estimates; no empirical demonstration or sample-size specification provided in the excerpt.

high positive Large Language Models: An Applied Econometric Framework consistency and precision of downstream estimates derived from LLM-measured vari...
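
One common way to combine LLM outputs with a small validation sample is a difference (bias-corrected) estimator: take the LLM-based mean over the full sample and add the average human-minus-LLM gap measured on the validated subset. The sketch below shows that generic idea with made-up labels. It is not claimed to be the estimator the paper proposes, and `debiased_mean` is a name introduced here.

```python
from statistics import mean


def debiased_mean(llm_all, llm_val, human_val):
    """LLM-only mean over all documents plus the average labeling error
    (human minus LLM) observed on the validation sample."""
    if len(llm_val) != len(human_val):
        raise ValueError("validation labels must be paired")
    correction = mean(h - l for h, l in zip(human_val, llm_val))
    return mean(llm_all) + correction


# Toy data (fabricated for illustration): 1 = concept present, 0 = absent.
llm_all = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # LLM labels for every document
llm_val = [1, 1, 0, 1]                     # LLM labels on the validated subset
human_val = [1, 0, 0, 1]                   # human labels on the same subset

naive = mean(llm_all)
corrected = debiased_mean(llm_all, llm_val, human_val)
print(f"naive={naive:.2f} corrected={corrected:.2f}")
```

In this toy case the LLM over-labels one of four validated documents, so the corrected estimate is pulled below the naive LLM-only mean.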

The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems.

Claim of contribution in the abstract describing a methodological framework (the excerpt reports the existence of the framework but does not detail empirical validation or sample sizes).

high positive Large Language Models: An Applied Econometric Framework methodological framework for empirical use of LLMs

Researchers can now revisit old questions and tackle novel ones with rich data using LLMs.

Asserted in the paper's abstract as a consequence of LLM-enabled large-scale text analysis; no empirical demonstration or quantified case described in the excerpt.

high positive Large Language Models: An Applied Econometric Framework ability to (re)address research questions using textual data

Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost.

Stated as an assertion in the paper's abstract/summary; based on the authors' framing of LLM capabilities (no empirical sample, experiment, or quantified result provided in the excerpt).

high positive Large Language Models: An Applied Econometric Framework ability to analyze text at scale and cost

All data, code, and model responses are open-sourced.

Statement in the paper asserting that data, code, and model outputs are publicly released.

high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... availability of study materials (data, code, responses)

78.7% of observed AI interactions are augmentation, not automation.

Empirical classification of AI interactions (from cross-referenced Anthropic Economic Index interactions/tasks) reported as a percentage in the paper.

high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... share of AI interactions classified as augmentation vs automation

The study cross-references the SAFI benchmark with real-world AI adoption data from the Anthropic Economic Index covering 756 occupations and 17,998 tasks.

Data linkage described in the paper: use of Anthropic Economic Index as real-world AI adoption dataset (numbers reported in text).

high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... occupations and tasks coverage in cross-reference dataset

The benchmark covers 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy.

Reported dataset construction in the paper: 263 tasks mapped to 35 O*NET skills.

high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... coverage of O*NET skills by benchmark tasks

We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate).

Empirical benchmark executed by the authors: 263 text-based tasks mapped to 35 O*NET skills, 4 LLMs, 1,052 total model calls reported, and reported 0% failure rate.

high positive The AI Skills Shift: Mapping Skill Obsolescence, Emergence, ... benchmark coverage and execution success (model calls and failure rate)
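
The reported call count follows from the benchmark design if each model is called once per task, which is an assumption made for this quick consistency check:

```python
# Figures quoted in the claim above.
MODELS = ["LLaMA 3.3 70B", "Mistral Large", "Qwen 2.5 72B", "Gemini 2.5 Flash"]
TASKS = 263
REPORTED_CALLS = 1052
REPORTED_FAILURE_RATE = 0.0

# Assumption: exactly one call per (model, task) pair.
expected_calls = len(MODELS) * TASKS
failed_calls = round(REPORTED_CALLS * REPORTED_FAILURE_RATE)

print(expected_calls == REPORTED_CALLS, failed_calls)
```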

The paper argues for a fundamental decoupling of semantic intent from human-readable representation.

Conceptual/design claim made by the authors as a recommended shift in representation strategy for agentic consumers; presented as argumentation rather than empirically tested in abstract.

high positive Beyond Human-Readable: Rethinking Software Engineering Conve... alignment between semantic intent encoding and human-readable formats

We extend the semantic density principle to propose rehabilitation of classical anti-patterns and introduce the program skeleton concept for agentic code navigation.

Design/position claims and proposed constructs presented in the paper (program skeleton concept and re-evaluation of anti-patterns) without empirical validation reported in abstract.

high positive Beyond Human-Readable: Rethinking Software Engineering Conve... suitability of classical anti-patterns and program skeletons for agentic navigat...

Aggressive compression reduced input tokens by 17%.

Reported numeric result from the controlled experiment comparing compressed logs to other conditions; sample size not specified in abstract.

high positive Beyond Human-Readable: Rethinking Software Engineering Conve... input token count

We propose a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value.

Proposal/design principle presented in the paper; theoretical justification provided and (per paper) subsequently validated by experiment.

high positive Beyond Human-Readable: Rethinking Software Engineering Conve... information/content efficiency of token representations for agentic consumers
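
A toy illustration of the semantic density idea: drop tokens assumed to carry no information for an agent and measure how much shorter the input becomes. The stoplist, the log line, and the whitespace tokenization are all assumptions made for this sketch. Real LLM tokenizers split text differently, and the paper's 17% figure comes from its own experiment, not from this code.

```python
# Hypothetical stoplist of tokens treated as zero-information for an agent.
LOW_INFO = {"INFO", "DEBUG", "the", "a", "is", "was", "successfully"}


def compress(line, low_info=LOW_INFO):
    """Remove low-information tokens, keeping the rest in order."""
    return " ".join(tok for tok in line.split() if tok not in low_info)


def reduction(original, compressed):
    """Fraction of whitespace-delimited tokens removed."""
    before = len(original.split())
    after = len(compressed.split())
    return (before - after) / before


line = "INFO the request to /api/orders was completed successfully in 42 ms"
short = compress(line)
print(short)
print(f"{reduction(line, short):.0%} fewer tokens")
```

The path, the duration, and the verb survive, while the log level and filler words are dropped. Which tokens truly carry "zero information" depends on the consumer, so any real stoplist would need validation against task performance.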

ImplicitMemBench reframes evaluation from 'what agents recall' to 'what they automatically enact'.

Paper framing statement positioning the benchmark's conceptual contribution as shifting evaluation focus to implicit, automatic behavior rather than explicit recall.

high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... evaluation framing / measurement focus

Top performers were DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%).

Paper lists top model names with reported overall percentage scores from the benchmark evaluation.

high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... overall accuracy on the implicit memory benchmark

The benchmark's 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring.

Paper states the suite size (300 items) and describes a unified Learning/Priming-Interfere-Test protocol and that scoring is done on first attempts.

high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... other
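
First-attempt scoring can be sketched as follows: an item counts as correct only if the first response is correct, and any later retries are ignored. The data layout (a list of booleans per item) and the function name are assumptions for illustration, not the benchmark's actual harness.

```python
def first_attempt_score(attempts_per_item):
    """Fraction of items whose first attempt was correct; later attempts are ignored."""
    if not attempts_per_item:
        raise ValueError("need at least one item")
    hits = sum(1 for attempts in attempts_per_item if attempts and attempts[0])
    return hits / len(attempts_per_item)


# Each inner list holds correctness flags for successive attempts on one item.
results = [
    [True],           # correct immediately
    [False, True],    # recovered on retry: still counts as a miss
    [True, True],
    [False, False],
]
score = first_attempt_score(results)
print(score)
```

Scoring this way rewards behavior that is enacted automatically rather than corrected after feedback, which matches the benchmark's stated focus.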

ImplicitMemBench operationalizes three cognitively grounded constructs from cognitive science: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (CS--US associations shaping first decisions).

Paper description of benchmark design explicitly listing the three constructs and brief operational definitions for each.

high positive ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... other
