The Commonplace

Evidence (5539 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
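
As a rough aid to reading the matrix, the sketch below computes each direction's share of a row's Total for four rows copied from the table above. The names `ROWS` and `direction_shares` are illustrative and not part of the site. The `unclassified` field is an assumption about how to label the gap that appears when the four direction counts sum to less than the Total (for example, Firm Productivity: 306 + 39 + 70 + 12 = 427 against a Total of 432).

```python
# Rows copied from the Evidence Matrix:
# (outcome, positive, negative, mixed, null, total)
ROWS = [
    ("Task Completion Time", 88, 5, 4, 3, 100),
    ("Developer Productivity", 34, 4, 3, 1, 42),
    ("Firm Productivity", 306, 39, 70, 12, 432),
    ("Inequality Measures", 25, 77, 32, 5, 139),
]


def direction_shares(row):
    """Return each direction's share of the row's Total.

    'unclassified' is a hypothetical label for claims counted in Total
    but not in any of the four direction columns.
    """
    name, pos, neg, mixed, null, total = row
    return {
        "outcome": name,
        "positive": pos / total,
        "negative": neg / total,
        "mixed": mixed / total,
        "null": null / total,
        "unclassified": total - (pos + neg + mixed + null),
    }


for row in ROWS:
    s = direction_shares(row)
    print(
        f"{s['outcome']}: "
        f"{s['positive']:.1%} positive, {s['negative']:.1%} negative, "
        f"{s['unclassified']} unclassified"
    )
```

Shares are taken against the Total column rather than the sum of the four directions, so they stay comparable with the totals shown in the table even when a row has a gap.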
Filter: Adoption
Agents that attempt to infer others' reasoning depth may be vulnerable to strategic misrepresentation (partners could act so as to induce incorrect Theory-of-Mind (ToM) estimates).
Conceptual analysis in the paper and discussion of strategic incentives; paper also identifies the risk and suggests potential mitigations (e.g., conservatism, verification, meta-reasoning).
medium | negative | Adaptive Theory of Mind for LLM-based Multi-Agent Coordinati... | vulnerability to strategic manipulation (qualitative risk and proposed mitigatio...
Both too little and too much recursive reasoning (i.e., too shallow or too deep ToM) can produce poor joint behavior — miscalibrated anticipation harms coordination.
Observed non-monotonic effects in the reported experiments where fixed-order agents at either low or high ToM orders performed worse in mismatched pairings; evidence comes from the same multi-environment evaluation using joint-payoff / success-rate metrics.
medium | negative | Adaptive Theory of Mind for LLM-based Multi-Agent Coordinati... | coordination performance (joint payoff, success rate)
Misalignment in Theory-of-Mind (ToM) order between agents (i.e., agents using different recursive reasoning depths) degrades coordination performance.
Empirical experiments using LLM-driven agents with configurable ToM depth across four coordination environments (a repeated matrix game, two grid navigation tasks, and an Overcooked task); comparisons of matched (same-order) vs mismatched (different-order) pairings using task-specific joint payoffs and success rates as metrics.
medium | negative | Adaptive Theory of Mind for LLM-based Multi-Agent Coordinati... | coordination performance (joint payoff, task success rate, task completion/time)
There is a risk of manipulation and misinformation if argument mining/synthesis is unregulated or misaligned with social incentives, creating externalities that may justify public intervention.
Conceptual risk assessment combining known misinformation dynamics and AI capabilities; no empirical incident data provided.
medium | negative | Argumentative Human-AI Decision-Making: Toward AI Agents Tha... | incidence of manipulation/misinformation attributable to argument-mining/synthes...
Increased error risk and weaker explainability from generative legal AI (GLAI) will raise malpractice and liability exposure for firms and lawyers, driving up insurance and compliance costs.
Legal-risk analysis and economic reasoning connecting explainability/liability to insurance costs; no empirical cost studies presented.
medium | negative | Why Avoid Generative Legal AI Systems? Hallucination, Overre... | malpractice/liability exposure levels and associated insurance/compliance costs
The combination of hallucination and professional overreliance strains existing regulatory goals (e.g., explainability, human oversight) within European AI governance frameworks.
Legal and regulatory analysis mapping technical and behavioral risks onto European AI governance goals; references to statutory/regulatory texts and policy debates. Qualitative argumentation rather than empirical test.
medium | negative | Why Avoid Generative Legal AI Systems? Hallucination, Overre... | compatibility between GLAI deployment dynamics and regulatory obligations (e.g.,...
Fabricated or opaque intermediate data and reasoning in GLAI weaken explainability, making it difficult to provide meaningful explanations about how outputs were produced.
Conceptual analysis of token-prediction architectures, literature on explainability limits of LLMs, and legal/regulatory analysis referencing explainability requirements. No empirical measurement.
medium | negative | Why Avoid Generative Legal AI Systems? Hallucination, Overre... | quality/meaningfulness of explanations about model outputs (explainability)
Hallucinated content produced by GLAI is often linguistically fluent and persuasive, increasing the risk that legal professionals will accept it without verification.
Literature synthesis on model fluency and behavioral literature on trust in coherent authoritative outputs, plus illustrative vignettes. No original experimental data or sample size.
medium | negative | Why Avoid Generative Legal AI Systems? Hallucination, Overre... | rate of professional acceptance or uncritical reliance on fluent but incorrect o...
This architectural mismatch (token-prediction vs. formal legal reasoning) contributes to confident but factually incorrect outputs (hallucinations) in GLAI.
Technical/conceptual analysis plus synthesis of existing literature on hallucinations in generative models; illustrative examples and vignettes provided. No primary empirical measurement in the paper.
medium | negative | Why Avoid Generative Legal AI Systems? Hallucination, Overre... | incidence and nature of hallucinated (factually incorrect) outputs produced by G...
Observed failure modes during the workflow included hypothesis creep, definition-alignment bugs (mismatch between informal and formal definitions), and agent avoidance behaviors (agents delegating or failing to complete tasks).
Qualitative analysis and post-mortem reported in the paper based on the single project workflow and logs; specific failure modes enumerated by authors from their process observations.
medium | negative | Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau E... | presence and types of failure modes observed in the workflow (hypothesis creep, ...
Absence of governance and observability could increase social costs of accidents and induce conservative regulation that stifles beneficial adoption.
Policy reasoning and historical regulatory responses to systemic risks; conceptual projection without quantitative modeling of regulatory impact.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | social cost of accidents, regulatory restrictiveness, adoption rates
Strong proprietary stacks and incompatible protocols could create winner‑take‑all or oligopolistic market outcomes due to network effects and switching costs.
Market‑structure theory and historical platform examples (e.g., dominant tech platforms); argument is conceptual and not backed by new empirical market analysis in the paper.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | market concentration (e.g., market share distribution), barriers to entry
Without these architectural commitments, the economic costs — stranded assets, safety incidents, reduced innovation, and high coordination costs — will be substantial.
Predictive economic argument built from historical IoT/Internet lessons and systems reasoning; no quantitative cost estimates or econometric analysis in the paper.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | economic costs: stranded assets, safety incident frequency, innovation rates, co...
Poor governance and observability in agent networks would make accountability, certification, and regulation difficult.
Policy and governance reasoning with illustrative domain examples; conceptual argument without empirical governance case studies or metrics.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | ease of accountability/certification/regulation; observability coverage
Weak or brittle security and trust mechanisms across distributed agent ecosystems will pose serious risks.
Lessons drawn from IoT security failures and conceptual threat analysis; no new penetration testing or security metrics presented.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | security/trust robustness of agent ecosystems (vulnerabilities, compromise rates...
Lifecycle mismatch — rapidly evolving AI software embedded in long‑lived physical assets — risks premature ossification or expensive retrofits.
Systems engineering reasoning and historical analogies to embedded systems/IoT lifecycles; no quantitative lifecycle modeling or case study data in the paper.
medium | negative | The Internet of Physical AI Agents: Interoperability, Longev... | frequency/cost of ossification and expensive retrofits; expected upgrade cost
Top-performing community submissions (including baselines and competition entries) still leave a performance gap relative to elite human play on battling tasks.
Paper reports comparative evaluation results showing win-rate and other metrics for heuristic, RL, LLM baselines and community submissions versus human (elite) benchmarks; analysis highlights a remaining gap.
medium | negative | The PokeAgent Challenge: Competitive and Long-Context Learni... | performance gap measured primarily by win-rate (Battling) and strategic robustne...
Misalignment or poor meta-control could produce persistent unsafe behaviors in autonomous learners; governance and oversight mechanisms will be crucial.
Risk analysis based on conceptual failure modes for meta-control; no empirical incidents reported in the paper.
medium | negative | Why AI systems don't learn and what to do about it: Lessons ... | frequency and severity of unsafe behaviors; successful governance interventions
Current models transfer poorly across domains, are brittle in nonstationary environments, and are inefficient in physical/embodied tasks.
Synthesis of known challenges from prior literature and practical experience; paper cites these as motivating observations rather than reporting new data.
medium | negative | Why AI systems don't learn and what to do about it: Lessons ... | cross-domain generalization; robustness under nonstationarity; sample efficiency...
Current models have limited meta-control and do not autonomously decide when to explore, imitate, consult prior knowledge, or consolidate.
Conceptual critique based on typical ML training pipelines and limited on-line decision-making modules; no empirical tests in paper.
medium | negative | Why AI systems don't learn and what to do about it: Lessons ... | autonomy in meta-decisions (e.g., fraction of exploration/imitative acts chosen ...
There is weak integration between passive observation (supervised/representation learning) and active experimentation (reinforcement/exploratory learning) in current systems.
Observation of methodological separation in current literature and systems; conceptual discussion in the paper.
medium | negative | Why AI systems don't learn and what to do about it: Lessons ... | performance on mixed observation-action tasks; ability to combine passive and ac...
Current AI models lack the architectures and control mechanisms required for sustained, autonomous learning in dynamic real-world settings.
Conceptual/theoretical analysis presented in the paper; synthesis of limitations observed in existing literature and practices (no new empirical data provided).
medium | negative | Why AI systems don't learn and what to do about it: Lessons ... | ability to sustain autonomous learning in dynamic real-world environments
Attribution (labeling responses as AI) can alter perceived empathy and therefore matters for product design, branding, and disclosure policy decisions.
Findings from the attribution effect experiment showing reduced feelings of being heard/validated when replies are labeled AI despite identical content; authors discuss implications for product design and disclosure.
medium | negative | Practicing with Language Models Cultivates Human Empathic Co... | recipient-rated perceptions (being heard/validated) and inferred implications fo...
Public‑interest concerns (bias, misuse, systemic risk) may be harder to mitigate via simple transparency rules; policies should emphasize outcome‑based regulations, mandatory behavioral testing, and marketplace disclosure obligations for stressed scenarios.
Policy implication derived from the non‑rule‑encodability thesis; no empirical policy evaluation included.
medium | negative | Why the Valuable Capabilities of LLMs Are Precisely the Unex... | effectiveness of transparency-based vs outcome-based regulatory approaches
Standard contracts and regulatory audits that rely on inspection of rule sets or source code will be insufficient to assess model behavior or risk; regulators and buyers must rely more on behavior‑based testing, standards, and outcome measures.
Policy and regulatory argument derived from the main theorem about non‑rule‑encodability; no empirical regulatory studies presented.
medium | negative | Why the Valuable Capabilities of LLMs Are Precisely the Unex... | effectiveness of rule‑based audits/regulatory inspections for assessing model ri...
Full interpretability via rule extraction may be impossible for the most valuable parts of LLM competence, limiting the utility of some transparency approaches for safety and auditing.
Argumentative consequence of the main theoretical claim and structural mismatch; supported by historical limitations of rule‑based systems; no empirical tests reported.
medium | negative | Why the Valuable Capabilities of LLMs Are Precisely the Unex... | feasibility of fully extracting human‑readable rules from LLMs (interpretability...
There is a structural mismatch between explicit human cognitive tools (rules, checklists) and the pattern‑rich, high‑dimensional competence encoded in LLMs.
Theoretical/structural argument about distributed statistical representations in LLMs versus discrete rules; no experimental quantification provided.
medium | negative | Why the Valuable Capabilities of LLMs Are Precisely the Unex... | alignment/mismatch between human‑readable rules and LLM representations/competen...
Historical expert systems failed to generalize or scale to complex, ambiguous tasks, contrasting with LLMs' broader empirical successes.
Historical case analysis and literature review-style discussion of expert systems versus contemporary LLM performance; no new quantitative historical dataset provided.
medium | negative | Why the Valuable Capabilities of LLMs Are Precisely the Unex... | generalization and scalability of rule‑based expert systems
Existing idea-evaluation approaches (LLM judges or human panels) are subjective and disconnected from real research outcomes.
Framing and motivation in the paper arguing current approaches rely on subjective judgments and do not directly tie to later publication/citation outcomes; supported implicitly by the empirical mismatch (LLM-judge vs HindSight).
medium | negative | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | Degree of alignment between evaluative judgments (LLM/human) and later real-worl...
LEAFE's benefits depend on informative, actionable feedback; environments with noisy or adversarial feedback may limit improvements.
Limitations stated in the paper noting sensitivity to feedback quality; conceptual reasoning that the method relies on extracting actionable signals from environment feedback.
medium | negative | Internalizing Agency from Reflective Experience | Change in Pass@k or recovery performance under degraded/noisy feedback (qualitat...
Outcome-driven post-training (optimizing final rewards) underutilizes rich environment feedback and causes 'distribution sharpening' — policies overfit a narrow set of successful behaviors and fail to broaden problem-solving/recovery capacity in long-horizon settings.
Problem diagnosis in the paper supported by comparison of outcome-driven RL (GRPO) performance versus LEAFE and by conceptual argument about how optimizing final success signals can narrow behavioral support; supported by empirical observations of poorer recovery/generalization in baselines.
medium | negative | Internalizing Agency from Reflective Experience | Breadth of problem-solving/recovery capacity (inferred from failure modes and Pa...
Rotation-based post-training quantization (PTQ) methods (designed for integer formats) fail on MXFP4 because global orthogonal rotations move outlier energy across quantization blocks, creating new outliers and often producing bimodal activations that underutilize the limited MXFP range.
Analytical argument backed by empirical observations reported in the paper: activation-distribution analysis demonstrating cross-block outlier propagation and bimodality when applying global orthogonal rotations to MXFP4-blocked layouts; comparisons to performance collapse under those methods.
medium | negative | BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... | Activation distribution characteristics (outlier propagation, bimodality) and re...
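
The cross-block effect described in the BATQuant claim above can be illustrated with a toy example. The sketch below is not the paper's method or data. It assumes a normalized Hadamard matrix as a stand-in for a "global orthogonal rotation", a 64-element activation vector with a single outlier, and 32-element blocks (the block size used by MX formats). The helper names `hadamard`, `rotate`, and `block_max` are hypothetical.

```python
import math

BLOCK = 32  # assumed block size: MX formats share one scale per 32 elements


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    return h


def rotate(x):
    """Apply the orthogonal rotation H / sqrt(n) to the whole vector."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(h * v for h, v in zip(row, x)) for row in hadamard(n)]


def block_max(x, block=BLOCK):
    """Largest magnitude in each block; this drives the per-block shared scale."""
    return [max(abs(v) for v in x[i:i + block]) for i in range(0, len(x), block)]


# One outlier in block 0; block 1 holds only small values.
x = [100.0] + [1.0] * 63

before = block_max(x)
after = block_max(rotate(x))

print("per-block max before rotation:", before)  # [100.0, 1.0]
print("per-block max after rotation: ", after)   # [20.375, 12.375]
```

In this toy case the rotation lowers the peak in the block that held the outlier, but it raises the peak of the previously well-behaved block from 1.0 to 12.375, so that block's shared scale has to grow as well. This shows only the energy-spreading part of the claim. It does not reproduce the bimodal activation distributions the paper reports.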
High governance costs in regulated/high-risk domains can slow adoption of agentic systems, concentrating deployment in less regulated uses or among large firms that can afford governance infrastructure.
Economic reasoning about fixed and marginal governance costs and firm-level adoption decisions; no empirical adoption data presented.
medium | negative | Runtime Governance for AI Agents: Policies on Paths | rate of adoption of agentic systems across firm sizes and regulated domains
Path-dependent behavior increases the complexity of principal–agent contracting and moral hazard between platforms, enterprise customers, and downstream users, requiring richer contract terms (acceptable paths, logging, audit rights).
Economic theory reasoning and applied contract/design implications discussed; no empirical contract-study data.
medium | negative | Runtime Governance for AI Agents: Policies on Paths | complexity of contractual arrangements (number/complexity of contract clauses or...
Path-dependent policies complicate ex post auditing and simple rule-based regulation; regulators may prefer standards requiring runtime evaluation and logging to be enforceable in practice.
Conceptual argument about limits of auditing when important state is ephemeral and about how runtime logging enables ex post review; illustrative policy examples mapping to runtime requirements.
medium | negative | Runtime Governance for AI Agents: Policies on Paths | enforceability of regulation (ease of ex post compliance verification)
Outdated or inconsistent facts—especially when visual inputs are involved—can reduce user trust, raise liability risks, and increase oversight costs in high-stakes domains.
Argumentative implications in the paper linking empirical findings (outdated/inconsistent outputs) to downstream product risk, trust, and oversight cost concerns; not directly measured empirically.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | projected impacts on trust, liability, and oversight costs (qualitative)
Static-training regimes create recurring economic costs: organizations must choose between expensive retraining/continuous fine-tuning and engineering around external retrieval/RAG systems to keep facts current.
Analytic discussion in paper on maintenance costs and trade-offs; economic argumentation rather than primary empirical measurement.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | economic maintenance cost trade-offs (qualitative analysis)
Multimodal retrieval-augmented generation (RAG) designs that condition models on retrieved, time-stamped external evidence do not guarantee cross-modal propagation of updated facts.
Experiments implementing multimodal RAG pipelines where models are conditioned on retrieved, time-stamped evidence; evaluation shows that retrieved evidence does not always override outdated internal knowledge across both text and image prompts.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | effectiveness of RAG in updating model outputs across modalities
Knowledge-editing procedures (parameter edits or local fine-tuning) often fail to reliably change the model’s factual outputs for both text and image inputs.
Experimental application of knowledge-editing techniques with measurement of post-edit correctness for both modalities; reported inconsistent or partial success in updating facts.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | post-edit correctness / update success rate across modalities
Factual correctness and consistency are lower for visual stimuli even when the visual input correctly identifies the entity.
Paired tests where images correctly depict/identify the target entity while the model still produces incorrect or inconsistent factual attributes; correctness and consistency metrics reported per modality.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | modality-specific factual correctness and cross-modal consistency
Model responses vary with minor input perturbations (paraphrases, image occlusion/cropping/filters), revealing robustness issues in time-sensitive factual representation.
Controlled input perturbations included in the benchmark (paraphrases, image edits); evaluation of consistency/stability metrics across perturbations showing variability in answers.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | consistency / stability of answers under input perturbations
Existing techniques for editing or augmenting model knowledge (including multimodal retrieval/RAG and alignment methods) do not reliably update knowledge across modalities.
Experiments applying knowledge-editing procedures, multimodal RAG pipelines, and alignment/instruction-tuning interventions, with measurement of update efficacy (update success rate) across text and image inputs; reported inconsistent propagation of updated facts.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | update efficacy / update success rate across modalities
Factual reliability degrades when the same fact is presented visually rather than textually (a modality gap).
Paired multimodal stimuli (text prompts and images referencing the same entity/time) evaluated on the benchmark; comparison of correctness and consistency metrics across modalities showing lower performance for visual inputs.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | modality-specific correctness and cross-modal consistency
Current vision-language models commonly produce outdated factual answers because they are trained on static data snapshots.
Empirical evaluation on the V-DyKnow benchmark: model predictions compared to current ground-truth facts using the curated time-sensitive item set; multiple off-the-shelf VLMs tested. Metrics include correctness/accuracy relative to up-to-date ground truth.
medium | negative | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | correctness (accuracy) of model answers vs current ground-truth facts
There is evidence that some safeguards or behavioral guardrails may degrade over multi-turn dialogues (i.e., safety mechanisms become less effective in extended interactions).
Authors' analyses and examples showing emergent chatbot behaviors and increased incidence of problematic codes over longer conversations; qualitative/code-based observations noted.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | apparent effectiveness of safety behaviors/guardrails as a function of conversat...
Certain harmful dynamics—notably declarations of romantic interest and chatbot claims of sentience—are more frequent in longer, multi-turn interactions, suggesting multi-turn engagement can worsen risk.
Observed association between conversation length and higher incidence of these codes from longitudinal/co-occurrence analyses across the coded corpus.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | incidence of harmful dynamics (romantic interest, chatbot sentience claims) rela...
Co-occurrence and longitudinal analyses show that topics such as users' romantic declarations and chatbots' claims of sentience occur disproportionately in longer conversations.
Analyses described in paper: co-occurrence matrices and conversation-length (longitudinal) analyses correlating code incidence with conversation length.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | frequency of specified codes (romantic declarations, chatbot sentience claims) a...
21.2% of chatbot messages included misrepresentations of sentience (chatbot-presenting-as-sentient).
Quantitative coding of chatbot messages reporting 21.2% prevalence for the 'chatbot-presenting-as-sentient' code.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | presence of chatbot sentience-claim content in chatbot messages (coded proportio...
69 user messages were validated as expressing suicidal thoughts.
Manual coding with validation step reported for suicidal ideation items; count of validated suicidal messages given as 69.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | count of validated user messages expressing suicidal ideation
15.5% of user messages exhibited delusional thinking according to the applied code.
Quantitative coding results reported after manual annotation of user messages across the corpus; prevalence percentage reported as 15.5%.
medium | negative | Characterizing Delusional Spirals through Human-LLM Chat Log... | presence of delusional thinking in user messages (coded proportion)