The Commonplace

Digest: 2026-03-30

The Big Picture

This week’s papers sharpen the duality at the heart of artificial intelligence (AI) economics. On one side, well-structured deployments raise output: orchestrated agent platforms compress software delivery effort, firm-level AI adoption lifts patenting and R&D with a measurable total factor productivity (TFP) bump, and targeted coaching upgrades human soft skills. On the other, frictions and faulty yardsticks bite: AI-mediated video erodes interpersonal trust without hurting accuracy, leaderboard contamination flatters model skill, and autonomous agents exhibit researcher-sized nonstandard errors unless channeled by exemplars and governance.

The throughline is simple: the gains are real, and they materialize when organizations invest in complements—platform orchestration, R&D capacity, training, and rigorous evaluation—and wither when organizations rely on raw model scores or ungoverned agents. Bottom line: treat AI as an institutional design problem, not an automatic multiplier—build the complements and fix the measurement if you want the productivity.

Top Papers

  • AI-mediated video makes speakers look less trustworthy and lowers viewers’ confidence (preregistered lab experiments, high evidence)
    - In two preregistered experiments (N = 2,000), retouching, virtual backgrounds, and avatars reduce perceived trust in speakers and lower viewers’ confidence in their judgments, even though lie-detection accuracy remains unchanged. The effect intensifies in mixed-avatar settings. This disconnect between objective accuracy and subjective trust raises deployment costs for remote work, sales, healthcare, and courts, and requires disclosure and design standards that preserve social trust.

  • Modern AI models outperform classic statistics in predicting employee performance across public datasets (cross-dataset benchmarking, medium–high evidence)
    - Ensembles and deep neural networks beat traditional statistical models on employee-performance prediction across multiple workforce datasets using cross-validation and holdout tests. Performance advantages persist in cross-company transfer, and engagement, learning agility, tenure, and workload emerge as consistent signals. The predictive gap justifies modern machine learning (ML) investment in human resources (HR) while increasing the need for fairness audits and governance as data-driven personnel decisions scale.

  • Sampling algorithm finds approximate proportional-veto core from infinite alternatives with optimal sample complexity (theory with proofs and synthetic validation, high evidence)
    - A constructive, sampling-based algorithm reliably returns an alternative in the approximate proportional veto core using only query access, with matching information-theoretic lower bounds on sample complexity. Synthetic tests confirm robust recovery. This equips platforms that use generative models to surface proposals (policies, product concepts) with a principled way to find options that withstand group vetoes at feasible cost.

  • Firm-level AI adoption raises patenting, patent quality, R&D and implies a measurable TFP uplift (quasi-experimental difference-in-differences, high evidence)
    - A stacked difference-in-differences (DiD) design on staggered AI product installations shows adopters increase patent counts, citations, and claims, and raise R&D and productivity, implying a representative-year aggregate total factor productivity (TFP) uplift of about 1.51%. Post-adoption patents tilt toward exploitative innovations that build on incumbents’ strengths. The result establishes an innovation channel from AI to productivity and underscores the value of pairing adoption with R&D capability.

  • Leaderboard gains overstate large language model skill because benchmark contamination inflates accuracy (audit/measurement study, high evidence)
    - A contamination audit across six frontier models finds 13.8% lexical overlap on the Massive Multitask Language Understanding (MMLU) benchmark (18.1% in science, technology, engineering, and mathematics), with contamination boosting category accuracy by 0.03–0.054 absolute points. Paraphrase sensitivity and behavioral probes corroborate memorization. Public leaderboards exaggerate capability without contamination audits and unseen test sets, distorting procurement and policy.

  • Brief, personalized large language model coaching increases use of empathic expressions but AI attribution reduces recipient validation (preregistered randomized controlled trial, high evidence)
    - In a randomized trial (968 participants; 2,904 conversations), brief personalized large language model (LLM) coaching improves alignment with empathic communication norms; LLM-generated replies often rate more empathic than human ones when blinded. Labeling identical replies as AI reduces recipients’ sense of being heard. Organizations obtain cheap gains in communication quality, but disclosure and stigma management determine downstream acceptance.

  • Embedding AI agents in an orchestrated delivery platform cuts modeled effort and validation issues across real modernization programs (retrospective field study, medium evidence)
    - A longitudinal field study of an orchestrated platform (Chiron) across three modernization programs shows modeled person-days fall from 1,080.0 to 232.5 and senior-equivalent effort to 139.5, while validation-stage issues drop from 8.03 to 2.09 per 100 tasks. Integrating agents into staged workflows, not ad hoc tooling, delivers system-level efficiency and quality improvements worth piloting in large engineering organizations.

  • AI coding agents produce large inter-agent methodological variation; exemplar exposure induces imitation rather than genuine methodological improvement (large-scale agent experiment, high evidence)
    - Running 150 autonomous coding agents on identical financial datasets reveals wide dispersion in methods and estimates—nonstandard errors on par with human researcher heterogeneity. Exposing agents to exemplar papers compresses dispersion by up to 99% within measure families, but through imitation rather than genuine methodological improvement. Auditability and reproducibility require multi-instance evaluation, provenance logs, and governance mechanisms that steer agents toward defensible analytic choices.
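The nonstandard-error idea in the last bullet can be illustrated with a toy computation (all numbers here are simulated for illustration, not from the paper): run many "agents" on the same data, then measure the cross-agent standard deviation of their reported estimates before and after exemplar-induced convergence.

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.5  # hypothetical quantity every agent tries to estimate


def agent_estimate(method_spread: float) -> float:
    """One agent's estimate: the truth plus noise from its methodological choices."""
    return TRUE_EFFECT + random.gauss(0, method_spread)


# 150 autonomous agents, each making independent methodological choices.
free_runs = [agent_estimate(method_spread=0.20) for _ in range(150)]

# After exemplar exposure, agents imitate one template, shrinking dispersion.
exemplar_runs = [agent_estimate(method_spread=0.002) for _ in range(150)]

nse_free = statistics.stdev(free_runs)        # "nonstandard error" across agents
nse_exemplar = statistics.stdev(exemplar_runs)

print(f"cross-agent dispersion, free agents:    {nse_free:.3f}")
print(f"cross-agent dispersion, with exemplars: {nse_exemplar:.3f}")
print(f"dispersion compressed by {100 * (1 - nse_exemplar / nse_free):.0f}%")
```

The point of a multi-instance protocol is exactly this second moment: a single agent run reports an estimate, while the spread across runs reports how much the answer depends on the agent's unmodeled choices.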

Emerging Patterns

  • Productivity, innovation, and adoption
    - Across empirical and field evidence, AI raises output when paired with the right scaffolding: adopters file more and better patents, platforms that orchestrate agents shrink delivery effort and defects, and literature reviews identify consistent 15–50% task-level gains. These wins do not flow through to macro aggregates without diffusion and reorganization; the review records muted labor-market disruption, while firm-level DiD establishes a clear innovation channel. The tension is adoption scope and complements: R&D capacity and platformization translate task gains into firm productivity; economy-wide impact waits on broader uptake and managerial redesign.

  • Human–AI collaboration, skills, and training
    - Targeted guidance moves the needle for both people and agents: short LLM coaching upgrades empathy, teamwork training improves delegation strategies, and compact domain artifacts (for example, SKILL.md in telecommunications) lift agent performance. Plug-in skills are not universal accelerants—broad skill injections in software engineering benchmarks often deliver no gains, underscoring domain specificity and interface stability as the real levers. Human–AI teams outperform AI-only setups on complex, situated tasks when roles and rules are explicit; without structure, agent variability and drift erode reliability.

  • Trust, governance, and evaluation robustness
    - Social frictions and brittle metrics now shape deployment economics as much as raw capability. AI mediation reduces interpersonal trust and confidence even when accuracy holds, contamination inflates leaderboard status, and confirmation bias in AI-assisted code review opens the door to adversarial errors. Conformal methods provide claim-level guarantees but trade utility for coverage and falter under distributional shift, reminding readers that formalism requires calibration to real distributions. The throughline: provenance, disclosure, and robust, shift-aware evaluation are prerequisites for credible adoption.

  • Methods, benchmarks, and agent heterogeneity
    - The field upgrades its yardsticks and algorithms: sample-optimal social-choice procedures for infinite options, dynamic knowledge benchmarks, pinned-repo engineering tests, and multi-agent reproducibility protocols. These tools expose what single-run evaluations hide—large nonstandard errors across agents and time-sensitive factual drift—and anchor claims in realistic conditions. Provable guarantees set targets; deployment utility depends on relaxing ideal assumptions and engineering for noisy, shifting environments.

Claims to Watch

  • Trust tax on mediated communication
    - AI-mediated video lowers perceived trust and audience confidence without reducing accuracy (two preregistered experiments, N = 2,000).
    - Implication: Remote-first firms and platforms should default to minimal mediation, require disclosure, and establish design norms that offset the trust deficit.

  • Leaderboards are noisier than they look
    - Benchmark contamination inflates LLM accuracy by measurable margins (13.8% lexical overlap overall; audit across six frontier models).
    - Implication: Procurement and regulation should mandate contamination audits and private test sets; treat public scores as upper bounds.

  • Agent results are not point estimates
    - Autonomous agents exhibit researcher-sized nonstandard errors; exemplar exposure compresses dispersion via imitation rather than methodological improvement (150-agent study).
    - Implication: Require multi-instance evaluations, method provenance, and ensemble or exemplar-governed workflows in regulated and high-stakes use.

  • AI adoption drives an innovation-led TFP lift
    - Staggered difference-in-differences links firm AI adoption to higher patenting, better patents, more R&D, and a ~1.51% representative-year total factor productivity (TFP) uplift.
    - Implication: Target industrial policy at complementary R&D and diffusion mechanisms to convert early-firm gains into sectoral productivity.

  • Coaching beats disclosure in soft skills
    - Brief LLM coaching boosts empathic communication, but labeling identical replies as AI reduces recipient validation (preregistered randomized controlled trial).
    - Implication: Use AI for skills training and drafting, but manage attribution carefully in user-facing interactions to preserve acceptance.
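The lexical-overlap audits behind the leaderboard claim above can be sketched as n-gram containment between benchmark items and a candidate training corpus. A minimal sketch, with toy strings and a hypothetical n-gram size standing in for whatever the actual audit used:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_rate(benchmark_items: list, corpus_text: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = ngrams(corpus_text, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return hits / len(benchmark_items)


# Toy example: one of three "benchmark questions" echoes the corpus verbatim.
corpus = ("the mitochondria is the membrane bound organelle that produces most "
          "of the chemical energy needed to power the cell s reactions")
items = [
    "the mitochondria is the membrane bound organelle that produces most of the atp",
    "which planet in the solar system has the largest number of confirmed moons",
    "what is the capital city of the island nation of madagascar in the indian ocean",
]
print(f"contaminated share: {overlap_rate(items, corpus):.2f}")  # → contaminated share: 0.33
```

Lexical containment like this is only the first signal; as the audit notes, paraphrase sensitivity and behavioral probes are needed to separate memorization from genuine capability.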

Methods Spotlight

  • Sampling-based proportional veto core (Finding Common Ground in a Sea of Alternatives)
    - A query-only, sample-optimal algorithm brings core stability to infinite alternative spaces—directly useful for generative proposal systems and multi-stakeholder governance.

  • Multi-agent nonstandard error protocol (Nonstandard Errors in AI Agents)
    - Running hundreds of autonomous analyses on identical data quantifies agent-induced variability and tests governance levers—setting a new bar for reproducibility and audit.

  • Conformal factuality for retrieval-augmented generation and large language model verification (Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights)
    - Distribution-free calibration at the claim level ties retrieval-augmented generation (RAG) and large language model (LLM) verification to statistical guarantees; future work must adapt these methods to distributional shift and utility constraints.
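The distribution-free calibration idea can be sketched with split conformal prediction (toy scores and a simulated verifier, not the paper's actual method): calibrate a threshold on held-out claims so that claims retained at test time meet a target coverage level 1 − α, under exchangeability.

```python
import math
import random

random.seed(1)

ALPHA = 0.1  # target error rate: retain claims at roughly 90% coverage

# Toy calibration data: one nonconformity score per held-out factual claim
# (here, 1 minus a simulated verifier confidence; lower is better).
cal_scores = sorted(1 - random.betavariate(8, 2) for _ in range(500))

# Split-conformal threshold: the ceil((n+1)(1-alpha))-th order statistic.
n = len(cal_scores)
k = math.ceil((n + 1) * (1 - ALPHA))
threshold = cal_scores[min(k, n) - 1]

# At test time, keep only claims whose score clears the calibrated threshold.
test_scores = [1 - random.betavariate(8, 2) for _ in range(200)]
kept = [s for s in test_scores if s <= threshold]
print(f"threshold = {threshold:.3f}, kept {len(kept)}/{len(test_scores)} claims")
```

The guarantee is marginal over exchangeable claims, which is precisely where the paper presses: under distribution shift the calibration set no longer matches deployment, so recalibration on in-domain data (and an explicit utility budget for how many claims get dropped) is part of any real rollout.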

The Week Ahead

  • Separate task-level boosts from firm-level impact and fund the complements—R&D, process redesign, and platform orchestration—that convert local gains into total factor productivity (TFP).
  • Tighten evaluation: require contamination audits, unseen test sets, and multi-instance agent runs in procurement and regulated deployments.
  • Pilot compact, domain-specific guidance artifacts and short human training interventions; scale only after context-specific validation.
  • Build trust-by-design into human-facing AI: disclosure norms, minimal mediation defaults, and user experience design to counter attribution penalties.
  • Stress-test formal guarantee methods under realistic shift and distractors before writing them into standards or service-level agreements.

Reading List