The Commonplace

Digests

2026-03-30

The Big Picture

This week’s research tells a single, bracing story: artificial intelligence (AI) delivers big gains when it is curated, calibrated, and constrained—and it backfires when it is not. Field and firm evidence show material productivity and innovation uplifts: AI‑guided irrigation boosts yields 35% while slashing inputs; adopters file more and better patents, implying a 1.5% lift to total factor productivity (TFP). But the upside is fragile. Frontline accuracy rises only when the assistant is highly accurate; bad guidance drags people down. Soft‑skill coaching improves empathic communication, yet recipients devalue identical text when it is labeled “AI.” Benchmarks overstate capability due to contamination; agents perform well on polluted tests and fail when the test is clean or scaffolds vary.

The throughline is operational discipline. Human‑in‑the‑loop workflows, accuracy gating, domain‑specific skills, and clean measurement separate value from vapor. Policy framing also shapes adoption: ambiguity about privacy risk suppresses uptake more than a known moderate risk—demonstrating that the message is as decisive as the model.

Bottom line: AI’s economic payoffs are real and uneven—and they accrue to organizations that pair targeted deployment with hard‑nosed governance, contamination‑proof measurement, and clear risk disclosure.

Top Papers

  • High‑quality large language model (LLM) suggestions boost caseworker accuracy; bad suggestions substantially harm it (randomized controlled trial, high evidence)
    - A large randomized controlled trial (RCT) with nonprofit Supplemental Nutrition Assistance Program (SNAP) caseworkers shows baseline accuracy at 49% and a 27‑percentage‑point jump with high‑accuracy chatbot suggestions; low‑quality suggestions reduce performance. Gains plateau as assistant accuracy approaches perfection, revealing diminishing marginal improvement. This sets a clear deployment rule for high‑stakes services: accuracy gating, abstention, and escalation paths are nonnegotiable.
    
  • AI‑assisted irrigation raises wheat yields by 35% while cutting water and energy use ~30% in Baghdad trials (field experiment, medium‑high evidence)
    - On‑station trials report 35% higher yields, 36% less water, and 30% less energy, doubling water‑use efficiency and delivering a 30% internal rate of return. In water‑scarce settings, AI‑guided irrigation is both a productivity lever and a resource hedge. This is shovel‑ready technology with private and social returns.

  • Platform work reaches ~4.2% of employment across 24 OECD countries; reclassification as employees cuts supply but raises pay for remaining workers (cross‑country administrative + platform data with structural modeling, medium evidence)
    - Platform‑mediated gig work accounts for 4.2% of jobs and 12.8% of labor income among participants. Simulated reclassification shrinks platform labor supply ~18% while lifting hourly pay ~31%, yet median platform pay still trails comparable jobs by ~22%. The policy tradeoff is stark: fewer gigs, better pay, and a persistent quality gap relative to traditional employment.

  • Firm AI adoption increases patenting, patent quality, and R&D, implying a ~1.5% boost to aggregate TFP post‑adoption (firm‑level difference‑in‑differences, medium‑high evidence)
    - A staggered‑adoption difference‑in‑differences (DiD) design shows AI adopters generate more patents with higher citations and richer claims, tilt toward exploitative patents while raising originality and generality metrics, and scale up R&D. Aggregating firm‑level impacts implies a ~1.51% TFP uplift in representative post‑adoption years. The innovation channel is the mechanism.

  • Brief personalized large language model (LLM) coaching improves empathic expression, but AI‑attributed replies are rated less validating (pre‑registered randomized controlled trial, high evidence)
    - Personalized LLM coaching causally improves alignment with normative empathic communication; blinded ratings often score LLM replies as more empathic than human ones. Yet labeling identical replies as “AI” makes recipients feel less heard. The teachable takeaway: use LLMs to upskill people, not to impersonate them.

  • Ambiguity about data‑leak probability reduces AI personalization uptake; a known 30% risk does not (incentivized online randomized controlled trial, medium‑high evidence)
    - When privacy risk is ambiguous (e.g., a 10–50% leak range), adoption of personalized AI drops; when risk is explicit (30%), adoption holds around 50% and is insensitive to framing. Participants overpay for transparency labels, revealing demand for clarity. Regulators and product teams should standardize numeric disclosures to sustain adoption without sugar‑coating risk.

  • Human‑curated procedural skills raise LLM agent pass rates ~16 percentage points on average; model‑authored skills do not (benchmark, medium evidence)
    - Across 86 tasks and 11 domains, human‑authored skills lift agent success by 16.2 percentage points on average, with wide heterogeneity (e.g., +52 pp in healthcare vs. +5 pp in software). Focused, concise skills outperform sprawling documentation; smaller models plus curated skills match larger models without them. Knowledge engineering outperforms “let the model figure it out.”

  • Leaderboard scores overstate LLM capability because benchmark contamination inflates measured accuracy (contamination audit/meta‑evaluation, medium‑high evidence)
    - A contamination audit finds 13.8% lexical overlap on Massive Multitask Language Understanding (MMLU) and 18.1% overlap in STEM, and quantifies the accuracy inflation from exposed items; memorization patterns differ by model, skewing comparisons. Public leaderboards over‑credit capability and understate uncertainty. Procurement and policy must require contamination audits and secret holdouts (a minimal sketch of an overlap check appears after this list).

  • Contamination‑controlled re‑evaluation finds AI agents unstable and insufficient for end‑to‑end automated smart‑contract auditing (benchmark replication/audit, medium evidence)
    - Re‑testing with 26 configurations across four model families shows brittle, scaffold‑sensitive performance. On a contamination‑free set of post‑release incidents, end‑to‑end exploit success plunges, underscoring the need for human auditors. Vendor scaffolds overstate capability; buyers should insist on contamination‑free evaluations.
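
As a rough illustration of the lexical‑overlap check referenced above, the sketch below flags benchmark items whose word n‑grams appear verbatim in a reference corpus and reports the flagged share. The 13‑gram window, the any‑match rule, and the function names are illustrative assumptions, not the audit's actual pipeline.

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """Lower-cased word n-grams; a 13-token window is an illustrative choice."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(item: str, corpus_ngrams: Set[tuple], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears verbatim in the corpus."""
    return any(g in corpus_ngrams for g in ngrams(item, n))

def overlap_rate(benchmark: Iterable[str], corpus: Iterable[str], n: int = 13) -> float:
    """Share of benchmark items with at least one n-gram found in the reference corpus."""
    corpus_ngrams: Set[tuple] = set()
    for doc in corpus:
        corpus_ngrams |= ngrams(doc, n)
    items = list(benchmark)
    flagged = sum(is_contaminated(q, corpus_ngrams, n) for q in items)
    return flagged / max(len(items), 1)
```

Splitting reported accuracy between flagged and clean items then makes any contamination‑driven inflation directly visible.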

Emerging Patterns

  • Bold human–AI collaboration, fragile acceptance - Controlled assistant quality changes outcomes. In frontline services, accurate suggestions lift accuracy, while wrong ones harm it; in soft‑skill coaching, LLMs teach empathy effectively, yet recipients discount content labeled as AI. Agentic systems amplify this fragility: performance swings across scaffolds and configurations make automation claims unsafe without human oversight. Curated, narrow skills reduce instability, but domain heterogeneity is real—software engineering sees modest gains while healthcare and other procedural domains see large ones.
  • Productivity and innovation gains concentrate where complements are in place - AI raises output and input efficiency in the field and intensifies firm innovation pipelines. The largest gains appear where digital infrastructure, absorptive capacity, and domain‑specific playbooks exist—precision irrigation on research stations, design‑oriented firms with management buy‑in, and HR functions with analytics capability. The macro effect is lumpy: diffusion without complements delivers tepid returns, while targeted deployments deliver step‑changes.
  • Measurement is the guardrail—and it is currently loose - Contaminated benchmarks and scaffold gaming inflate capability, while single‑run measurements hide stochastic variance. Clean evaluations (contamination‑free incident sets, secret holdouts) and repeated sampling frameworks expose true dispersion in outcomes. Calibration methods with formal guarantees reduce some risks, but they erode utility under distributional shift; pragmatic governance requires audits, uncertainty intervals, and domain‑realistic testbeds over shiny leaderboards (a repeated‑sampling sketch follows this list).
  • Governance and disclosure shape adoption as much as capability - Ambiguity aversion depresses take‑up more than a stated moderate risk, making precise probabilistic disclosures a policy lever. Public attitudes toward government AI hinge more on information treatments than direct experience, and environmental externalities of inference‑heavy models rise as disclosure wanes. Clear, standardized disclosure and model‑level reporting are prerequisites for sustaining adoption and legitimacy.
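
As a rough illustration of the repeated‑sampling point above, the sketch below scores an agent by running each task several times and reporting a pass rate with a 95% Wilson interval rather than a single‑run number. The run count, the interval choice, and the pooling of runs across tasks are illustrative assumptions; a task‑level bootstrap would treat per‑task clustering more carefully.

```python
import math
from statistics import mean

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial pass rate."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - half), min(1.0, center + half))

def evaluate_with_resampling(tasks, run_task, runs_per_task: int = 5) -> dict:
    """Run each task several times and report a pass rate with an uncertainty
    interval, instead of a single stochastic run per task."""
    outcomes = [bool(run_task(t)) for t in tasks for _ in range(runs_per_task)]
    lo, hi = wilson_interval(sum(outcomes), len(outcomes))
    return {"pass_rate": mean(outcomes), "ci95": (lo, hi), "runs": len(outcomes)}
```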

Claims to Watch

  • Accuracy gating beats “assist everywhere” - Claim: In high‑stakes decisions, low‑quality AI suggestions reduce human accuracy while gains plateau at high assistant accuracy (randomized controlled trial with SNAP caseworkers). - Implication: Set minimum precision thresholds, enable abstention, and route edge cases to humans before scaling assistants (a minimal gating sketch follows this list).
  • Ambiguity aversion is the real adoption killer - Claim: Ambiguous leak probabilities suppress personalization uptake more than an explicit 30% risk (incentivized online RCT). - Implication: Mandate numeric risk disclosures and certify privacy labels to maintain adoption without obscuring hazards.
  • Human knowledge engineering outperforms self‑generated skills - Claim: Curated procedural skills add ~16 percentage points to agent success; model‑authored skills add nothing on average (multi‑domain benchmark). - Implication: Fund domain playbooks and skill libraries; do not rely on automated skill generation for mission‑critical tasks.
  • Reclassifying gig workers trades quantity for quality - Claim: Moving platforms to employee status cuts labor supply ~18% but raises hourly pay ~31%, with median pay still below traditional jobs (cross‑country data with modeling). - Implication: Pair reclassification with social‑insurance design and realistic price expectations; expect tighter supply and higher unit costs.
  • Leaderboards are inflated and misrank models - Claim: Benchmark contamination (13.8% on MMLU) and model‑specific memorization patterns overstate accuracy and distort comparisons (contamination audit). - Implication: Require contamination audits and secret holdouts in procurement; discount public leaderboard deltas.
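
As a rough illustration of the gating implication above, the sketch below routes assistant suggestions by a calibrated confidence score: show only high‑confidence suggestions, abstain in the middle band, and escalate edge cases to a human reviewer. The thresholds, field names, and routing labels are illustrative assumptions, not the trial's implementation.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    SHOW_SUGGESTION = "show_suggestion"  # assistant is confident enough to assist
    ABSTAIN = "abstain"                  # stay silent; the caseworker decides unaided
    ESCALATE = "escalate"                # hand the case to a senior reviewer

@dataclass
class Suggestion:
    answer: str
    confidence: float        # calibrated probability that the answer is correct
    is_edge_case: bool = False

def route(s: Suggestion,
          min_precision: float = 0.90,   # illustrative gate, set from measured precision
          abstain_floor: float = 0.60) -> Route:
    """Gate by measured precision: surface only high-confidence suggestions,
    abstain in the uncertain middle band, and escalate edge cases to humans."""
    if s.is_edge_case:
        return Route.ESCALATE
    if s.confidence >= min_precision:
        return Route.SHOW_SUGGESTION
    if s.confidence >= abstain_floor:
        return Route.ABSTAIN
    return Route.ESCALATE
```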

Methods Spotlight

  • Randomized assistant‑accuracy manipulation (LLMs in social services: How does chatbot accuracy affect human accuracy?) - A clean causal design that varies assistant reliability to measure human performance effects in high‑stakes tasks—an exportable template for human–AI workflow evaluation.
  • SkillsBench multi‑domain agent‑skills framework (SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks) - Standardizes measurement of curated versus self‑authored skills across tasks and configurations, enabling quantification of model–skill tradeoffs and optimal skill granularity.
  • Stochastic‑dominance‑constrained reinforcement learning from human feedback (RLHF) (Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control) - A differentiable optimal‑transport objective that controls tail risk beyond means, giving safety researchers a tractable way to suppress rare catastrophic outcomes (the underlying definitions are sketched after this list).
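
For readers new to the terms in that title, the block below states the standard textbook definitions of a spectral risk measure and second‑order stochastic dominance, and the classical link between them; it is background only, not the paper's differentiable optimal‑transport formulation.

```latex
% Spectral risk measure of a reward R, with a non-increasing spectrum
% \sigma \ge 0, \int_0^1 \sigma(u)\,du = 1, and quantile function F_R^{-1};
% lower-tail CVaR is the special case \sigma(u) = \alpha^{-1}\mathbf{1}\{u \le \alpha\}:
\[
  \rho_\sigma(R) = \int_0^1 \sigma(u)\, F_R^{-1}(u)\, du,
  \qquad
  \mathrm{CVaR}_\alpha(R) = \frac{1}{\alpha} \int_0^\alpha F_R^{-1}(u)\, du .
\]
% Second-order stochastic dominance of a policy's reward over a reference:
\[
  R_\pi \succeq_{\mathrm{SSD}} R_{\pi_0}
  \iff
  \int_0^\alpha F_{R_\pi}^{-1}(u)\, du \ge \int_0^\alpha F_{R_{\pi_0}}^{-1}(u)\, du
  \quad \text{for all } \alpha \in (0,1],
\]
% which holds exactly when \rho_\sigma(R_\pi) \ge \rho_\sigma(R_{\pi_0}) for every
% spectral risk measure of this form, hence "universal spectral risk control".
```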

The Week Ahead

  • Treat leaderboards as marketing, not measurement; demand contamination‑free test sets, repeated sampling, and uncertainty intervals before procurement.
  • Gate assistants by measured precision and enable abstention; pilot with randomized accuracy manipulations and hard escalation paths in public‑service deployments.
  • Invest in concise, domain‑specific skills and playbooks; A/B test skills by domain, starting where heterogeneity of returns is high.
  • Standardize probabilistic privacy disclosures and certify labels; ambiguity costs adoption more than candid risk.
  • Pair AI diffusion with complements—digital infrastructure, R&D budgets, and workforce training—to capture innovation gains and blunt distributional shocks.

Reading List