Agentic AI that learns from prior experiments outperforms human-guided design: in two large field trials of prescription messaging the tool-augmented AI produced a message with a 69.8% click-through rate, 6.5 percentage points above baseline, while general-purpose LLMs without access to experimental data failed to identify winners.

Beyond One-shot: AI Agents for Learning in Field Experiments

Junjie Luo, Ritu Agarwal, Gordon Gao · June 01, 2026

arXiv · RCT · high evidence · relevance 7/10

An agentic, tool-augmented AI that autonomously learned from prior A/B test data generated messaging interventions that outperformed human+chatbot designs in healthcare prescription messaging, with the best AI message achieving a 69.8% CTR (+6.5 percentage points over baseline).

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.

Summary

Main Finding

Tool-augmented agentic AI that executes code, runs statistical analyses, and reasons through a structured Data→Information→Knowledge→Wisdom (DIKW) pipeline can learn from prior field-experiment data and generate superior behavioral interventions. In two sequential randomized field experiments on prescription SMS messaging (693,139 patient visits total), the agentic system produced the best-performing variant (69.8% click-through), outperforming the prior-round human + chatbot designs. Frontier LLMs that lacked access to the experimental data could not predict which messages would succeed. The performance gains derived from grounding in domain-specific experimental evidence and transparent evidence chains, not from raw LLM generative ability alone.

Key Points

Two-stage megastudy design:
- Stage 1 (Jun 16–Jul 3, 2025): Human experts + conversational LLM co-designed 13 message variants; 444,691 patient visits.
- Stage 2 (Aug 25–Sep 8, 2025): Tool-augmented agentic AI extracted principles from Stage 1 and generated 17 new variants (tested alongside 3 Stage-1 baselines → 20 variants total); 248,448 patient visits.
Primary outcome: click-through rate (CTR) on the SMS link; secondary outcome: post-click authentication.
Best AI-generated message achieved 69.8% CTR (+6.5 percentage points over the Stage 2 baseline; the authors also report this as a 10.3% relative improvement).
Effective behavioral principles surfaced by the AI: efficiency framing and professional authority. Ineffective principles in this healthcare context: social proof and reciprocity — indicating limits to transferring general nudging theory across contexts.
Frontier LLMs without access to the experimental dataset (i.e., using only general training data) could not reliably predict which interventions would succeed — highlighting that proprietary, domain-specific experimental data were the key value source.
System architecture highlights: code execution for statistical analysis, multi-agent DIKW reasoning layers, and transparent evidence chains linking each design decision back to specific experimental observations.
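The absolute and relative lift figures quoted above can be reconciled with a few lines of arithmetic. The sketch below is illustrative only: the 63.3% baseline is not stated in this summary but is implied by subtracting the 6.5-point lift from the 69.8% CTR, and the function names are hypothetical.

```python
def implied_baseline(best_ctr_pct: float, lift_pp: float) -> float:
    """Baseline CTR (in %) implied by the best arm's CTR and its absolute lift."""
    return best_ctr_pct - lift_pp


def relative_lift_pct(best_ctr_pct: float, lift_pp: float) -> float:
    """Relative improvement (in %) of the best arm over the implied baseline."""
    return 100.0 * lift_pp / implied_baseline(best_ctr_pct, lift_pp)


# Figures taken from the summary: 69.8% CTR, +6.5 percentage points.
baseline = implied_baseline(69.8, 6.5)
relative = relative_lift_pct(69.8, 6.5)

# Stage 1 and Stage 2 visit counts should add up to the reported total.
total_visits = 444_691 + 248_448

print(f"implied baseline CTR: {baseline:.1f}%")
print(f"relative lift: {relative:.1f}%")
print(f"total visits: {total_visits:,}")
```

Under that assumption the relative lift rounds to 10.3%, matching the figure attributed to the authors, and the two stage counts sum to the reported 693,139 visits.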

Data & Methods

Setting: prescription notification SMS from a large U.S. patient messaging platform. Only message text varied; sender, timing, and link mechanics held constant.
Sample: 693,139 randomized patient-visit invitations across two stages (Stage 1: 444,691; Stage 2: 248,448). Randomization balance confirmed (χ² test, p > 0.05).
Outcomes:
- Primary: binary click-through on the SMS link.
- Secondary: authentication after click (deeper engagement).
Intervention design comparisons:
- Baseline: expert + chatbot co-designed messages (Stage 1).
- Treatment: agentic AI (Stage 2) that:
  - Ingested Stage 1 experimental data,
  - Executed statistical analyses (code-enabled tool use),
  - Employed DIKW-structured agents to infer behavioral principles,
  - Generated alternative message texts with explicit evidence chains tied to the analyzed data.
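To make the Data→Information→Knowledge→Wisdom flow and the idea of an evidence chain concrete, here is a minimal sketch. It is not the authors' system: the paper describes LLM-based agents with code execution, whereas this is plain deterministic Python. All class names, function names, variant ids, and counts are hypothetical; only the principle labels "efficiency" and "social_proof" echo the summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ArmResult:
    """Data layer: raw counts for one message variant (values are made up)."""
    variant_id: str
    principle: str  # hypothetical tag for the behavioral principle used
    sends: int
    clicks: int

    @property
    def ctr(self) -> float:
        return self.clicks / self.sends


@dataclass
class Principle:
    """Knowledge layer: an inferred design rule plus the variants backing it."""
    name: str
    mean_lift: float
    evidence: List[str]


def to_information(arms: List[ArmResult], baseline_id: str) -> Dict[str, float]:
    """Information layer: CTR lift of each arm over the baseline arm."""
    base = next(a for a in arms if a.variant_id == baseline_id)
    return {a.variant_id: a.ctr - base.ctr for a in arms if a.variant_id != baseline_id}


def to_knowledge(arms: List[ArmResult], lifts: Dict[str, float]) -> List[Principle]:
    """Group lifts by principle tag, remembering which variants support each."""
    grouped: Dict[str, List[str]] = {}
    for a in arms:
        if a.variant_id in lifts:
            grouped.setdefault(a.principle, []).append(a.variant_id)
    return [
        Principle(name, sum(lifts[v] for v in ids) / len(ids), ids)
        for name, ids in grouped.items()
    ]


def to_wisdom(principles: List[Principle], min_lift: float = 0.0) -> List[dict]:
    """Wisdom layer: a design brief of principles to reuse, each with its evidence."""
    keep = sorted(
        (p for p in principles if p.mean_lift > min_lift),
        key=lambda p: p.mean_lift,
        reverse=True,
    )
    return [{"use_principle": p.name, "because": p.evidence} for p in keep]


# Hypothetical Stage 1 style results, for illustration only.
arms = [
    ArmResult("baseline", "none", 1000, 600),
    ArmResult("eff_1", "efficiency", 1000, 660),
    ArmResult("eff_2", "efficiency", 1000, 640),
    ArmResult("social_1", "social_proof", 1000, 590),
]
lifts = to_information(arms, "baseline")
principles = to_knowledge(arms, lifts)
brief = to_wisdom(principles)
print(brief)
```

The point of the design is that every item in the final brief carries the ids of the experimental arms that justify it, which is one simple way to keep a design decision traceable back to specific observations.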
Validation: Stage 2 was a randomized field trial of AI-designed messages. The study also evaluated frontier LLMs operating without access to the Stage 1 experimental data; those models failed to predict which messages would succeed.
Robustness notes: message assignment was randomized and demographic/medical context covariates were recorded to explore heterogeneity (age cohorts, gender, therapeutic category, provider specialty, geography/ADI). The paper emphasizes traceability/auditability of the AI reasoning.
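Because the primary outcome is a binary click, a comparison of one variant against the baseline can be sketched as a two-proportion z-test with a Bonferroni-adjusted threshold for testing many variants. This is an assumed analysis, not the one the paper reports. The per-arm sample size of 10,000 and the number of comparisons (19) are hypothetical, and the click counts are chosen only to reproduce the 69.8% rate and the 63.3% baseline it implies (69.8 minus 6.5).

```python
import math


def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Return (z, two-sided p) for H0: both arms share one click-through rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Hypothetical counts: 10,000 sends per arm at 63.3% and 69.8% CTR.
z, p = two_proportion_z(6_330, 10_000, 6_980, 10_000)

# Bonferroni: divide alpha by the (assumed) number of variant-vs-baseline tests.
alpha, n_comparisons = 0.05, 19
alpha_adj = alpha / n_comparisons

print(f"z = {z:.2f}, p = {p:.3g}, significant = {p < alpha_adj}")
```

With arms of this size, a 6.5-point gap sits far outside what chance alone would produce, even after the multiple-comparison adjustment. Actual per-arm sample sizes and the authors' correction method are not given in this summary.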

Implications for AI Economics

Value creation from proprietary data and tooling, not LLM architecture alone:
- Economic value accrues to organizations that (a) collect large, domain-specific experimental datasets and (b) deploy tool-enabled agentic AI to extract reusable design knowledge. This favors incumbents or platforms with rich customer-experiment histories.
Moves experimentation from one-shot evaluation to cumulative design learning:
- Agentic AI can operationalize iterative improvement cycles at scale, increasing the return on each experiment and potentially lowering cost-per-improvement in behavioral interventions.
Market and product implications:
- Demand for experiment-management platforms, analytics tooling, and agentic-AI systems that provide auditable evidence chains will grow.
- There is commercial scope for specialized “design-learning” services that convert experimental data into intervention libraries or reusable design primitives.
Labor and organizational implications:
- Human experts remain important (for supervision, ethics, and domain judgment), but routine extraction of design principles and large-scale variant generation can be automated, shifting expert work toward oversight, validation, and higher-order theory building.
Policy, regulation, and welfare considerations:
- Transparent evidence chains and auditable DIKW pipelines are crucial for accountability, especially in sensitive domains (healthcare). Regulators and institutions should require traceability for automated intervention design.
- Care is needed before scaling to outcomes beyond proximal engagement (CTR). Economic benefits (e.g., improved adherence, cost savings) require linking improved engagement to downstream clinical and cost outcomes — an avenue for future evaluation.
Research and methodological implications:
- The result cautions against relying on general nudging heuristics; scalable, domain-specific auditing of behavioral theories via agentic AI can reveal which principles transfer and which do not.
- Future economic research should quantify welfare gains from cumulative experimental learning and study competitive dynamics where data-holding platforms can iterate faster.

Limitations to keep in mind: the paper reports gains in a messaging engagement metric (CTR) rather than downstream clinical outcomes; generalizability beyond prescription messaging and other healthcare populations requires further testing; and ethical/regulatory constraints around automated patient-targeted interventions must be actively managed.

Assessment

Paper Type: rct

Evidence Strength: high. Very large-scale, real-world field experiments (693,139 patient visits) with randomized assignment to messaging variants and an objective behavioral outcome (CTR) provide strong causal evidence that tool-augmented agentic AI can generate superior interventions; the two-stage design (use of prior-experiment data to design new treatments) strengthens the claim about learning from experimental data. Remaining caveats (single application domain, reporting details not provided here) do not negate the core randomized evidence.

Methods Rigor: high. Design uses randomized controlled field tests with very large samples and a clear, measurable outcome, plus a replication/extension via a second stage driven by the Stage 1 data; the paper also compares to LLM baselines to isolate the value of experimental data. Potential weaknesses (not specified in the summary) include how multiple hypothesis testing was handled, pre-registration/blinding details, treatment delivery integrity, and transparency of AI toolchains; these are important but do not outweigh the strong RCT backbone.

Sample: Field sample of 693,139 patient visits in a healthcare prescription messaging context. Stage 1: 444,691 visits testing 13 message variants co-designed by behavioral experts + conversational AI. Stage 2: 248,448 visits testing 17 AI-autonomously-generated variants derived from Stage 1 data. Primary outcome is click-through rate on prescription messages; population and geographic context not specified in the summary.

Themes: human_ai_collab, innovation

Identification: Randomized field experiments (A/B tests) assigning patient visits to message variants in two stages: Stage 1 tested 13 human+chatbot-designed messages; Stage 2 tested 17 AI-generated messages that were constructed using Stage 1 experimental data; causal effects are estimated by comparing CTRs across randomized arms (including a baseline control).

Generalizability:
- Single domain: healthcare prescription messaging; results may not generalize to other sectors or types of interventions.
- Outcome limited to short-term engagement (CTR), not downstream health or economic outcomes.
- Population/context details (e.g., country, patient demographics, health system) not specified and may limit transferability.
- Intervention medium is messaging (digital communications); effects may differ for in-person or policy interventions.
- Stage-specific learning may rely on properties of the initial experimental variants; success could vary where prior experiments are sparse or noisy.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
The best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline).	Adoption Rate	positive	high	click-through rate (CTR)	n=248448 69.8% CTR (+6.5 percentage points over baseline) 1.0
A tool-augmented agentic AI method (equipped with analytical tools, structured DIKW reasoning agents, and transparent evidence chains) can automatically learn from experimental data to generate new interventions and produce superior interventions compared to Human + Chatbot co-design.	Adoption Rate	positive	high	performance of message interventions (measured by CTR and comparative success of variants)	n=693139 1.0
Stage 1 (Human + Chatbot) produced 13 message variants and was tested on 444,691 patient visits.	Research Productivity	null_result	high	experiment sample allocation (number of message variants and patient visits)	n=444691 1.0
Stage 2 (Tool-Augmented Agentic AI) autonomously extracted principles from Stage 1 data and generated 17 new message variants tested on 248,448 patient visits.	Research Productivity	null_result	high	experiment sample allocation (AI-generated variants and patient visits)	n=248448 1.0
The value in generating better interventions comes from domain-specific experimental data, not from general reasoning ability of frontier LLMs: frontier LLMs operating without experimental data failed to predict which interventions would succeed.	Decision Quality	negative	medium	prediction accuracy / ability to identify successful interventions	0.36
General-purpose behavioral theories used for intervention design do not extend uniformly to this specific healthcare context, motivating an agentic AI approach to theory audits at field-experiment scale.	Research Productivity	negative	medium	generalizability/applicability of behavioral theories to the tested context	n=693139 0.36
Two-stage field experiments in healthcare prescription messaging encompassed 693,139 patient visits in total.	Research Productivity	null_result	high	total experimental sample size	n=693139 1.0
Tool-augmented AI can transform behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning by learning from experimental data and generating improved domain-relevant interventions.	Organizational Efficiency	positive	medium	scalability and cumulative learning capability of behavioral experimentation systems	n=693139 0.36