The Commonplace

A three-arm ad experiment shows that the ad-delivery algorithm, not the creative, drives most audience reallocation. In a Meta campaign the algorithm increased female impression share by 2.07 percentage points while the creative reduced it by 0.68 points, so roughly three-quarters of the shift was algorithmic, and a standard two-arm A/B test understates the algorithmic effect by about a factor of two.

Algorithm or Creative? A Three-Arm Experimental Design for Decomposing Algorithmic Bias in Platform A/B Tests
Pallavi Pal, Anjana Susarla · May 22, 2026
arXiv · RCT · high evidence · 8/10 relevance
A three-arm randomized test on Meta separates algorithmic targeting from creative effects and finds the platform's delivery algorithm drives most of the audience reallocation—raising female impression share by +2.07 percentage points while the creative itself reduces it by -0.68 points, and a conventional two-arm test substantially understates the algorithmic channel.

Online advertising platforms host hundreds of thousands of A/B tests, but the platform's delivery algorithm routes each creative to the audience it predicts will engage. Every two-arm test therefore conflates the creative's effect with the algorithm's targeting response, and adjusting for the realized audience is biased because audience is a post-treatment mediator. We propose a three-arm design that adds an arm exposing the algorithm to the treatment metadata while holding the user-facing creative identical to control, point-identifying the natural indirect (algorithmic) and direct (creative) effects without sequential ignorability. In a live Meta campaign with a women-targeted text fragment, the algorithmic channel raises female impression share by +2.07 ppt while the creative channel moves it by -0.68 ppt; roughly three-quarters of the absolute reallocation is algorithmic, and a conventional two-arm test understates the algorithmic channel by a factor of two. The design isolates the contribution of the platform's algorithm to the outcome, which is separable from creative content.

Summary

Main Finding

A simple three-arm experimental design disentangles the effect of an ad creative from the platform delivery algorithm in A/B tests. By adding an arm that exposes the platform to the treatment metadata while rendering the user-facing creative as control, the design point-identifies the algorithmic (indirect) and creative (direct) channels without relying on the usual cross-world/sequential-ignorability assumption. In a live Meta campaign where the treatment was a women-targeted text fragment, the delivery algorithm increased female impression share by +2.07 percentage points while the visible creative moved it by −0.68 ppt; the total effect was +1.39 ppt and roughly three-quarters of the absolute reallocation was algorithmic. A conventional two-arm A/B test understates the algorithmic contribution by about a factor of two and misattributes the entire change to the creative.
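The arithmetic behind these figures can be checked directly. Write NIE for the algorithmic (indirect) channel and NDE for the creative (direct) channel, and read "absolute reallocation" as the sum of the two channels' magnitudes. That reading is an interpretation of the summary's wording, not a formula quoted from the paper.

```latex
\begin{align*}
\mathrm{TE} &= \mathrm{NIE} + \mathrm{NDE} = 2.07 + (-0.68) = 1.39 \ \text{ppt}, \\
\frac{|\mathrm{NIE}|}{|\mathrm{NIE}| + |\mathrm{NDE}|} &= \frac{2.07}{2.07 + 0.68} = \frac{2.07}{2.75} \approx 0.75 .
\end{align*}
```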

Key Points

  • Problem: Standard two-arm A/B tests conflate (1) the direct effect of a creative on a fixed audience and (2) the indirect effect through the platform delivery algorithm (divergent delivery). Post-treatment regression adjustment for realized audience is biased because audience is a mediator.
  • Design: Randomize units into three arms, where A is the control creative and B is the treatment creative:
    • Arm 1: creative A with the algorithm left to respond to A: (A, S(A)).
    • Arm 2 (new arm): creative A with the algorithm forced to follow the distribution S(B): (A, S(B)).
    • Arm 3: creative B with the algorithm responding to B: (B, S(B)).
  • Differences of arm means identify:
    • Natural indirect effect (NIE) = E[Y(A, S(B))] − E[Y(A, S(A))] (algorithmic channel)
    • Natural direct effect (NDE) = E[Y(B, S(B))] − E[Y(A, S(B))] (creative channel)
    • Total effect = NIE + NDE = E[Y(B, S(B))] − E[Y(A, S(A))]
  • Identification assumptions (transparent and testable):
    • (A1) Random assignment to arms.
    • (A2) Excludability (arm affects outcome only via the specified (D, S) pair).
    • (A3) In arm 2, the realized mediator S follows the marginal distribution of S(B) (verifiable by testing S distributions in arms 2 vs 3).
  • Advantages:
    • Avoids cross-world sequential ignorability; replaces it with a manipulable, testable design condition.
    • Each estimand is the difference of arm means; unbiased under the assumptions.
  • Practical considerations:
    • Arm 2 must be implementable by manipulating metadata/signals the platform uses for delivery.
    • Sample split into three arms reduces per-contrast power; unequal allocation (e.g., 1:2:1) can improve precision.
    • Check for spillovers; if present, cluster/aggregate assignment is needed.
  • Estimation & inference:
    • Point estimators: simple mean differences; standard errors via Neyman variance or heteroskedasticity-robust HC2/HC3 (authors use HC3 and wild bootstrap to control multiple testing).
    • Design validated in auction simulation and applied to a live Meta experiment.
  • Empirical finding (Meta experiment):
    • Algorithmic channel (NIE): +2.07 ppt female impression share.
    • Creative channel (NDE): −0.68 ppt.
    • Total effect: +1.39 ppt.
    • Algorithm concentrated new impressions on women aged 35–44 and reallocated away from 65+.
    • Conventional two-arm test misattributes the whole shift to creative and understates the algorithmic role by ~2×.
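The mean-difference estimators listed above can be sketched in a few lines. This is a minimal illustration, not the authors' code. The arm data are simulated, the per-arm female shares (0.50, 0.52, 0.514) are hypothetical values chosen only to loosely echo the reported magnitudes, and the outcome is assumed to be a 0/1 indicator for an impression shown to a woman. The names `diff_in_means`, `decompose`, and `simulate_arm` are made up for this example.

```python
import math
import random
from statistics import mean, variance


def diff_in_means(treated, baseline):
    """Return (estimate, Neyman standard error) for mean(treated) - mean(baseline)."""
    estimate = mean(treated) - mean(baseline)
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(baseline) / len(baseline))
    return estimate, se


def decompose(arm1, arm2, arm3):
    """Decompose the total effect from three arms of outcomes.

    arm1 holds outcomes under (A, S(A)), arm2 under (A, S(B)),
    arm3 under (B, S(B)), following the notation in the text.
    """
    return {
        "NIE": diff_in_means(arm2, arm1),  # algorithmic channel
        "NDE": diff_in_means(arm3, arm2),  # creative channel
        "TE": diff_in_means(arm3, arm1),   # what a two-arm test would measure
    }


def simulate_arm(rng, p_female, n):
    """Hypothetical data: 1.0 if the impression went to a woman, else 0.0."""
    return [1.0 if rng.random() < p_female else 0.0 for _ in range(n)]


rng = random.Random(0)
arm1 = simulate_arm(rng, 0.500, 5000)  # assumed share, illustration only
arm2 = simulate_arm(rng, 0.520, 5000)  # assumed share, illustration only
arm3 = simulate_arm(rng, 0.514, 5000)  # assumed share, illustration only

result = decompose(arm1, arm2, arm3)
for name, (est, se) in result.items():
    print(f"{name}: {100 * est:+.2f} ppt (SE {100 * se:.2f})")
```

Because all three contrasts are built from the same three arm means, the NIE and NDE estimates sum to the TE estimate by construction. The standard errors here are the simple Neyman form; the HC3 and wild-bootstrap variants mentioned above would need additional code.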

Data & Methods

  • Conceptual framing: Treat the platform delivery mechanism S as a mediator in the potential-outcomes / Pearl–Robins causal-mediation framework. Use a parallel/controlled-mediator design analogous to Imai et al. (2013).
  • Three-arm randomized experiment deployed on Meta for a campaign where treatment was adding a women-targeted text fragment.
  • Implementation detail: Arm 2 required manipulating the metadata signal available to the platform so that S in arm 2 matched the marginal distribution of S observed under B; equality of distributions is empirically tested (e.g., KS test).
  • Inference:
    • Use group mean differences for NIE, NDE, and TE.
    • Robust standard errors (HC3) and wild bootstrap (Rademacher weights) for multiple-testing correction and small-cluster adjustments.
    • Diagnostics for within-arm dispersion and heteroskedasticity; day-level clustering checks.
  • Validation:
    • Calibrated auction simulation (Meta-style second-price auction with matching signals) where true NIE/NDE/TE are known — used to validate that the three-arm design recovers the components.
    • Field experiment producing the substantive numerical decomposition reported above.
  • Limitations/assumptions called out by authors:
    • The experiment identifies population-average (marginal, controlled-mediator) direct and indirect effects. Unit-level counterfactuals such as E[Y(a, S_i(b))] remain unidentified without untestable cross-world assumptions.
    • Requires the capacity to manipulate the metadata signal and to enforce or emulate the mediator distribution of arm B in arm 2.
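Assumption (A3) is checked by comparing the mediator distribution in arm 2 with that in arm 3, and the summary mentions a KS test as one option. The sketch below computes the two-sample KS statistic by hand and, as a substitution for the usual asymptotic p-value, uses a permutation p-value. The mediator values are simulated, and treating S as a single scalar per unit is an assumption made for illustration. `ks_statistic` and `permutation_pvalue` are hypothetical helper names.

```python
import random


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Advance both pointers past every value <= v so ties are handled together.
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def permutation_pvalue(x, y, n_perm=300, seed=0):
    """Return (observed KS statistic, permutation p-value) under exchangeability."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical mediator summaries (one scalar per unit) for arms 2 and 3.
rng = random.Random(1)
s_arm2 = [rng.gauss(0.5, 0.1) for _ in range(300)]
s_arm3 = [rng.gauss(0.5, 0.1) for _ in range(300)]

d, p = permutation_pvalue(s_arm2, s_arm3)
print(f"KS statistic = {d:.3f}, permutation p-value = {p:.3f}")
```

A small p-value would suggest the arm-2 mediator distribution does not match arm 3, which puts (A3) in doubt. A large p-value is only a failure to reject, not proof that the distributions are equal.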

Implications for AI Economics

  • Measurement and decision-making:
    • Many advertiser decisions based on two-arm A/B tests (e.g., scaling a creative) may be misdirected when platform algorithms reallocate impressions; the three-arm design allows managers to separate whether performance gains come from creative change or algorithmic reallocation.
    • Better allocation of budget: if gains are predominantly algorithmic, scaling creative content is less effective than leveraging the signal or targeting strategy that triggers algorithmic behavior.
  • Platform accountability and regulation:
    • The design gives a causal estimand for the platform’s contribution to differential exposure that is separable from advertiser creative choices. That quantity is directly relevant for fairness audits and regulatory inquiries in sensitive categories (housing, credit, employment).
    • Enables evidence-based attribution of disparate exposure to platform-side algorithms versus advertiser content, informing liability and mitigation strategies.
  • Experimental practice and IS research:
    • Adds a practical experimental tool to the toolkit for researchers studying algorithmic bias and divergent delivery; the assumptions are experimentally testable rather than purely observational.
    • Encourages platforms and advertisers to instrument metadata signals to enable such decompositions; suggests a standard approach for future platform-run experiments and academic audits.
  • Policy and market design:
    • If the algorithmic channel dominates reallocation (as found), regulation focused only on advertiser content may miss a large source of disparate exposure; interventions might need to target platform delivery rules or require transparency/controls over signals the platform uses.
    • Platforms could offer advertisers a way to choose whether the delivery algorithm conditions on particular metadata, enabling more granular control over equity implications.
  • Costs and trade-offs:
    • The three-arm design reduces per-contrast power vs a two-arm test and requires operational capability to manipulate metadata and enforce distributional matching—these are practical costs that must be balanced against the benefit of causal decomposition.
    • Inference still yields population-average decompositions; some policy questions might require stronger (and untestable) assumptions to recover unit-level effects.
  • Research directions:
    • Apply the design across different platforms, outcomes (conversions, spend, geographic distribution), and protected attributes to map when algorithmic vs creative channels dominate.
    • Extend to clustered/spillover settings and longer-run equilibrium feedbacks of algorithmic delivery.
    • Combine with sensitivity analyses to bound unit-level cross-world deviations where those are of interest.

Overall, the paper provides a pragmatic, testable experimental solution to a widespread misattribution problem in platform A/B tests and establishes a clear causal object for measuring the platform’s role in differential exposure—an advance with direct managerial, regulatory, and research implications in AI-driven markets.

Assessment

Paper Type: rct

Evidence Strength: high. The paper uses a randomized field experiment that directly manipulates the inputs to the platform's delivery algorithm and the user-facing creative, giving strong internal validity and direct causal estimates of both algorithmic and creative channels; the mediation identification is formal and point-identified by design.

Methods Rigor: high. The design is a principled RCT with an added arm that cleanly separates mediator exposure from the user-facing treatment, allowing formal identification of natural direct and indirect effects; implementation in a live campaign and explicit comparison against the conventional two-arm estimator demonstrate both theory and practice. Remaining concerns (addressed in robustness checks or discussed) are typical RCT caveats such as interference, possible metadata leakage, and single-campaign scope.

Sample: A single live Meta advertising campaign using a women-targeted text fragment; impressions were randomized into three arms as described and the primary outcome was female impression share (proportion of impressions shown to women). The paper reports arm-level causal effects from impression- or delivery-level data from that campaign (sample size not specified in the summary).

Themes: human_ai_collab, adoption

Identification: Randomized three-arm field experiment: (1) control creative and control metadata, (2) control-facing creative paired with treatment metadata exposed to the platform's delivery algorithm (the added arm), and (3) treatment creative and treatment metadata; arms 1 and 3 together form the standard A/B test. Random assignment of impressions/allocations isolates the algorithm's response (natural indirect effect) from the creative content (natural direct effect), point-identifying mediation effects without relying on sequential ignorability.

Generalizability:
  • Single platform (Meta): platform-specific delivery algorithms may differ elsewhere.
  • Single campaign and single creative type (a women-targeted text fragment): effects may vary with creative modality, product, or audience.
  • Outcome limited to impression-level gender composition: may not generalize to engagement, clicks, conversions, or economic outcomes like sales or revenues.
  • Short-run field experiment: results may change as algorithms update or advertisers adapt.
  • May not generalize to other targeting objectives or to organic (non-paid) allocation systems.

Claims (10)

| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Online advertising platforms host hundreds of thousands of A/B tests. | Other | positive | high | count of A/B tests hosted on platforms | 0.6 |
| The platform's delivery algorithm routes each creative to the audience it predicts will engage. | Task Allocation | positive | high | audience routing by delivery algorithm | 0.6 |
| Every two-arm test conflates the creative's effect with the algorithm's targeting response. | Decision Quality | negative | high | confounding/bias in estimated creative effect | 0.6 |
| Adjusting for the realized audience is biased because audience is a post-treatment mediator. | Decision Quality | negative | high | bias from post-treatment adjustment | 0.6 |
| We propose a three-arm design that adds an arm exposing the algorithm to the treatment metadata while holding the user-facing creative identical to control, point-identifying the natural indirect (algorithmic) and direct (creative) effects without sequential ignorability. | Research Productivity | positive | high | identification of natural indirect and direct effects | 0.6 |
| In a live Meta campaign with a women-targeted text fragment, the algorithmic channel raises female impression share by +2.07 ppt. | Task Allocation | positive | high | female impression share (change attributable to algorithmic channel) | +2.07 ppt; 0.6 |
| In the same campaign, the creative channel moves female impression share by -0.68 ppt. | Task Allocation | negative | high | female impression share (change attributable to creative channel) | -0.68 ppt; 0.6 |
| Roughly three-quarters of the absolute reallocation is algorithmic. | Task Allocation | positive | high | share of total impression reallocation attributable to algorithm | roughly three-quarters; 0.6 |
| A conventional two-arm test understates the algorithmic channel by a factor of two. | Decision Quality | negative | high | bias/understatement factor in estimated algorithmic effect from two-arm test | factor of two; 0.6 |
| The design isolates the contribution of the platform's algorithm to the outcome which is separable from creative content. | Research Productivity | positive | high | isolated contribution of algorithm to outcome | 0.6 |

Notes