A three-arm ad experiment shows the ad-delivery algorithm—not the creative—drives most audience reallocation: in a Meta campaign the algorithm increased female impression share by 2.07 percentage points while the creative reduced it by 0.68 points, meaning roughly three-quarters of the shift was algorithmic and a standard two-arm A/B test underestimates that effect by about half.
Online advertising platforms host hundreds of thousands of A/B tests, but the platform's delivery algorithm routes each creative to the audience it predicts will engage. Every two-arm test therefore conflates the creative's effect with the algorithm's targeting response, and adjusting for the realized audience is biased because audience is a post-treatment mediator. We propose a three-arm design that adds an arm exposing the algorithm to the treatment metadata while holding the user-facing creative identical to control, point-identifying the natural indirect (algorithmic) and direct (creative) effects without sequential ignorability. In a live Meta campaign with a women-targeted text fragment, the algorithmic channel raises female impression share by +2.07 ppt while the creative channel moves it by -0.68 ppt; roughly three-quarters of the absolute reallocation is algorithmic, and a conventional two-arm test understates the algorithmic channel by a factor of two. The design isolates the platform algorithm's contribution to the outcome, separately from the contribution of the creative content.
Summary
Main Finding
A simple three-arm experimental design disentangles the effect of an ad creative from the platform delivery algorithm in A/B tests. By adding an arm that exposes the platform to the treatment metadata while rendering the user-facing creative as control, the design point-identifies the algorithmic (indirect) and creative (direct) channels without relying on the usual cross-world/sequential-ignorability assumption. In a live Meta campaign where the treatment was a women-targeted text fragment, the delivery algorithm increased female impression share by +2.07 percentage points while the visible creative moved it by −0.68 ppt; the total effect was +1.39 ppt and roughly three-quarters of the absolute reallocation was algorithmic. A conventional two-arm A/B test understates the algorithmic contribution by about a factor of two and misattributes the entire change to the creative.
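The headline figures fit together through simple arithmetic. The worked calculation below assumes that "absolute reallocation" means the sum of the absolute magnitudes of the two channels; the summary does not define the term explicitly, so treat that reading as an assumption.

```latex
\[
\mathrm{TE} = \mathrm{NIE} + \mathrm{NDE} = 2.07 + (-0.68) = 1.39 \ \text{ppt},
\qquad
\frac{|\mathrm{NIE}|}{|\mathrm{NIE}| + |\mathrm{NDE}|}
  = \frac{2.07}{2.07 + 0.68}
  = \frac{2.07}{2.75}
  \approx 0.75 .
\]
```

Under that reading, the algorithmic channel accounts for about 75% of the combined movement, which matches the "roughly three-quarters" statement.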
Key Points
- Problem: Standard two-arm A/B tests conflate (1) the direct effect of a creative on a fixed audience and (2) the indirect effect through the platform delivery algorithm (divergent delivery). Regression adjustment for the realized audience is biased because audience is a post-treatment mediator.
- Design: Randomize units into three arms, where A is the control creative and B is the treatment creative:
  - Arm 1: creative A with the algorithm left to respond to A, written (A, S(A)).
  - Arm 2 (new): creative A with the algorithm made to follow the distribution S(B), written (A, S(B)).
  - Arm 3: creative B with the algorithm responding to B, written (B, S(B)).
- Differences of arm means identify:
  - Natural indirect effect (NIE) = E[Y(A, S(B))] − E[Y(A, S(A))] (algorithmic channel)
  - Natural direct effect (NDE) = E[Y(B, S(B))] − E[Y(A, S(B))] (creative channel)
  - Total effect (TE) = NIE + NDE = E[Y(B, S(B))] − E[Y(A, S(A))]
- Identification assumptions (transparent and testable):
  - (A1) Random assignment to arms.
  - (A2) Excludability: the arm affects the outcome only through its assigned (D, S) pair, where D denotes the user-facing creative and S the delivery mechanism.
  - (A3) In arm 2, the realized mediator S follows the marginal distribution of S(B) (verifiable by comparing the S distributions in arms 2 and 3).
- Advantages:
  - Avoids cross-world sequential ignorability; replaces it with a manipulable, testable design condition.
  - Each estimand is a difference of arm means and is unbiased under the assumptions.
- Practical considerations:
  - Arm 2 must be implementable by manipulating the metadata or signals the platform uses for delivery.
  - Splitting the sample into three arms reduces per-contrast power; unequal allocation (e.g., 1:2:1) can improve precision.
  - Check for spillovers; if present, cluster-level or aggregate assignment is needed.
- Estimation & inference:
  - Point estimators: simple mean differences; standard errors via Neyman variance or heteroskedasticity-robust HC2/HC3 (the authors use HC3 and a wild bootstrap to control multiple testing).
  - Design validated in an auction simulation and applied to a live Meta experiment.
- Empirical finding (Meta experiment):
  - Algorithmic channel (NIE): +2.07 ppt female impression share.
  - Creative channel (NDE): −0.68 ppt.
  - Total effect: +1.39 ppt.
  - The algorithm concentrated new impressions on women aged 35–44 and reallocated away from the 65+ group.
  - A conventional two-arm test misattributes the whole shift to the creative and understates the algorithmic role by ~2×.
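The estimators in the list above reduce to three differences of arm means. The following is a minimal Python sketch, assuming one outcome value per randomized unit in each arm. The function names (`diff_in_means`, `three_arm_decomposition`) and the synthetic data are hypothetical and not taken from the paper, and the Neyman standard error shown is only one of the options mentioned, not the authors' HC3 plus wild-bootstrap procedure.

```python
import random
from math import sqrt
from statistics import fmean, variance


def diff_in_means(y_hi, y_lo):
    """Return (estimate, Neyman SE) for mean(y_hi) - mean(y_lo)."""
    est = fmean(y_hi) - fmean(y_lo)
    # Conservative Neyman variance: sum of per-arm sample variances over n.
    se = sqrt(variance(y_hi) / len(y_hi) + variance(y_lo) / len(y_lo))
    return est, se


def three_arm_decomposition(arm1, arm2, arm3):
    """arm1 = (A, S(A)), arm2 = (A, S(B)), arm3 = (B, S(B))."""
    return {
        "NIE": diff_in_means(arm2, arm1),  # algorithmic channel
        "NDE": diff_in_means(arm3, arm2),  # creative channel
        "TE": diff_in_means(arm3, arm1),   # total effect
    }


# Synthetic outcomes (hypothetical values, chosen only so the signs
# resemble the reported pattern; these are not the paper's data).
rng = random.Random(0)


def simulate(mean, n=5000, sd=5.0):
    return [rng.gauss(mean, sd) for _ in range(n)]


arm1 = simulate(50.0)  # control creative, algorithm responds to A
arm2 = simulate(52.0)  # control creative, algorithm follows S(B)
arm3 = simulate(51.3)  # treatment creative, algorithm responds to B

result = three_arm_decomposition(arm1, arm2, arm3)
for name, (est, se) in result.items():
    print(f"{name}: {est:+.2f} (SE {se:.2f})")
```

Because all three contrasts are built from the same arm means, the NIE and NDE estimates sum to the TE estimate by construction. Their standard errors do not add in the same way, since arm 2 enters both contrasts.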
Data & Methods
- Conceptual framing: Treat the platform delivery mechanism S as a mediator in the potential-outcomes / Pearl–Robins causal-mediation framework. Use a parallel/controlled-mediator design analogous to Imai et al. (2013).
- Three-arm randomized experiment deployed on Meta for a campaign in which the treatment was the addition of a women-targeted text fragment.
- Implementation detail: Arm 2 required manipulating the metadata signal available to the platform so that S in arm 2 matched the marginal distribution of S observed under B; equality of the distributions is tested empirically (e.g., with a KS test).
- Inference:
  - Use group mean differences for NIE, NDE, and TE.
  - Robust standard errors (HC3) and a wild bootstrap (Rademacher weights) for multiple-testing correction and small-cluster adjustments.
  - Diagnostics for within-arm dispersion and heteroskedasticity; day-level clustering checks.
- Validation:
  - Calibrated auction simulation (Meta-style second-price auction with matching signals) in which the true NIE/NDE/TE are known, used to validate that the three-arm design recovers the components.
  - Field experiment producing the substantive numerical decomposition reported above.
- Limitations/assumptions called out by authors:
  - The experiment identifies population-average (marginal) natural direct/indirect effects (controlled/marginal CDE), not unit-level counterfactuals E[Y(a, S_i(b))], which would require untestable cross-world assumptions.
  - Requires the capacity to manipulate the metadata signal and to enforce or emulate the mediator distribution of arm B in arm 2.
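The distributional check behind assumption (A3), comparing the mediator in arms 2 and 3, can be sketched with a two-sample Kolmogorov–Smirnov statistic. The pure-Python version below assumes the mediator is summarized by one scalar per unit (here a hypothetical per-unit female impression share). The function names and the synthetic data are illustrative only, and the critical value uses the standard large-sample approximation rather than anything specified in the paper.

```python
import random
from bisect import bisect_right
from math import log, sqrt


def ks_two_sample(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    # The gap between two step functions can only change at observed points.
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fx - fy))
    return d


def ks_critical(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c_alpha = sqrt(-0.5 * log(alpha / 2))  # about 1.358 for alpha = 0.05
    return c_alpha * sqrt((n + m) / (n * m))


# Synthetic mediator summaries (hypothetical, not the paper's data).
rng = random.Random(1)
s_arm1 = [rng.gauss(0.50, 0.05) for _ in range(2000)]  # S(A)
s_arm2 = [rng.gauss(0.52, 0.05) for _ in range(2000)]  # meant to mimic S(B)
s_arm3 = [rng.gauss(0.52, 0.05) for _ in range(2000)]  # S(B)

crit = ks_critical(len(s_arm2), len(s_arm3))
d_23 = ks_two_sample(s_arm2, s_arm3)  # (A3) check: should be small
d_13 = ks_two_sample(s_arm1, s_arm3)  # contrast: delivery differs across A and B
print(f"arms 2 vs 3: D = {d_23:.3f}, critical value = {crit:.3f}")
print(f"arms 1 vs 3: D = {d_13:.3f}, critical value = {crit:.3f}")
```

A statistic below the critical value for arms 2 versus 3 is consistent with (A3) but does not prove it. In practice the mediator is a full audience distribution, so a scalar summary like this one is only a first diagnostic.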
Implications for AI Economics
- Measurement and decision-making:
  - Many advertiser decisions based on two-arm A/B tests (e.g., scaling a creative) may be misdirected when platform algorithms reallocate impressions; the three-arm design lets managers separate performance gains that come from the creative change from gains that come from algorithmic reallocation.
  - Better allocation of budget: if gains are predominantly algorithmic, scaling creative content is less effective than leveraging the signal or targeting strategy that triggers the algorithmic behavior.
- Platform accountability and regulation:
  - The design gives a causal estimand for the platform’s contribution to differential exposure that is separable from advertiser creative choices. That quantity is directly relevant for fairness audits and regulatory inquiries in sensitive categories (housing, credit, employment).
  - Enables evidence-based attribution of disparate exposure to platform-side algorithms versus advertiser content, informing liability and mitigation strategies.
- Experimental practice and IS research:
  - Adds a practical experimental tool for researchers studying algorithmic bias and divergent delivery; the assumptions are experimentally testable rather than purely observational.
  - Encourages platforms and advertisers to instrument metadata signals to enable such decompositions; suggests a standard approach for future platform-run experiments and academic audits.
- Policy and market design:
  - If the algorithmic channel dominates reallocation (as found), regulation focused only on advertiser content may miss a large source of disparate exposure; interventions might need to target platform delivery rules or require transparency and controls over the signals the platform uses.
  - Platforms could offer advertisers a way to choose whether the delivery algorithm conditions on particular metadata, enabling more granular control over equity implications.
- Costs and trade-offs:
  - The three-arm design reduces per-contrast power relative to a two-arm test and requires the operational capability to manipulate metadata and enforce distributional matching. These practical costs must be balanced against the benefit of causal decomposition.
  - Inference still yields population-average decompositions; some policy questions might require stronger (and untestable) assumptions to recover unit-level effects.
- Research directions:
  - Apply the design across different platforms, outcomes (conversions, spend, geographic distribution), and protected attributes to map when algorithmic versus creative channels dominate.
  - Extend to clustered/spillover settings and longer-run equilibrium feedbacks of algorithmic delivery.
  - Combine with sensitivity analyses to bound unit-level cross-world deviations where those are of interest.
Overall, the paper provides a pragmatic, testable experimental solution to a widespread misattribution problem in platform A/B tests and establishes a clear causal object for measuring the platform’s role in differential exposure—an advance with direct managerial, regulatory, and research implications in AI-driven markets.
Assessment
Claims (10)
| Claim [category] | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Online advertising platforms host hundreds of thousands of A/B tests. [Other] | positive | high | count of A/B tests hosted on platforms | 0.6 |
| The platform's delivery algorithm routes each creative to the audience it predicts will engage. [Task Allocation] | positive | high | audience routing by delivery algorithm | 0.6 |
| Every two-arm test conflates the creative's effect with the algorithm's targeting response. [Decision Quality] | negative | high | confounding/bias in estimated creative effect | 0.6 |
| Adjusting for the realized audience is biased because audience is a post-treatment mediator. [Decision Quality] | negative | high | bias from post-treatment adjustment | 0.6 |
| We propose a three-arm design that adds an arm exposing the algorithm to the treatment metadata while holding the user-facing creative identical to control, point-identifying the natural indirect (algorithmic) and direct (creative) effects without sequential ignorability. [Research Productivity] | positive | high | identification of natural indirect and direct effects | 0.6 |
| In a live Meta campaign with a women-targeted text fragment, the algorithmic channel raises female impression share by +2.07 ppt. [Task Allocation] | positive | high | female impression share (change attributable to algorithmic channel) | +2.07 ppt; 0.6 |
| In the same campaign, the creative channel moves female impression share by -0.68 ppt. [Task Allocation] | negative | high | female impression share (change attributable to creative channel) | -0.68 ppt; 0.6 |
| Roughly three-quarters of the absolute reallocation is algorithmic. [Task Allocation] | positive | high | share of total impression reallocation attributable to algorithm | roughly three-quarters; 0.6 |
| A conventional two-arm test understates the algorithmic channel by a factor of two. [Decision Quality] | negative | high | bias/understatement factor in estimated algorithmic effect from two-arm test | factor of two; 0.6 |
| The design isolates the contribution of the platform's algorithm to the outcome which is separable from creative content. [Research Productivity] | positive | high | isolated contribution of algorithm to outcome | 0.6 |