A new auto-bidding system that pairs directed exploration with a safety fallback lifts advertising performance: live Taobao tests show a 4.10% increase in ad GMV and a 3.52% rise in ad ROI while also raising clicks and spend.

Generative Auto-Bidding with Unified Modeling and Exploration

Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li · May 19, 2026

arXiv · Paper type: other · Evidence: medium · Relevance: 8/10

GUIDE—an auto-bidding framework combining a Decision Transformer with Q-value–guided exploration and an inverse-dynamics safe fallback—outperforms baselines in offline, simulated, and live Taobao experiments, raising ad GMV by 4.10%, clicks by 1.40%, and ad ROI by 3.52%.

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

Summary

Main Finding

The paper introduces GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a unified generative framework for automated ad bidding that combines a Decision Transformer (DT) for long-horizon sequence modeling, an Inverse Dynamics Module (IDM) as a safe fallback, and a twin-Q-value module that guides exploration and selects between exploratory and safe actions. This "explore-safeguard-select" design targets both more efficient exploration and reduced financial risk. GUIDE outperforms state-of-the-art baselines in offline and simulated tests and yields substantial real-world gains in a large-scale Taobao deployment (+4.10% ad GMV, +1.40% clicks, +1.66% ad cost, +3.52% ad ROI).

Key Points

Unified modeling: DT jointly generates the next action and the next environment state (â_t, ŝ_{t+1}), which provides richer supervision and captures long-term dependencies better than MDP-only or action-only sequence models.
Safety fallback via inverse dynamics: IDM infers an action â_idm_t from (s_t, ŝ_{t+1}); because IDM is trained to reproduce dataset behavior, it yields safer, behaviorally consistent actions that act as a fallback during risky exploration.
Directed exploration with value guidance: a twin-Q (double critic) module estimates Q(s,a). It both regularizes DT exploration (encouraging higher-value directions) and adaptively selects between the DT action and the IDM action at inference time, balancing exploration and safety.
Two-stage training: (1) separate pretraining of DT and IDM (preventing unstable gradients), then (2) joint training where IDM loss propagates into the DT to improve state realism and action-state consistency.
Architecture and losses: DT trained with behavior-cloning action loss plus state prediction loss; IDM trained by MSE between inferred and logged actions; twin-Q critics trained with TD targets using min(Q1,Q2) to reduce overestimation bias.
Empirical results: consistent SOTA improvements on public offline datasets and in simulated auction environments; a large-scale online A/B test on Taobao shows the business-KPI uplift noted above.

Data & Methods

Problem framing: auto-bidding as sequence modeling with states s_t (budget, time, past results, user/context features), actions a_t (e.g., dynamic multiplier λ_t), immediate rewards r_t, and return-to-go R_t. Trajectories τ = (s_1, a_1, r_1, ..., s_T, a_T, r_T).
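As a concrete illustration of the return-to-go R_t that conditions the model, here is a minimal Python sketch. It assumes the usual Decision Transformer convention, in which R_t is the undiscounted sum of rewards from step t to the end of the episode; the summary does not state the exact definition used in the paper, and the function name `returns_to_go` is hypothetical.

```python
def returns_to_go(rewards):
    """Return [R_1, ..., R_T] where R_t = r_t + r_{t+1} + ... + r_T.

    Assumption: undiscounted sum, as in the standard Decision Transformer setup.
    """
    rtg = []
    running = 0.0
    # Accumulate from the last step backwards so each R_t is built in O(1).
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


# Toy episode with three bidding steps and made-up rewards.
print(returns_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```

Each R_t is then paired with s_t and a_t to form the (R, s, a) tokens the DT conditions on.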
Decision Transformer backbone: conditions on historical (R, s, a) and predicts both â_t and ŝ_{t+1} in one forward pass, enabling richer supervision and multi-step planning.
Inverse Dynamics Module (IDM): an MLP f_idm(s_t, ŝ_{t+1}) → â_idm_t trained to minimize E[||â_idm_t − a_t||^2]. At inference, the IDM uses the DT-predicted ŝ_{t+1} to output safe actions.
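The IDM objective above can be illustrated with a small Python sketch. This is not the paper's implementation: `idm_mse` and `toy_idm` are hypothetical names, and the toy "IDM" is a hand-written rule standing in for the MLP, purely to show how the squared error between inferred and logged actions is averaged.

```python
def idm_mse(idm, transitions):
    """Mean squared error between IDM-inferred and logged actions.

    transitions: list of (s_t, s_next, a_t), each a list of floats.
    idm: callable (s_t, s_next) -> inferred action (list of floats).
    """
    total = 0.0
    for s_t, s_next, a_t in transitions:
        a_hat = idm(s_t, s_next)
        total += sum((p - q) ** 2 for p, q in zip(a_hat, a_t))
    return total / len(transitions)


def toy_idm(s_t, s_next):
    # Assumption for illustration only: the state is [remaining_budget]
    # and the inferred action is the budget consumed between two states.
    return [s_t[0] - s_next[0]]


# Made-up logged transitions: (state, next state, logged action).
logged = [([1.0], [0.5], [0.5]), ([0.5], [0.25], [0.0])]
loss = idm_mse(toy_idm, logged)
print(loss)  # 0.03125
```

During training the next state would come from the dataset or from the DT prediction, depending on the phase; at inference only the DT-predicted state is available.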
Two-stage training:
- Phase 1: separate pretraining; DT optimized with action and state prediction losses, IDM trained on detached DT-predicted states (stop-gradient) to avoid unstable updates.
- Phase 2: joint training allowing gradients from IDM to flow into DT; combined loss L = L_DT + L_IDM.
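The two phases can be written out as losses. The following is a sketch under stated assumptions: sg(·) denotes stop-gradient, the DT terms are summed without weights, and squared error is assumed for both DT terms (the summary specifies MSE only for the IDM). Any Q-value regularization term on the DT is omitted because the summary does not give its form.

```latex
\begin{align*}
% Phase 1: separate pretraining; the IDM sees detached DT-predicted states
\mathcal{L}_{\mathrm{DT}} &= \mathbb{E}\big[\lVert \hat{a}_t - a_t \rVert^2\big]
  + \mathbb{E}\big[\lVert \hat{s}_{t+1} - s_{t+1} \rVert^2\big] \\
\mathcal{L}_{\mathrm{IDM}}^{(1)} &= \mathbb{E}\big[\lVert f_{\mathrm{idm}}\big(s_t, \mathrm{sg}(\hat{s}_{t+1})\big) - a_t \rVert^2\big] \\
% Phase 2: joint training; IDM gradients flow into the DT through \hat{s}_{t+1}
\mathcal{L}_{\mathrm{IDM}} &= \mathbb{E}\big[\lVert f_{\mathrm{idm}}(s_t, \hat{s}_{t+1}) - a_t \rVert^2\big] \\
\mathcal{L} &= \mathcal{L}_{\mathrm{DT}} + \mathcal{L}_{\mathrm{IDM}}
\end{align*}
```

The only difference between the two IDM terms is the stop-gradient: in Phase 1 the IDM loss cannot change the DT's state predictions, while in Phase 2 it can.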
Q-value module: twin critic networks Q1,Q2 with corresponding target networks; TD target y_t = r_t + γ(1−d_t) * min(Q_target1(s_{t+1}, a_{t+1}), Q_target2(...)). Critics trained from replay buffer; min operator used to mitigate overestimation.
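The TD target can be computed directly from its definition. The Python sketch below is a minimal illustration for a single transition; the function name `td_target` and the default discount `gamma=0.99` are assumptions, not values taken from the paper.

```python
def td_target(reward, done, q1_next, q2_next, gamma=0.99):
    """y_t = r_t + gamma * (1 - d_t) * min(Q_target1, Q_target2).

    q1_next, q2_next: target-critic estimates at (s_{t+1}, a_{t+1}).
    done: 1.0 if the episode ended at this step, else 0.0.
    gamma: discount factor (the default here is an assumed value).
    """
    # Taking the smaller of the two estimates counteracts overestimation.
    return reward + gamma * (1.0 - done) * min(q1_next, q2_next)


# Non-terminal step: bootstraps from the smaller critic estimate.
print(td_target(1.0, 0.0, 2.0, 3.0, gamma=0.5))  # 2.0
# Terminal step: the bootstrap term vanishes.
print(td_target(1.0, 1.0, 2.0, 3.0, gamma=0.5))  # 1.0
```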
Action selection & regularization: Q-values used to (a) regularize DT action generation during training toward higher-value actions, and (b) at inference select between DT-proposed exploratory action and IDM fallback action based on estimated Q(s,a), enabling adaptive exploration with safety.
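The inference-time choice between the two candidate actions can be sketched in Python as follows. The summary says only that selection is based on estimated Q(s,a); scoring each candidate by the minimum of the two critics and breaking ties in favor of the IDM fallback are assumptions made here, and `select_action`, `q1`, and `q2` are hypothetical names with toy critics.

```python
def select_action(state, a_dt, a_idm, q1, q2):
    """Pick the DT (exploratory) or IDM (fallback) action by critic value.

    Assumptions: each candidate is scored by min(Q1, Q2), and the IDM
    action is kept unless the DT action scores strictly higher.
    """
    def value(action):
        return min(q1(state, action), q2(state, action))

    return a_dt if value(a_dt) > value(a_idm) else a_idm


# Toy critics (illustrative only): both prefer actions near 1.0 to 1.2.
def q1(state, action):
    return -(action - 1.0) ** 2


def q2(state, action):
    return -(action - 1.2) ** 2


print(select_action(None, 1.1, 0.5, q1, q2))  # 1.1 (DT action scores higher)
print(select_action(None, 3.0, 0.5, q1, q2))  # 0.5 (falls back to IDM action)
```

Defaulting to the fallback on ties is a conservative choice: the exploratory action must show a strictly better estimated value before it is used.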
Evaluation: experiments on public offline datasets and simulated auction environments (building on auction simulators / environment models in prior work), and a large-scale online A/B deployment on Taobao. Reported production improvements in GMV, clicks, cost, and ROI.

Implications for AI Economics

Practical trade-off resolution between exploration and financial risk: GUIDE provides a concrete mechanism to pursue policy improvement (exploration) while limiting downside via an empirically grounded fallback (IDM) and Q-based selection, which matters where experiments are monetized and risky.
Improved advertiser and platform efficiency: increases in GMV and ROI in production suggest better budget allocation and higher-quality traffic acquisition, with potential knock-on effects on advertiser surplus and platform revenue.
Market dynamics and competitive effects: more effective auto-bidding agents may intensify competition in auctions (higher bids or better match quality), potentially shifting equilibrium outcomes, advertiser strategies, and price dynamics. Platforms should monitor systemic impacts (e.g., winner payments, bidder churn).
Transferability to other high-stakes economic systems: the unified generative + inverse-dynamics + value-guidance pattern is applicable to other domains where exploration risk is costly—e.g., algorithmic trading, dynamic pricing, inventory control—where a safe behavioral fallback is essential.
Methodological and policy considerations:
- Simulation fidelity and distributional shift: performance depends on simulator/environment fidelity; value-guided exploration can still fail under unmodeled distributional changes—continuous online monitoring and safe rollback mechanisms remain crucial.
- Strategic behavior and externalities: upgraded bidding agents can trigger strategic responses from competing advertisers; platform designers should assess welfare and fairness implications.
- Regulatory & ethical oversight: financial-risk-mitigating exploration reduces some harms but automated bidding still raises transparency, accountability, and competition-policy questions.
Operational costs and adoption: the combined models (DT + IDM + twin Q) increase modeling and compute complexity; organizations must weigh inference latency and engineering costs against expected economic gains. Continuous evaluation on budget/CPA constraints and conservative deployment strategies are recommended.

Assessment

Paper Type: other
Evidence Strength: medium. The paper combines offline, simulation, and live platform experiments, including reported percentage gains in GMV, clicks, cost, and ROI from a large-scale deployment, which provides practical evidence of impact; however, the write-up does not present a transparent randomized controlled design, pre-registration, balance checks, or robustness to alternative explanations, leaving open concerns about selection, temporal confounding, and generalizability.
Methods Rigor: high. Methodologically the paper introduces a novel integrated architecture (Decision Transformer + Q-value regularization + Inverse Dynamics fallback), and evaluates it across public datasets, simulated auction environments, and live deployment, suggesting thorough ML experimentation and engineering rigor; incremental algorithmic choices are justified with ablations and cross-environment tests (as reported).
Sample: Experiments use (a) public advertising datasets for offline evaluation, (b) simulated auction environments to test behavioral and safety properties, and (c) large-scale live deployment on Taobao's advertising platform measuring platform-level metrics (ad GMV, clicks, ad cost, ad ROI); exact sample sizes and campaign selection criteria are not specified in the summary.
Themes: productivity adoption
Identification: No formal causal identification strategy is described; evaluation relies on offline benchmarking on public datasets, simulation experiments in auction environments, and large-scale online deployment (presumably A/B testing) on Taobao, with performance improvements reported on platform metrics.
Generalizability:
- Results come from a single major e-commerce platform (Taobao) and may not generalize to other platforms, geographies, or ad formats.
- Platform-specific auction mechanics, bidder composition, and marketplace dynamics could limit transfer to different advertising ecosystems.
- Lack of detailed randomization/experiment design information limits confidence that observed gains will persist across time windows or diverse campaign types.
- Requires substantial platform telemetry and infrastructure; implementation feasibility may be limited for smaller advertisers or exchanges.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Early rule-based methods lacked adaptability.	Other	negative	high	adaptability of early rule-based bidding methods	0.02
Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies.	Other	negative	high	ability of RL approaches to handle long-term dependencies	0.02
Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback, resulting in inefficient exploration and elevated financial risk for advertising platforms.	Other	negative	high	exploration efficiency and financial risk in generative-model-based auto-bidding	0.12
We propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism.	Other	positive	high	integration of directed exploration and safe fallback in an auto-bidding framework	0.02
GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions, a Q-value module to guide DT exploration via regularization constraints, and an Inverse Dynamics Module (IDM) that leverages DT-predicted future states to infer robust behaviorally consistent actions as a safe policy fallback.	Other	positive	high	architectural design: DT + Q-value regularization + IDM fallback	0.02
We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao.	Other	null_result	high	scope and environments of experiments (public datasets, simulations, live deployment)	0.12
Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios.	Firm Revenue	positive	high	performance relative to state-of-the-art baselines	0.12
In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.	Firm Revenue	positive	high	ad GMV; ad clicks; ad cost; ad ROI	+4.10% ad GMV; +1.40% ad clicks; +1.66% ad cost; +3.52% ad ROI 0.2