Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm

Summary

Main Finding

Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide-generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality and improves 33.1% over the same base model without RL fine-tuning. Crucially, performance on this agentic task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.

Key Points

Task and environment
- Automated generation of professional HTML slide presentations that require research, content planning, visual design, and audience-aware communication.
- Implemented in an OpenEnv-compatible RL environment where LLM agents use tools to research, plan, and render slides.
Reward design
- Multi-component reward combines:
  - Structural validation (format/structure checks)
  - Render quality assessment (how well generated HTML renders)
  - LLM-based aesthetic scoring
  - Content quality metrics (factuality, coverage, coherence)
  - Inverse specification reward: an LLM attempts to recover the original brief/spec from the generated slides; accuracy of recovery provides a holistic fidelity signal.
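One way to picture the multi-component reward listed above is as a weighted average of per-component scores. The sketch below is illustrative only: the source does not say how the components are weighted or combined, so the `RewardComponents` class, the `ILLUSTRATIVE_WEIGHTS` values, and the assumption that each score lies in [0, 1] are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RewardComponents:
    """Per-episode scores, each assumed (hypothetically) to lie in [0, 1]."""
    structural: float
    render: float
    aesthetic: float
    content: float
    inverse_spec: float


# Hypothetical weights for illustration; the source reports no weighting scheme.
ILLUSTRATIVE_WEIGHTS = {
    "structural": 0.15,
    "render": 0.15,
    "aesthetic": 0.20,
    "content": 0.25,
    "inverse_spec": 0.25,
}


def composite_reward(
    scores: RewardComponents,
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Return the weighted average of the component scores."""
    weights = ILLUSTRATIVE_WEIGHTS if weights is None else weights
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(getattr(scores, name) * w for name, w in weights.items())
    return weighted / total_weight


example = RewardComponents(
    structural=1.0, render=0.9, aesthetic=0.6, content=0.7, inverse_spec=0.8
)
print(round(composite_reward(example), 3))
```

Dividing by the total weight keeps the result on the same scale as the inputs even if the weights do not sum to one.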
Training setup
- Fine-tuned Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters (parameter-efficient fine-tuning).
- Training prompts derived from expert demonstrations collected using Claude Opus 4.6.
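GRPO scores each rollout relative to the other rollouts sampled for the same prompt rather than against a learned value function. The minimal sketch below shows only that group-relative normalization step. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is an assumption; implementations differ.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of the same brief, scored by a composite reward.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
print([round(a, 2) for a in advantages])
```

Rollouts that score above the group mean receive positive advantages and are reinforced; if every rollout in a group scores the same, all advantages are zero and that group contributes no learning signal.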
Empirical results
- Evaluation on 48 diverse business briefs across six models.
- Fine-tuned 7B model achieves 91.2% of Claude Opus 4.6 quality and +33.1% over the untuned base 7B.
- Cross-model comparison shows instruction adherence and tool-use compliance predict agentic task performance better than parameter count.
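The two headline figures can be combined to estimate where the untuned base model sits relative to Claude Opus 4.6. This is a back-of-the-envelope sketch that assumes the 33.1% figure is a relative improvement on the same quality scale; the source does not state the base model's score directly, so the derived value is an inference, not a reported result.

```python
FINE_TUNED_VS_OPUS = 0.912  # fine-tuned 7B quality as a fraction of Claude Opus 4.6
GAIN_OVER_BASE = 0.331      # reported improvement over the untuned base model


def implied_base_vs_reference(fine_tuned_vs_reference, gain_over_base):
    """Base-model quality as a fraction of the reference model.

    Assumes fine_tuned = base * (1 + gain_over_base), i.e. a relative gain.
    """
    return fine_tuned_vs_reference / (1.0 + gain_over_base)


base_vs_opus = implied_base_vs_reference(FINE_TUNED_VS_OPUS, GAIN_OVER_BASE)
print(f"{base_vs_opus:.3f}")  # prints 0.685
```

Under that assumption the base model would sit at roughly 68.5% of the reference quality, so fine-tuning closes most, but not all, of the gap.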
Artifacts released
- SlideRL dataset: 288 multi-turn rollout trajectories across six models.
  - Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts
- Code: https://github.com/pushing-the-frontier/slide-forge-llm

Data & Methods

Data
- 48 business briefs used as evaluation tasks, chosen to be diverse across business use-cases.
- Expert demonstration prompts collected from Claude Opus 4.6 used to bootstrap training data.
- SlideRL dataset: 288 multi-turn rollouts (trajectories) spanning six models for reproducibility and analysis.
Methods
- Environment: OpenEnv-compatible RL environment enabling tool use (web/knowledge access, rendering pipeline) and multi-turn planning.
- Agent architecture: Qwen2.5-Coder-7B base model, parameter-efficient fine-tuning (0.5% of params) through GRPO.
- Reward function: composite signal combining structural, render, aesthetic, content, and inverse-specification rewards to capture both low-level correctness and high-level fidelity.
- Evaluation: proxies for human-judged quality (LLM-based scoring and render checks), with comparisons against Claude Opus 4.6 and other baseline models across the 48 briefs; results are reported as relative quality and as improvement over the base model.
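The multi-turn tool-use loop described under Methods can be pictured with a toy environment. The sketch below is purely illustrative and does not use or mirror the OpenEnv API: the `ToySlideEnv` class, its tool names (`research`, `add_slide`, `finish`), and the scripted policy that stands in for the LLM agent are all hypothetical.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class ToySlideEnv:
    """Hypothetical toy environment; it does not use or mirror the OpenEnv API."""
    brief: str
    max_turns: int = 8
    notes: list = field(default_factory=list)
    slides: list = field(default_factory=list)
    turns: int = 0

    def reset(self) -> dict:
        """Clear episode state and return the first observation."""
        self.notes.clear()
        self.slides.clear()
        self.turns = 0
        return {"brief": self.brief, "slides": 0}

    def step(self, action: dict) -> tuple:
        """Apply one tool call and return (observation, done)."""
        self.turns += 1
        tool, arg = action.get("tool"), action.get("arg", "")
        if tool == "finish":
            return {"result": "finished", "slides": len(self.slides)}, True
        if tool == "research":
            self.notes.append(arg)
            result = f"noted: {arg}"
        elif tool == "add_slide":
            # Escape the title so the generated HTML stays well-formed.
            self.slides.append(f"<section><h1>{escape(arg)}</h1></section>")
            result = f"slide {len(self.slides)} added"
        else:
            result = f"unknown tool: {tool}"
        observation = {"result": result, "slides": len(self.slides)}
        return observation, self.turns >= self.max_turns


# A scripted "policy" standing in for the LLM agent.
script = [
    {"tool": "research", "arg": "Q3 revenue drivers"},
    {"tool": "add_slide", "arg": "Q3 Revenue & Outlook"},
    {"tool": "finish"},
]
env = ToySlideEnv(brief="Quarterly business review")
observation = env.reset()
for action in script:
    observation, done = env.step(action)
    if done:
        break
print(observation)
```

In a real setup the scripted list would be replaced by model-generated tool calls, and a reward would be computed from the final slides once the episode ends.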

Implications for AI Economics

Cost-effectiveness & deployment strategy
- Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying high ROI for targeted fine-tuning versus full-model scaling. Firms can achieve near top-tier performance with smaller, cheaper models plus tailored training and tooling.
Returns to alignment and tooling
- Greater gains stem from instruction adherence and tool-use compliance than raw scale, suggesting investments in instruction engineering, tool integration, and evaluation/reward design may produce larger marginal returns than increasing parameter counts.
Labor and productivity
- High-quality automated slide generation reduces time spent on business presentation creation and design; potential for productivity gains and partial substitution of routine creative/knowledge-worker tasks. Economic impact will depend on adoption, integration into workflows, and limits of generalization beyond evaluated briefs.
Market structure & competition
- If smaller tuned models can capture most of the performance of much larger systems, market power may shift: specialized, cheaper models plus toolchains could undercut demand for large general-purpose models, promoting niche competition and verticalized offerings.
Evaluation & measurement
- The inverse specification reward offers a domain-agnostic, holistic metric for fidelity to user intent; economists and firms should consider such task-recovery-based evaluation metrics when measuring model value and service quality.
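A task-recovery metric of this kind can be sketched as "recover the brief from the slides, then score the recovery against the original". The source does not specify how recovery accuracy is scored, so the set-based token F1 below is a stand-in, and `recover` is a hypothetical callable that would wrap an LLM call in practice.

```python
import re
from typing import Callable


def _tokens(text: str) -> set:
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def recovery_score(original_spec: str, recovered_spec: str) -> float:
    """Set-based token F1 between the original and recovered specifications."""
    orig, rec = _tokens(original_spec), _tokens(recovered_spec)
    overlap = len(orig & rec)
    if overlap == 0:
        return 0.0
    precision = overlap / len(rec)
    recall = overlap / len(orig)
    return 2 * precision * recall / (precision + recall)


def inverse_spec_reward(
    original_spec: str,
    slides_html: str,
    recover: Callable[[str], str],
) -> float:
    """Score how well `recover` reconstructs the specification from the slides."""
    return recovery_score(original_spec, recover(slides_html))


def stub_recover(slides_html: str) -> str:
    """Hypothetical stand-in for an LLM: strips tags and returns the visible text."""
    return re.sub(r"<[^>]+>", " ", slides_html)


spec = "Pitch deck for a solar startup seeking seed funding"
slides = "<section><h1>Solar startup pitch</h1><p>Seeking seed funding</p></section>"
print(round(inverse_spec_reward(spec, slides, stub_recover), 2))
```

Token overlap rewards surface similarity only; a semantic comparison, for example an LLM judge or embedding similarity, would likely track intent more closely, at the cost of another model call.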
Policy & externalities
- Easier and cheaper deployment of capable agentic systems raises questions about misuse (e.g., persuasive content generation) and labor market effects; policymakers should monitor adoption patterns and consider upskilling/transition support in affected occupations.
Research & reproducibility
- Open dataset and code improve reproducibility and lower barriers for follow-up work; valuable for empirical work on diffusion, adoption, and economic impacts of applied LLM tools.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides strong experimental evidence that parameter-efficient RL fine-tuning materially improves task-specific LLM performance (clear within-sample comparisons, ablations, and public artifacts). However, claims about economic impacts (cost-effectiveness, labor substitution, market structure) are inferential and not supported by causal, field-level economic measurement, limiting strength for AI-economics conclusions.
Methods Rigor: medium. Methodological strengths include an OpenEnv agentic evaluation environment, a multi-component reward design (including an inverse-specification signal), cross-model comparisons, and released code/data. Weaknesses include a small held-out task set (48 briefs), a modest set of rollout trajectories (288), reliance on expert demonstrations from a single high-end model (Claude Opus 4.6) that could introduce bias, use of LLM-based quality proxies rather than large-scale human or field experiments, and limited robustness checks for overfitting or reward hacking.
Sample: Evaluation on 48 diverse business briefs; SlideRL dataset of 288 multi-turn rollout trajectories across six models (including base and fine-tuned Qwen2.5-Coder-7B and Claude Opus 4.6 demonstrations); training prompts/demonstrations collected from Claude Opus 4.6; fine-tuning via GRPO of Qwen2.5-Coder-7B with parameter-efficient updates (~0.5% of parameters).
Themes: productivity, human_ai_collab, adoption
Generalizability
- Limited number of evaluation tasks (48 briefs) may not capture broader business or creative contexts.
- Domain-specific (professional HTML slide generation): findings may not transfer to other agentic tasks or modalities.
- Demonstrations sourced from a single high-end model (Claude Opus 4.6) may bias learned behaviors and limit external validity.
- Environment is an OpenEnv-compatible simulation with tool integrations; real-world deployment frictions and user workflows are not measured.
- Quality evaluation relies partly on LLM-based proxies and render checks rather than large-scale human-in-the-loop productivity measurements.
- Language, cultural, and sectoral diversity of briefs is not specified; results may not generalize across languages or industries.
- Short-term evaluation; durability, calibration drift, and adversarial/edge-case behavior over time are not assessed.

Claims (17)

Claim	Category	Direction	Confidence	Outcome	Details
Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide-generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality.	Output Quality	positive	high	Relative slide-generation quality (percent of Claude Opus 4.6 quality) across 48 briefs	n=48 91.2% 0.18
The RL fine-tuned Qwen2.5-Coder-7B improves 33.1% over the same base 7B model without RL fine-tuning.	Output Quality	positive	high	Absolute or relative quality improvement (%) of tuned vs. untuned Qwen2.5-Coder-7B	n=48 33.1% 0.18
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.	Output Quality	positive	medium	Predictive strength (correlation/importance) of instruction adherence and tool-use compliance vs. model parameter count for slide-generation quality	n=48 0.11
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.	Training Effectiveness	null_result	high	Proportion of model parameters updated during training (0.5%)	0.5% 0.18
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.	Other	null_result	high	Source of demonstration prompts (Claude Opus 4.6)	0.18
Evaluation was conducted on 48 diverse business briefs across six models.	Other	null_result	high	Number of evaluation tasks (48 briefs) and number of models compared (6)	n=48 0.18
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.	Other	null_result	high	Number of rollout trajectories in dataset (288) and coverage across models (6)	n=288 288 trajectories 0.18
Code for the environment and experiments is released at the specified GitHub repository.	Other	null_result	high	Availability of experiment code (GitHub repo)	0.18
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.	Other	null_result	high	Environment capabilities: OpenEnv compatibility and tool-use support	0.18
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.	Other	null_result	high	Components of the reward signal used for RL training	0.18
The inverse-specification reward, where an LLM attempts to recover the original brief from generated slides, provides a holistic fidelity signal.	Output Quality	positive	medium	Accuracy of recovering original brief from generated slides (used as fidelity signal)	0.11
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.	Other	null_result	high	Human-quality proxy scores and comparative model rankings	0.18
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.	Firm Productivity	positive	medium	Quality gains after parameter-efficient fine-tuning and implied cost-effectiveness (ROI inference)	n=48 0.11
High-quality automated slide generation has potential to reduce time spent on business presentation creation and produce productivity gains with partial substitution of routine creative/knowledge-worker tasks.	Firm Productivity	positive	low	Potential time savings/productivity gains (not directly measured in the study)	0.05
If smaller tuned models can capture most performance of much larger systems, market power may shift toward specialized, cheaper models plus toolchains, promoting niche competition and verticalized offerings.	Market Structure	mixed	speculative	Market-structure shifts and competitive dynamics (speculative, not directly measured)	0.02
The inverse-specification reward offers a domain-agnostic, holistic metric for fidelity to user intent and is recommended for measurement of model value/service quality.	Other	positive	low	Utility of inverse-specification recovery accuracy as a fidelity metric (conceptual/recommendation)	0.05
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.	Research Productivity	positive	medium	Availability of artifacts that can be used to reproduce/extend the work	0.11

A 7B model fine-tuned with reinforcement learning produces slide decks nearly as good as a top-tier system (91.2% of Claude Opus 4.6) and improves 33% over its untuned variant, suggesting targeted alignment and tooling deliver large returns that scaling alone would not.