A 7B model fine-tuned with reinforcement learning produces slide decks nearly as good as a top-tier system (91.2% of Claude Opus 4.6) and improves 33% over its untuned variant, suggesting targeted alignment and tooling deliver large returns that scaling alone would not.
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm
Summary
Main Finding
Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide-generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality and improves 33.1% over the same base model without RL fine-tuning. Crucially, performance on this agentic task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
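The two headline percentages can be related with a little arithmetic. The sketch below is a back-of-the-envelope check, not a figure reported in the source: it assumes both percentages refer to the same aggregate quality score and that "improves 33.1%" is a relative gain over the untuned base model. It also checks that 48 briefs across six models is consistent with the 288 released trajectories (one rollout per brief per model, which is an assumption).

```python
# Back-of-the-envelope check of the headline numbers.
# Assumption: both percentages refer to the same aggregate quality score,
# and "improves 33.1%" is a relative gain over the untuned base model.

opus_score = 1.0                    # normalise Claude Opus 4.6 to 1.0
tuned_score = 0.912 * opus_score    # tuned 7B reaches 91.2% of Opus
relative_gain = 0.331               # tuned vs. untuned base model

# tuned = base * (1 + gain)  =>  base = tuned / (1 + gain)
implied_base_score = tuned_score / (1.0 + relative_gain)

# Assumption: one rollout per brief per model.
briefs, models = 48, 6
trajectories = briefs * models

print(f"implied base score: {implied_base_score:.1%} of Opus")
print(f"trajectories: {trajectories}")
```

Under these assumptions the untuned model would sit at roughly 68.5% of the reference system's score; if the 33.1% figure is defined differently in the paper, this number does not apply.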
Key Points
- Task and environment
  - Automated generation of professional HTML slide presentations that require research, content planning, visual design, and audience-aware communication.
  - Implemented in an OpenEnv-compatible RL environment where LLM agents use tools to research, plan, and render slides.
- Reward design
  - Multi-component reward combines:
    - Structural validation (format/structure checks)
    - Render quality assessment (how well generated HTML renders)
    - LLM-based aesthetic scoring
    - Content quality metrics (factuality, coverage, coherence)
    - Inverse specification reward: an LLM attempts to recover the original brief/spec from the generated slides; accuracy of recovery provides a holistic fidelity signal.
- Training setup
  - Fine-tuned Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters (parameter-efficient fine-tuning).
  - Training prompts derived from expert demonstrations collected using Claude Opus 4.6.
- Empirical results
  - Evaluation on 48 diverse business briefs across six models.
  - Fine-tuned 7B model achieves 91.2% of Claude Opus 4.6 quality and +33.1% over the untuned base 7B.
  - Cross-model comparison shows instruction adherence and tool-use compliance predict agentic task performance better than parameter count.
- Artifacts released
  - SlideRL dataset: 288 multi-turn rollout trajectories across six models.
    - Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts
    - Code: https://github.com/pushing-the-frontier/slide-forge-llm
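The reward design listed above can be sketched as a weighted sum of five component scores. This is a minimal illustration, not the authors' implementation: the equal weights, the `RewardComponents` and `inverse_spec_score` names, the brief field names, and the exact-match comparison are all assumptions. In the described system an LLM recovers the brief from the slides; here the recovered brief is simply passed in as a dictionary.

```python
from dataclasses import dataclass, asdict


@dataclass
class RewardComponents:
    """Per-rollout component scores, each assumed to lie in [0, 1]."""
    structural: float    # format/structure checks
    render: float        # how well the generated HTML renders
    aesthetic: float     # LLM-based aesthetic score
    content: float       # factuality, coverage, coherence
    inverse_spec: float  # how well the brief can be recovered from the slides


# Hypothetical equal weights; the source does not state the actual weighting.
WEIGHTS = {
    "structural": 0.2,
    "render": 0.2,
    "aesthetic": 0.2,
    "content": 0.2,
    "inverse_spec": 0.2,
}


def inverse_spec_score(original: dict, recovered: dict) -> float:
    """Fraction of brief fields recovered (case-insensitive exact match).

    A simple stand-in for whatever comparison the real reward uses.
    """
    if not original:
        return 0.0
    hits = sum(
        1
        for key, value in original.items()
        if str(recovered.get(key, "")).strip().lower() == str(value).strip().lower()
    )
    return hits / len(original)


def composite_reward(components: RewardComponents, weights: dict = WEIGHTS) -> float:
    """Weighted average of the component scores."""
    values = asdict(components)
    total_weight = sum(weights.values())
    return sum(weights[name] * values[name] for name in weights) / total_weight


# Illustrative (made-up) brief and recovered brief.
brief = {"topic": "Q3 sales review", "audience": "executives", "num_slides": 8}
recovered = {"topic": "q3 sales review", "audience": "Executives", "num_slides": 10}

inv = inverse_spec_score(brief, recovered)  # two of three fields match
reward = composite_reward(RewardComponents(1.0, 0.9, 0.7, 0.8, inv))
print(f"inverse-spec score: {inv:.3f}, composite reward: {reward:.3f}")
```

Keeping the components separate until the final sum makes it easy to log each signal per rollout and to re-weight them without touching the scoring functions.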
Data & Methods
- Data
  - 48 business briefs used as evaluation tasks, chosen to be diverse across business use-cases.
  - Expert demonstration prompts collected from Claude Opus 4.6 used to bootstrap training data.
  - SlideRL dataset: 288 multi-turn rollouts (trajectories) spanning six models for reproducibility and analysis.
- Methods
  - Environment: OpenEnv-compatible RL environment enabling tool use (web/knowledge access, rendering pipeline) and multi-turn planning.
  - Agent architecture: Qwen2.5-Coder-7B base model, parameter-efficient fine-tuning (0.5% of params) through GRPO.
  - Reward function: composite signal combining structural, render, aesthetic, content, and inverse-specification rewards to capture both low-level correctness and high-level fidelity.
  - Evaluation: human-quality proxies and comparisons against Claude Opus 4.6 and other baseline models across the 48 briefs; measured relative quality and improvements over base models.
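GRPO (Group Relative Policy Optimization) scores each rollout relative to the other rollouts sampled for the same prompt, which removes the need for a learned value function. The sketch below shows only that group-normalisation step, using made-up reward values; the group size, the `eps` constant, and the use of the population standard deviation are assumptions, and the source does not describe the training configuration at this level of detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of rollouts for the same prompt.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative composite rewards for four rollouts of the same brief (made up).
group_rewards = [0.42, 0.81, 0.65, 0.30]
advantages = group_relative_advantages(group_rewards)

for r, a in zip(group_rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Rollouts that beat their group's average get a positive advantage and are reinforced; those below it are pushed down. When every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.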
Implications for AI Economics
- Cost-effectiveness & deployment strategy
  - Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying high ROI for targeted fine-tuning versus full-model scaling. Firms can achieve near top-tier performance with smaller, cheaper models plus tailored training and tooling.
- Returns to alignment and tooling
  - Greater gains stem from instruction adherence and tool-use compliance than raw scale, suggesting investments in instruction engineering, tool integration, and evaluation/reward design may produce larger marginal returns than increasing parameter counts.
- Labor and productivity
  - High-quality automated slide generation reduces time spent on business presentation creation and design; potential for productivity gains and partial substitution of routine creative/knowledge-worker tasks. Economic impact will depend on adoption, integration into workflows, and limits of generalization beyond evaluated briefs.
- Market structure & competition
  - If smaller tuned models can capture most of the performance of much larger systems, market power may shift: specialized, cheaper models plus toolchains could undercut demand for large general-purpose models, promoting niche competition and verticalized offerings.
- Evaluation & measurement
  - The inverse specification reward offers a domain-agnostic, holistic metric for fidelity to user intent; economists and firms should consider such task-recovery-based evaluation metrics when measuring model value and service quality.
- Policy & externalities
  - Easier and cheaper deployment of capable agentic systems raises questions about misuse (e.g., persuasive content generation) and labor market effects; policymakers should monitor adoption patterns and consider upskilling/transition support in affected occupations.
- Research & reproducibility
  - Open dataset and code improve reproducibility and lower barriers for follow-up work; valuable for empirical work on diffusion, adoption, and economic impacts of applied LLM tools.
Assessment
Claims (17)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide-generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality. [Output Quality] | positive | high | Relative slide-generation quality (percent of Claude Opus 4.6 quality) across 48 briefs | n=48; 91.2%; 0.18 |
| The RL fine-tuned Qwen2.5-Coder-7B improves 33.1% over the same base 7B model without RL fine-tuning. [Output Quality] | positive | high | Absolute or relative quality improvement (%) of tuned vs. untuned Qwen2.5-Coder-7B | n=48; 33.1%; 0.18 |
| Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count. [Output Quality] | positive | medium | Predictive strength (correlation/importance) of instruction adherence and tool-use compliance vs. model parameter count for slide-generation quality | n=48; 0.11 |
| Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO. [Training Effectiveness] | null_result | high | Proportion of model parameters updated during training (0.5%) | 0.5%; 0.18 |
| Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data. [Other] | null_result | high | Source of demonstration prompts (Claude Opus 4.6) | 0.18 |
| Evaluation was conducted on 48 diverse business briefs across six models. [Other] | null_result | high | Number of evaluation tasks (48 briefs) and number of models compared (6) | n=48; 0.18 |
| The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility. [Other] | null_result | high | Number of rollout trajectories in dataset (288) and coverage across models (6) | n=288; 288 trajectories; 0.18 |
| Code for the environment and experiments is released at the specified GitHub repository. [Other] | null_result | high | Availability of experiment code (GitHub repo) | 0.18 |
| The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline. [Other] | null_result | high | Environment capabilities: OpenEnv compatibility and tool-use support | 0.18 |
| The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward. [Other] | null_result | high | Components of the reward signal used for RL training | 0.18 |
| The inverse-specification reward, where an LLM attempts to recover the original brief from generated slides, provides a holistic fidelity signal. [Output Quality] | positive | medium | Accuracy of recovering original brief from generated slides (used as fidelity signal) | 0.11 |
| Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines. [Other] | null_result | high | Human-quality proxy scores and comparative model rankings | 0.18 |
| Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling. [Firm Productivity] | positive | medium | Quality gains after parameter-efficient fine-tuning and implied cost-effectiveness (ROI inference) | n=48; 0.11 |
| High-quality automated slide generation has potential to reduce time spent on business presentation creation and produce productivity gains with partial substitution of routine creative/knowledge-worker tasks. [Firm Productivity] | positive | low | Potential time savings/productivity gains (not directly measured in the study) | 0.05 |
| If smaller tuned models can capture most performance of much larger systems, market power may shift toward specialized, cheaper models plus toolchains, promoting niche competition and verticalized offerings. [Market Structure] | mixed | speculative | Market-structure shifts and competitive dynamics (speculative, not directly measured) | 0.02 |
| The inverse-specification reward offers a domain-agnostic, holistic metric for fidelity to user intent and is recommended for measurement of model value/service quality. [Other] | positive | low | Utility of inverse-specification recovery accuracy as a fidelity metric (conceptual/recommendation) | 0.05 |
| Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies. [Research Productivity] | positive | medium | Availability of artifacts that can be used to reproduce/extend the work | 0.11 |