A general-purpose LLM-driven optimizer matches or outperforms specialized tools across six domains: it nearly triples ARC-AGI accuracy (32.5% to 89.5%), cuts cloud scheduling costs by ~40%, and generates CUDA kernels that match or beat PyTorch in 87% of cases. Multi-task search and actionable side information materially accelerate and improve outcomes.
Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
Summary
Main Finding
A single, domain-agnostic LLM-based system—optimize_anything—can treat many optimization problems as “improve this text artifact” and, using a unified API and Pareto-based LLM search, match or exceed specialized tools across fundamentally different domains. The system achieves large practical gains (e.g., nearly tripling ARC-AGI accuracy, cutting cloud costs by ~40%, and producing CUDA kernels that match or beat PyTorch) and shows that side information (diagnostic feedback) and multi-task search materially accelerate and improve optimization.
Key Points
- Unified formulation: Any solution that can be serialized as text (prompts, code, agent architectures, policies, SVGs, etc.) is an artifact x ∈ X. An evaluator f(x, e) returns a score and optional Side Information (SI). The same API handles single-task, multi-task, and generalization modes.
- Three modes:
  - Single-task: optimize one artifact for one problem (e.g., algorithm design).
  - Multi-task: optimize across a dataset of related tasks, sharing a Pareto frontier for cross-task transfer; yields specialized artifacts per task.
  - Generalization: train on a dataset and validate on held-out examples to produce a single artifact that generalizes.
- Side Information (SI) is first-class: evaluators can return text diagnostics, structured sub-scores, images, traces, profiler output, etc. SI yields targeted LLM reflection and repairs rather than blind scalar-driven search.
- Pareto-based search: maintain per-example/per-metric scores and a Pareto frontier so complementary strengths are preserved; reflect on small minibatches (2–3 examples) to generate focused improvements. Selection for mutation is proportional to how often candidates dominate across objectives.
- Practical algorithmic plumbing: refiner step to catch formatting/syntax errors, content-addressed evaluation caching, backend-agnostic adapter layer, and support for multimodal SI (for VLMs).
- Key empirical results (across six primary domains):
  - Agent architectures: ARC-AGI accuracy rises from 32.5% to 89.5% (+57 percentage points).
  - Cloud scheduling: discovered algorithms cut cloud costs by up to 40.2%.
  - CUDA kernels: 87% of generated kernels match or beat PyTorch baselines (multi-task > single-task).
  - Prompt optimization (AIME): GPT-4.1-mini accuracy rises from 46.67% to 60.00%.
  - Agent skills for codebase tasks: pass rates rise to 98.3% and 100% on two models; resolution time falls by 47%.
  - Circle packing: outperforms AlphaEvolve under matched conditions.
- Ablations:
  - SI vs score-only: SI produces 4–6× faster convergence and substantially higher final scores.
  - Multi-task vs single-task: multi-task search outperforms independent single-task optimization given equivalent per-problem budget; benefits scale with the number of related tasks.
- Open-source release: optimize_anything is available as part of the GEPA project at https://github.com/gepa-ai/gepa.
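The evaluator contract described in the key points (an artifact goes in, a score plus Side Information comes out, and repeated evaluations are cached by content) can be illustrated with a small sketch. Everything named here is hypothetical and chosen for illustration: `Evaluation`, `evaluate_square`, `cache_key`, and `CachedEvaluator` are not the optimize_anything API, and the toy task (an artifact that must define a squaring function `f`) is an assumption.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Evaluation:
    """Result of f(x, e): a scalar score plus free-form Side Information."""
    score: float
    side_info: Dict[str, Any] = field(default_factory=dict)


def evaluate_square(artifact: str, example: Optional[dict] = None) -> Evaluation:
    """Toy evaluator (hypothetical): the artifact is Python source defining f(n)."""
    example = example or {"input": 3, "expected": 9}
    namespace: Dict[str, Any] = {}
    try:
        exec(artifact, namespace)  # illustration only; never exec untrusted text
        got = namespace["f"](example["input"])
    except Exception as exc:
        # Errors become diagnostic SI instead of a bare zero score.
        return Evaluation(0.0, {"error": f"{type(exc).__name__}: {exc}"})
    ok = got == example["expected"]
    return Evaluation(
        1.0 if ok else 0.0,
        {"input": example["input"], "expected": example["expected"], "got": got},
    )


def cache_key(artifact: str, example: Optional[dict]) -> str:
    """Content-addressed key: hash of the artifact text and the example."""
    payload = json.dumps({"artifact": artifact, "example": example}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedEvaluator:
    """Wraps an evaluator so identical (artifact, example) pairs run once."""

    def __init__(self, fn: Callable[[str, Optional[dict]], Evaluation]):
        self.fn = fn
        self.cache: Dict[str, Evaluation] = {}
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, artifact: str, example: Optional[dict] = None) -> Evaluation:
        key = cache_key(artifact, example)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.fn(artifact, example)
        return self.cache[key]
```

In this sketch a wrong artifact returns both a score of 0 and the observed output, which is the kind of feedback a proposer LLM could act on; a score-only evaluator would drop the `side_info` dictionary.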
Data & Methods
- Problem formalization: artifact space X (strings); evaluator f(x,e) → (score s(x,e), SI ι(x,e)). Modes correspond to whether E (examples) is empty, a dataset D, or Dtrain/Dval.
- Optimization backend: GEPA-inspired Pareto evolutionary search extended for arbitrary text artifacts; supports alternative backends via adapter layer.
- Candidate lifecycle:
  - Maintain a pool of candidates and a Pareto frontier defined over per-example/per-metric objectives.
  - Select candidates for mutation based on dominance frequency across objectives.
  - For a selected candidate and a minibatch M: execute, gather scores and SI, present them to the proposer LLM in a reflection prompt, get a proposed modification, run the refiner, re-evaluate, and add improved candidates to the pool.
- SI types used across domains: compiler errors, runtime traces, per-test-case results, profiler summaries, rendered images (SVG/3D) for VLM-enabled proposers.
- Concrete domains & evaluations (representative):
  - Agent skills and coding agents: evaluated on repository tasks; SI included traces, test outcomes, and runtime.
  - Cloud scheduling: evaluated by cost metrics on simulated/real workload traces.
  - CUDA kernels: tasks were PyTorch ops from KernelBench; evaluated for correctness and performance vs PyTorch.
  - Prompts: evaluated on held-out benchmarks (AIME).
  - Circle packing & mathematical optimization: single-task evolutionary comparisons (including direct reruns of baselines).
- Ablation methodology: compare SI vs score-only, multi-task vs single-task with matched per-problem budgets, proposer-model sensitivity, and cost tradeoffs. Quantified convergence speedups (4–6×) and final-performance differences.
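The selection step in the candidate lifecycle can be sketched as follows. This is a minimal illustration under a stated assumption, not the authors' implementation: "dominance frequency" is read here as the number of objectives (examples or metrics) on which a candidate attains the best score in the pool, with ties counted for every tied candidate. The names `frontier_weights` and `select_parent` are hypothetical.

```python
import random
from typing import Dict, List


def frontier_weights(scores: Dict[str, List[float]]) -> Dict[str, int]:
    """Count, per candidate, the objectives on which it ties for the best score.

    `scores` maps a candidate name to its per-objective scores; all lists
    are assumed to have the same length.
    """
    names = list(scores)
    n_objectives = len(scores[names[0]])
    weights = {name: 0 for name in names}
    for j in range(n_objectives):
        best = max(scores[name][j] for name in names)
        for name in names:
            if scores[name][j] == best:
                weights[name] += 1
    return weights


def select_parent(scores: Dict[str, List[float]], rng: random.Random) -> str:
    """Sample a candidate to mutate, proportional to its frontier weight.

    Candidates that are best on no objective get weight 0 and are never
    chosen, so only frontier members are mutated.
    """
    weights = frontier_weights(scores)
    frontier = [name for name, w in weights.items() if w > 0]
    return rng.choices(frontier, weights=[weights[n] for n in frontier], k=1)[0]


if __name__ == "__main__":
    pool = {
        "a": [1.0, 0.0, 0.5],  # best on objective 0, ties on objective 2
        "b": [0.0, 1.0, 0.5],  # best on objective 1, ties on objective 2
        "c": [0.2, 0.3, 0.1],  # best nowhere
    }
    print(frontier_weights(pool))
    print(select_parent(pool, random.Random(0)))
```

Keeping per-objective scores rather than a single average is what lets candidates "a" and "b" both survive here: each is strong where the other is weak, and a mean-score ranking would treat them as interchangeable.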
Implications for AI Economics
- Direct cost and productivity effects
  - Substantial cost savings at the infra level: a reported ~40% cloud-cost reduction from discovered schedulers suggests non-trivial operational savings if adopted. Firms running large-scale cloud workloads could see sizeable bottom-line impacts.
  - Developer productivity: automated improvement of code, prompts, and agent architectures (faster resolution, higher pass rates) implies reduced engineering time per task and faster iteration cycles.
  - R&D acceleration: the ability to discover algorithms and kernel implementations automatically lowers the marginal cost of experimentation and may shorten innovation cycles.
- Labor market and skill complementarities
  - Augmentation vs substitution: optimize_anything is likely to complement engineering roles by automating routine or exploratory optimization and bug-fixing; however, automation of design/search tasks could substitute for some optimization-specialist roles (performance engineers, certain R&D engineering tasks).
  - Changing skill mix: demand may shift toward roles that build evaluators, domain-specific SI pipelines, and assess and integrate LLM-discovered artifacts (validation, robustness, deployment).
- Market structure and concentration
  - Dependence on high-quality proposers: results hinge on powerful LLM proposers (Gemini, GPT-5, Claude). This could reinforce incumbency of firms hosting large LLMs (compute and model providers), concentrating value.
  - Democratization potential: the declarative API and seedless mode lower barriers for non-experts to optimize artifacts; open-source tooling may partially offset concentration, but compute costs remain a gating factor.
- Compute demand and price externalities
  - Increased search/rollouts raise compute consumption: multi-task searches that evaluate many candidates across tasks can substantially increase GPU/CPU usage; widespread adoption could raise aggregate demand for compute and energy.
  - Potential downward pressure on per-unit prices from efficiency gains may be offset by higher aggregate demand; equilibrium effects need empirical study.
- Firm strategy and competitive advantage
  - Firms that internalize optimize_anything-like workflows could quickly iterate on systems-level software (schedulers, kernels, agents) and extract cost or performance advantages, intensifying competition.
  - Proprietary evaluators and curated SI datasets become strategic assets: better evaluators (more informative SI) enable faster convergence and better outputs, creating lock-in.
- Policy, safety, and verification
  - Verification costs increase: automatically generated algorithms and code require rigorous testing and audit; regulators and firms must invest in verification pipelines.
  - Risk of brittle or unsafe solutions: especially in high-stakes applications (scheduling critical infrastructure, safety-critical agents), reliance on automatically discovered solutions raises governance needs.
- Research & measurement agenda for economists
  - Firm-level experiments: randomized rollout of optimize_anything vs baseline optimization workflows to quantify cost savings, productivity, and error rates.
  - Compute demand modeling: measure incremental compute and energy use per unit of automated optimization and model macro impacts on cloud pricing.
  - Labor impact studies: track occupational task content changes and wage premia for roles in designing evaluators vs roles replaced/augmented.
  - Market concentration analyses: study whether proprietary access to top-tier LLM proposers produces persistent competitive advantage, and evaluate countervailing open-source & caching effects.
  - Welfare accounting: weigh consumer & producer surplus from lower cloud costs and faster innovation against energy externalities, job displacement, and safety risks.
- Practical takeaways for stakeholders
  - For firms: invest in building high-quality evaluators and SI capture (this is where most marginal value lies), and pilot multi-task optimization to amplify cross-task learning.
  - For cloud providers/infrastructure teams: anticipate both demand shifts (more search workloads) and opportunities to offer differentiated scheduling optimizations (or managed optimize_anything services) as revenue streams.
  - For policymakers: consider standards and auditability requirements for automatically-generated system-level code and algorithms, and monitor compute-market concentration.
Overall, optimize_anything demonstrates that LLM-based, SI-driven, Pareto-aware optimization is a general-purpose tool with immediate, measurable economic consequences: it can reduce operational costs, change the nature of technical work, and alter competition and compute markets. Key economic questions remain about adoption dynamics, aggregate compute impacts, labor substitution vs complementarity, and how benefits/drawbacks distribute across firms and workers.
Assessment
Claims (10)
| Claim (category) | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| A single AI-based optimization system achieves state-of-the-art results across six diverse tasks. (Output Quality) | positive | high | task performance / state-of-the-art accuracy across six tasks | n=6; 0.12 |
| The system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%). (Output Quality) | positive | high | ARC-AGI accuracy | 32.5% to 89.5%; 0.12 |
| The system finds scheduling algorithms that cut cloud costs by 40%. (Organizational Efficiency) | positive | high | cloud cost (monetary cost) reduction | cut cloud costs by 40%; 0.12 |
| The system generates CUDA kernels where 87% match or beat PyTorch. (Output Quality) | positive | high | proportion of generated CUDA kernels that match or beat PyTorch performance | 87% match or beat PyTorch; 0.12 |
| The system outperforms AlphaEvolve's reported circle packing solution (n=26). (Output Quality) | positive | high | circle packing solution quality (optimization objective) | n=26; 0.12 |
| Ablations across three domains reveal that actionable side information yields faster convergence than score-only feedback. (Task Completion Time) | positive | high | convergence speed (time or iterations to converge) | n=3; 0.12 |
| Ablations across three domains reveal that actionable side information yields substantially higher final scores than score-only feedback. (Output Quality) | positive | high | final optimization score | n=3; 0.12 |
| Multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. (Task Allocation) | positive | high | optimization performance (e.g., score) under multi-task vs independent optimization | 0.12 |
| Text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework (claimed as a first-time result). (Innovation Output) | positive | medium | generality / applicability of LLM-based text optimization across problem types | n=6; 0.01 |
| The authors open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa. (Adoption Rate) | positive | high | availability of open-source code / tooling | 0.2 |