A self-improving loop that lets a language-model agent rewrite its scaffold and fine-tune weights substantially outperforms prior siloed approaches across three technical domains, cutting GPU kernel runtimes by ~92% and delivering major gains in legal classification and RNA denoising; combining harness changes (to make the agent more agentic) with weight updates (to build domain intuition) yields much larger improvements than changing either alone.
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
Summary
Main Finding
SIA (Self Improving AI) is a closed-loop system in which a Feedback-Agent both rewrites an agent’s harness/scaffold (prompts, tool-dispatch, retry/parse logic) and triggers test-time weight updates (LoRA adapters via RL). Across three diverse domains—Chinese legal charge classification (LawBench), low-level GPU kernel optimisation (TriMul), and single-cell RNA denoising—interleaving harness edits with weight updates outperforms harness-only iteration. Reported gains (over the initial baseline) are large: +56.6% on LawBench, a 91.9% runtime reduction on GPU kernels (12,483 → 1,017 ms; 14.02× over the unoptimised initial), and +502% on denoising. Ablations show weight updates add improvements beyond what scaffold evolution delivers.
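The headline percentages above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the only inputs taken from the text are the two TriMul runtimes (12,483 ms and 1,017 ms). The ratio of those two runtimes comes out near 12.3×, so the quoted 14.02× presumably uses a different reference runtime that this summary does not state.

```python
def runtime_reduction(before_ms: float, after_ms: float) -> float:
    """Fractional runtime reduction; 0.5 means the runtime was halved."""
    return 1.0 - after_ms / before_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the optimised kernel is than the reference."""
    return before_ms / after_ms


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over a baseline; 5.02 corresponds to +502%."""
    return (improved - baseline) / baseline


# Runtimes quoted in the text for the TriMul kernel.
before_ms, after_ms = 12_483.0, 1_017.0
print(f"reduction: {runtime_reduction(before_ms, after_ms):.1%}")
print(f"speedup:   {speedup(before_ms, after_ms):.2f}x")

# A +502% gain means the metric is about 6.02 times its baseline value.
print(f"gain:      {relative_gain(1.0, 6.02):.0%}")
```

Note that a percentage runtime reduction and a percentage metric gain are different quantities: the first is bounded by 100%, the second is not.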
Key Points
- Two prior “silos” addressed self-improvement separately:
- Harness/scaffold editing (meta-agents rewrite agent code/prompt/tool logic) while keeping model weights fixed.
- Test-time training (RL / fine-tuning) that updates model weights while holding the scaffold fixed.
- SIA unifies the levers: a Feedback-Agent inspects full execution trajectories (every prompt, response, tool call, grader result) and chooses, each iteration, whether to (a) synthesize an improved harness or (b) run a weight-update training step (LoRA adapters).
- Feedback-Agent decisions are dynamic and conditioned on observed reward dynamics; harness and weight updates are freely interleaved rather than strictly sequential.
- Mechanistic separation observed:
- Harness edits improve external infrastructure: parsing, retries, tooling, and engineering hygiene that shape agentic behavior/search.
- Weight updates embed domain-specific intuition and internal representations that no prompt or scaffold could fully provide—critical for fine-grained distinctions and sparse/outcome-heavy reward problems.
- Empirical results (high level):
- LawBench (191-class Chinese charge classification): harness iteration raised accuracy substantially; subsequent RL weight updates (GRPO) pushed top-1 accuracy to ~70.1%, a further improvement over the harness-only result.
- TriMul CUDA kernel optimisation: harness improvements gave modest speedups; weight updates then cut the runtime to 1,017 ms (a 91.9% reduction vs the initial baseline).
- scRNA-seq denoising (MAGIC): combined loop improved denoising metric by ~502% over baseline.
- Infrastructure / models used: task-specific agent uses gpt-oss-120b (base) with LoRA adapters (rank 32). Meta- and Feedback-Agents use Claude Sonnet 4.6. Training and rollouts executed on H100s via a managed RL platform (Modal).
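The GRPO weight updates mentioned for LawBench rest on a group-relative advantage: several rollouts are sampled for the same prompt, and each rollout's reward is scored against the rest of its group. The sketch below shows only that normalisation step as it is commonly described. The function name, the use of the population standard deviation, and the `eps` constant are assumptions; the summary does not say which GRPO variant or hyperparameters SIA uses.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each reward against the mean and spread of its own group.

    Rollouts that beat the group average get a positive advantage and the
    rest get a negative one. If every rollout earns the same reward, all
    advantages are zero and the group carries no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, variants differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored 1.0 (correct) or 0.0 (wrong)
# by a deterministic verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Because the baseline is the group mean rather than a learned value function, this style of update needs only a verifier that can score each rollout, which fits the deterministic graders used in the benchmarks.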
Data & Methods
- System architecture and protocol:
- Meta-Agent M generates initial scaffold A1 from a task spec U and optional references R.
- Task-specific agent Ag executes on dataset D inside a sandbox; its trajectory τg (all prompts, model outputs, tool calls, and results) is recorded.
- Feedback-Agent F receives Ag, τg, and metrics Eg and emits either a scaffold rewrite Ag+1 (harness update) or a weight-update instruction that runs an RL procedure to produce new LoRA weights θk (weight update).
- Each generation follows Execution → Analysis → Improvement.
- Models & training:
- Base LLM: gpt-oss-120b (instruction-tuned) used for task agent calls.
- Meta/Feedback LLMs: Claude Sonnet 4.6.
- Weight updates: LoRA adapters (rank 32) are trained with RL-style objectives; Feedback-Agent selects algorithms (example: GRPO—group-relative policy optimization—for LawBench).
- Rollouts and gradient updates are executed on H100 GPUs using Modal; rewards come from deterministic verifiers (per-task graders).
- Benchmarks / metrics:
- LawBench: 191-class Chinese criminal charge classification; dataset split: 5,332 train / 913 test; metric: top-1 accuracy; previously reported SOTA ~0.450.
- AlphaEvolve TriMul (CUDA): kernel runtime measured on H100; score defined as 1500/runtime (higher is better); prior benchmark score ≈1.292.
- MAGIC scRNA-seq denoising: pancreas dataset; metric: normalized MSE, inverted and rescaled in the paper so that higher is better; prior reference ~0.24.
- Experimental design:
- The Feedback-Agent commonly begins with harness iterations; when harness progress stalls it switches to weight updates.
- Comparisons reported: Baseline (initial scaffold + base weights), SIA-H (best harness-only generation), and SIA-W+H (best achieved when interleaving harness + weight updates).
- Ablations isolate the marginal contribution of weights versus harness edits.
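The Execution → Analysis → Improvement protocol described above can be sketched as a small loop. Everything here is a stand-in under stated assumptions: in SIA the Feedback-Agent is a language model that reads the full trajectory, whereas `choose_action` below is a hand-written plateau heuristic, and `run_sia_loop`, `toy_execute`, `patience`, and `min_delta` are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Generation:
    index: int
    score: float  # metric reported by the verifier for this generation
    action: str   # improvement chosen afterwards: "harness" or "weights"


def choose_action(scores: List[float], patience: int = 2, min_delta: float = 0.01) -> str:
    """Heuristic stand-in for the Feedback-Agent's decision.

    Keep editing the harness while scores improve; once the last `patience`
    generations beat the earlier best by less than `min_delta`, request a
    weight update instead.
    """
    if len(scores) <= patience:
        return "harness"
    recent_best = max(scores[-patience:])
    earlier_best = max(scores[:-patience])
    return "weights" if recent_best - earlier_best < min_delta else "harness"


def run_sia_loop(
    execute: Callable[[int, int], Tuple[List[str], float]],
    generations: int,
) -> List[Generation]:
    harness_version, weights_version = 1, 0  # initial scaffold, base weights
    scores: List[float] = []
    history: List[Generation] = []
    for g in range(1, generations + 1):
        # Execution: run the agent and record its trajectory and metric.
        # The real Feedback-Agent reads the trajectory; this sketch ignores it.
        _trajectory, score = execute(harness_version, weights_version)
        scores.append(score)
        # Analysis: decide which lever to pull next.
        action = choose_action(scores)
        # Improvement: produce the next harness or the next set of weights.
        if action == "harness":
            harness_version += 1
        else:
            weights_version += 1
        history.append(Generation(g, score, action))
    return history


def toy_execute(harness_version: int, weights_version: int) -> Tuple[List[str], float]:
    """Toy task: harness gains saturate at version 3, weight gains do not."""
    score = 0.1 * min(harness_version, 3) + 0.2 * weights_version
    trajectory = [f"harness=v{harness_version}", f"weights=v{weights_version}"]
    return trajectory, score


history = run_sia_loop(toy_execute, generations=6)
print([g.action for g in history])
```

In this toy run the heuristic picks harness updates for the first four generations, switches to a weight update at the fifth once harness gains have flattened, and then returns to harness edits, mirroring the interleaving pattern described in the experimental design.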
Implications for AI Economics
- Reduces human bottlenecks in agent development:
- Automating scaffold engineering + model adaptation lowers the marginal human labor needed for prompt/tool engineering and per-task fine-tuning, potentially reducing demand for routine prompt-engineering and harness maintenance work.
- Shifts R&D cost composition toward compute and data/verifier design:
- SIA requires substantial compute (H100s for RL updates, many rollouts) and reliable verifiers. Firms may substitute human labor costs with increased capital expenditures on compute and engineering of verifiers/sandboxing.
- This favors organizations with access to large compute pools (increasing returns to scale), accelerating concentration if compute is the tight resource.
- Productivity and speed of iteration:
- Faster automated improvement loops can accelerate product development cycles (e.g., kernels, domain-specific models, data-cleaning pipelines), increasing the pace of innovation and lowering time-to-deployment for domain-specialized agents.
- Labor market and skill effects:
- Demand may shift from repetitive tuning and scaffold design toward higher-level roles: designing verifiers, constructing safe reward functions, specifying task distributions, and auditing/overseeing self-improving loops.
- Domain experts may face partial substitution where tasks admit clear verifiers (e.g., code performance, some evaluation metrics), but creative or ambiguous domains requiring richer supervision may still need humans.
- Market structure and competitive dynamics:
- Firms that internalize both model weights and the tooling to run self-improvement loops can extract more value; proprietary compute and verifier datasets become strategic assets. This could widen productivity gaps and lead to winner-take-most dynamics.
- Externalities, governance and safety economics:
- Self-improving agents raise new regulatory and assurance costs: verification pipelines, monitoring for distributional drift, and mechanisms to prevent undesirable self-modifications are necessary. These compliance and risk-mitigation costs will factor into economic assessments.
- Mis-specified verifiers or reward hacking could produce externalities (faulty models deployed at scale), which increases the social cost of lower-supervision automation.
- Research & public-good considerations:
- If self-improvement reduces the marginal cost to produce high-performing, domain-specific agents, it could accelerate downstream applications in biotech, law, and systems engineering—both beneficial (productivity/innovation) and risky (misuse, rapid deployment without oversight).
- Open access to such systems would democratize agent-building; closed, compute-intensive stacks could instead consolidate power.
Caveats and limitations to weigh when assessing economic impact:
- SIA requires robust deterministic verifiers or graders; many real-world tasks lack simple automated verifiers.
- The approach is compute-heavy; cost reductions from human automation may be offset by higher infrastructure spend.
- Safety, overfitting to benchmarks, and reward specification remain practical constraints; auditors and fallback human-in-the-loop processes may still be required.
- Empirical results are reported on three well-defined benchmarks; generalization and scaling behavior across broader, less-structured domains are not yet established.
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. | Other | negative | high | humans-as-bottleneck in AI development | 0.06 |
| Two largely disjoint research lines attack this bottleneck: the harness-update school (a meta-agent rewrites the scaffold while model weights are fixed) and the test-time training school (hand-written RL pipelines update model weights while the harness is fixed). | Other | null_result | high | classification of prior research approaches | 0.12 |
| These two silos (harness-update and test-time training) operate in isolation. | Other | null_result | high | degree of integration between research lines | 0.06 |
| We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. | Other | null_result | high | capability of an agent to update both harness and weights | 0.02 |
| We evaluate SIA across three contrasting domains: Chinese legal charge classification (LawBench), low-level GPU kernel optimisation, and single-cell RNA denoising. | Other | null_result | high | domains/tasks used for evaluation | 0.12 |
| Combining both levers (harness updates and weight updates) outperforms scaffold iteration alone on all three benchmarks. | Other | positive | high | overall task performance relative to scaffold-only baseline | 0.12 |
| Combining both levers yields a 56.6% gain on LawBench (Chinese legal charge classification) over the initial baseline. | Other | positive | high | task performance on LawBench (unspecified metric in abstract) | 56.6%; 0.12 |
| Combining both levers yields a 91.9% runtime reduction on GPU kernels over the initial baseline. | Task Completion Time | positive | high | runtime for GPU kernels | 91.9% runtime reduction; 0.12 |
| Combining both levers yields a 502% improvement on single-cell RNA denoising over the initial baseline. | Other | positive | high | denoising performance for single-cell RNA data | 502%; 0.12 |
| Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil. | Other | positive | medium | mechanistic roles of harness updates vs weight updates | 0.04 |