A self-improving loop that lets a language-model agent both rewrite its scaffold and fine-tune its weights outperforms scaffold-only iteration across three technical domains, cutting GPU kernel runtime by about 92% and delivering large gains in legal classification and RNA denoising. Harness changes make the agent more agentic, weight updates build domain intuition, and combining the two yields larger improvements than harness iteration alone.

SIA: Self Improving AI with Harness & Weight Updates

Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran · May 26, 2026

arXiv · Paper type: other · Evidence strength: medium · Relevance: 7/10

SIA, a self-improving loop in which a language-model Feedback-Agent updates both the task harness and the model weights, outperforms harness-only iteration on all three benchmarks, with large gains in legal classification accuracy, GPU kernel runtime, and single-cell RNA denoising quality.

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

Summary

Main Finding

SIA (Self Improving AI) is a closed-loop system in which a Feedback-Agent both rewrites an agent’s harness/scaffold (prompts, tool-dispatch, retry/parse logic) and triggers test-time weight updates (LoRA adapters via RL). Across three diverse domains—Chinese legal charge classification (LawBench), low-level GPU kernel optimisation (TriMul), and single-cell RNA denoising—interleaving harness edits with weight updates outperforms harness-only iteration. Reported gains (over the initial baseline) are large: +56.6% on LawBench, a 91.9% runtime reduction on GPU kernels (12,483 → 1,017 ms; 14.02× over the unoptimised initial), and +502% on denoising. Ablations show weight updates add improvements beyond what scaffold evolution delivers.

Key Points

Two prior “silos” addressed self-improvement separately:
- Harness/scaffold editing (meta-agents rewrite agent code/prompt/tool logic) while keeping model weights fixed.
- Test-time training (RL / fine-tuning) that updates model weights while holding the scaffold fixed.
SIA unifies the levers: a Feedback-Agent inspects full execution trajectories (every prompt, response, tool call, grader result) and chooses, each iteration, whether to (a) synthesize an improved harness or (b) run a weight-update training step (LoRA adapters).
Feedback-Agent decisions are dynamic and conditioned on observed reward dynamics; harness and weight updates are freely interleaved rather than strictly sequential.
Mechanistic separation observed:
- Harness edits improve external infrastructure: parsing, retries, tooling, and engineering hygiene that shape agentic behavior/search.
- Weight updates embed domain-specific intuition and internal representations that no prompt or scaffold could fully provide—critical for fine-grained distinctions and sparse/outcome-heavy reward problems.
Empirical results (high level):
- LawBench (191-class Chinese charge classification): harness iteration raised accuracy substantially, and subsequent RL weight updates (GRPO) pushed top-1 accuracy to ~70.1%, above the best harness-only generation.
- TriMul CUDA kernel optimisation: harness improvements gave modest speedups; weight updates then cut runtime to 1,017 ms, a 91.9% reduction relative to the initial baseline of 12,483 ms.
- scRNA-seq denoising (MAGIC): combined loop improved denoising metric by ~502% over baseline.
Infrastructure / models used: task-specific agent uses gpt-oss-120b (base) with LoRA adapters (rank 32). Meta- and Feedback-Agents use Claude Sonnet 4.6. Training and rollouts executed on H100s via a managed RL platform (Modal).
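To make "LoRA adapters (rank 32)" concrete, the sketch below counts the trainable parameters a rank-r adapter adds to one weight matrix, using the standard LoRA factorisation of the update into two low-rank matrices. The 4096 × 4096 layer shape is a hypothetical example, not a dimension taken from gpt-oss-120b or from the paper.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in a LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight matrix the adapter sits beside."""
    return d_in * d_out


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 32
adapter = lora_trainable_params(d_in, d_out, rank)
dense = full_matrix_params(d_in, d_out)
print(f"adapter params: {adapter:,}")
print(f"dense params:   {dense:,}")
print(f"trainable share: {100.0 * adapter / dense:.4f}%")
```

Under these assumed dimensions the adapter trains well under 2% of the layer's parameters, which is why per-task weight updates of this kind are cheap relative to full fine-tuning.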

Data & Methods

System architecture and protocol:
- Meta-Agent M generates initial scaffold A1 from a task spec U and optional references R.
- Task-specific agent Ag executes on dataset D inside a sandbox; its trajectory τg (all prompts, model outputs, tool calls, and results) is recorded.
- Feedback-Agent F receives Ag, τg, and metrics Eg and emits either a scaffold rewrite Ag+1 (harness update) or a weight-update instruction that runs an RL procedure to produce new LoRA weights θk (weight update).
- Each generation follows Execution → Analysis → Improvement.
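The Execution → Analysis → Improvement cycle described above can be sketched as a small control loop. This is an illustrative skeleton only: `run_sia_loop`, `Generation`, and the `evaluate` and `decide` callbacks are hypothetical names, and the integer version counters stand in for real scaffold code and LoRA checkpoints.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Generation:
    index: int            # generation number g
    harness_version: int  # which scaffold A_g was executed
    weights_version: int  # which LoRA checkpoint theta_k was used
    score: float          # metric E_g returned by the verifier
    action: str           # update that produced this generation


def run_sia_loop(
    evaluate: Callable[[int, int], float],
    decide: Callable[[List[Generation]], str],
    num_generations: int,
) -> List[Generation]:
    """Run Execution -> Analysis -> Improvement for a fixed number of generations."""
    harness, weights = 1, 0  # initial scaffold A_1 with base weights
    action = "init"
    history: List[Generation] = []
    for g in range(1, num_generations + 1):
        score = evaluate(harness, weights)  # Execution
        history.append(Generation(g, harness, weights, score, action))
        action = decide(history)            # Analysis
        if action == "harness":             # Improvement: rewrite the scaffold
            harness += 1
        elif action == "weights":           # Improvement: train a new adapter
            weights += 1
        else:
            raise ValueError(f"unknown action: {action}")
    return history


# Toy stand-ins: a made-up scoring rule and a fixed schedule.
history = run_sia_loop(
    evaluate=lambda h, w: 0.1 * h + 0.2 * w,
    decide=lambda hist: "harness" if len(hist) < 3 else "weights",
    num_generations=5,
)
for gen in history:
    print(gen)
```

In the system as described, `decide` would be the Feedback-Agent reading the full trajectory rather than a fixed schedule, and `evaluate` would run the task agent in its sandbox.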
Models & training:
- Base LLM: gpt-oss-120b (instruction-tuned) used for task agent calls.
- Meta/Feedback LLMs: Claude Sonnet 4.6.
- Weight updates: LoRA adapters (rank 32) are trained with RL-style objectives; Feedback-Agent selects algorithms (example: GRPO—group-relative policy optimization—for LawBench).
- Rollouts and gradient updates are executed on H100 GPUs using Modal; rewards come from deterministic verifiers (per-task graders).
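For readers unfamiliar with GRPO, its central step is to score each rollout relative to the other rollouts sampled for the same prompt, rather than against a learned value function. The sketch below shows only that normalisation step under simplifying assumptions (population standard deviation, a small epsilon for stability); it is not the paper's training code, and `group_relative_advantages` is a name chosen here for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise verifier rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, graded 1 (correct) or 0 (wrong) by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every rollout gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Correct rollouts receive positive advantages and incorrect ones negative, so a deterministic grader with binary outcomes is enough to drive the update.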
Benchmarks / metrics:
- LawBench: 191-class Chinese criminal charge classification; dataset 5,332 train / 913 test; metric: top-1 accuracy; baseline SOTA ~0.450 reported.
- AlphaEvolve TriMul (CUDA): kernel runtime measured on H100; score defined as 1500/runtime (higher is better); prior benchmark score ≈1.292.
- MAGIC scRNA-seq denoising: pancreas dataset; normalized MSE-based score (rescaled in the paper so that higher is better); prior reference ~0.24.
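The headline figures are simple ratios, and the helper functions below show how such numbers are derived. The function names are illustrative, and the score helper assumes the runtime is passed in whatever unit the benchmark's 1500/runtime definition uses, which this summary does not state.

```python
def runtime_reduction_pct(before: float, after: float) -> float:
    """Percentage by which runtime fell, relative to the starting runtime."""
    return 100.0 * (1.0 - after / before)


def trimul_score(runtime: float) -> float:
    """Benchmark score as defined above: 1500 / runtime (higher is better)."""
    return 1500.0 / runtime


def relative_gain_pct(baseline: float, final: float) -> float:
    """Percentage gain of a higher-is-better metric over its baseline."""
    return 100.0 * (final - baseline) / baseline


# Reported TriMul runtimes: 12,483 -> 1,017.
print(f"runtime reduction: {runtime_reduction_pct(12483, 1017):.1f}%")

# Made-up metric values, only to show how a relative gain is computed.
print(f"relative gain: {relative_gain_pct(0.40, 0.60):.1f}%")
```

Note that a percentage gain "over the initial baseline" is relative, so a +56.6% gain in accuracy is not the same as a 56.6 percentage-point increase.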
Experimental design:
- The Feedback-Agent commonly begins with harness iterations; when harness progress stalls it switches to weight updates.
- Comparisons reported: Baseline (initial scaffold + base weights), SIA-H (best harness-only generation), and SIA-W+H (best achieved when interleaving harness + weight updates).
- Ablations isolate the marginal contribution of weights versus harness edits.
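The switching behaviour described above (start with harness iterations, move to weight updates when progress stalls) can be approximated by a plateau rule. The real Feedback-Agent is a language model reasoning over full trajectories, so the fixed rule below is only a stand-in; `choose_action`, `window`, and `min_gain` are hypothetical names and thresholds, and scores are assumed to be higher-is-better.

```python
from typing import Sequence


def choose_action(scores: Sequence[float], window: int = 3, min_gain: float = 0.01) -> str:
    """Return "harness" while recent generations still improve, else "weights"."""
    if len(scores) <= window:
        return "harness"  # too little history: keep iterating on the scaffold
    best_recent = max(scores[-window:])
    best_before = max(scores[:-window])
    return "harness" if best_recent - best_before >= min_gain else "weights"


# Still improving: keep editing the harness.
print(choose_action([0.20, 0.30, 0.40, 0.45, 0.50, 0.55]))

# Stalled near 0.50: switch to a weight update.
print(choose_action([0.20, 0.40, 0.50, 0.50, 0.50, 0.505]))
```

A rule like this only captures the "stall then switch" pattern; it cannot reproduce the free interleaving of harness and weight updates that the summary attributes to the Feedback-Agent.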

Implications for AI Economics

Reduces human bottlenecks in agent development:
- Automating scaffold engineering + model adaptation lowers the marginal human labor needed for prompt/tool engineering and per-task fine-tuning, potentially reducing demand for routine prompt-engineering and harness maintenance work.
Shifts R&D cost composition toward compute and data/verifier design:
- SIA requires substantial compute (H100s for RL updates, many rollouts) and reliable verifiers. Firms may substitute human labor costs with increased capital expenditures on compute and engineering of verifiers/sandboxing.
- This favors organizations with access to large compute pools (increasing returns to scale), accelerating concentration if compute is the tight resource.
Productivity and speed of iteration:
- Faster automated improvement loops can accelerate product development cycles (e.g., kernels, domain-specific models, data-cleaning pipelines), increasing the pace of innovation and lowering time-to-deployment for domain-specialized agents.
Labor market and skill effects:
- Demand may shift from repetitive tuning and scaffold design toward higher-level roles: designing verifiers, constructing safe reward functions, specifying task distributions, and auditing/overseeing self-improving loops.
- Domain experts may face partial substitution where tasks admit clear verifiers (e.g., code performance, some evaluation metrics), but creative or ambiguous domains requiring richer supervision may still need humans.
Market structure and competitive dynamics:
- Firms that internalize both model weights and the tooling to run self-improvement loops can extract more value; proprietary compute and verifier datasets become strategic assets. This could widen productivity gaps and lead to winner-take-most dynamics.
Externalities, governance and safety economics:
- Self-improving agents raise new regulatory and assurance costs: verification pipelines, monitoring for distributional drift, and mechanisms to prevent undesirable self-modifications are necessary. These compliance and risk-mitigation costs will factor into economic assessments.
- Mis-specified verifiers or reward hacking could produce externalities (faulty models deployed at scale), which increases the social cost of lower-supervision automation.
Research & public-good considerations:
- If self-improvement reduces the marginal cost to produce high-performing, domain-specific agents, it could accelerate downstream applications in biotech, law, and systems engineering—both beneficial (productivity/innovation) and risky (misuse, rapid deployment without oversight).
- Open access to such systems would democratize agent-building; closed, compute-intensive stacks could instead consolidate power.

Caveats and limitations to weigh when assessing economic impact:
- SIA requires robust deterministic verifiers or graders; many real-world tasks lack simple automated verifiers.
- The approach is compute-heavy; cost reductions from human automation may be offset by higher infrastructure spend.
- Safety, overfitting to benchmarks, and reward specification remain practical constraints; auditors and fallback human-in-the-loop processes may still be required.
- Empirical results reported are on three well-defined benchmarks; generalization and scaling behavior across broader, less-structured domains is not yet established.


Assessment

Paper Type: other
Evidence Strength: medium. The paper reports large, consistent gains across three very different technical benchmarks (legal charge classification, GPU kernel optimization, single-cell RNA denoising), which supports the claim that jointly updating harness and weights can be beneficial; however, the evidence is limited by a small number of domains, potential selection/construction of benchmarks, unclear reporting of statistical variability, sensitivity to hyperparameters and compute budgets, and limited discussion of competing baselines and reproducibility.
Methods Rigor: medium. The evaluation uses controlled ablations (harness-only, weight-only, combined) and multiple domains, which is good experimental practice; but the paper (as summarized) lacks detail on sample sizes, random-seed aggregation and variance, baseline tuning parity, hyperparameter sweeps, and open-source artifacts for replication, reducing confidence in robustness and potential overfitting to chosen tasks.
Sample: Three technical benchmark domains: (1) Chinese legal charge classification using LawBench (a supervised classification dataset in Chinese legal texts); (2) low-level GPU kernel optimization tasks measuring runtime/performance of generated kernels; (3) single-cell RNA denoising tasks measuring reconstruction/denoising quality; models consist of a task-specific agent scaffold and underlying pretrained language/model weights that are subject to either scaffold (harness) updates, weight updates, or both via the proposed Feedback-Agent loop. Exact dataset sizes, train/validation/test splits, model sizes, and number of experimental seeds are not reported in the summary.
Themes: human_ai_collab, productivity
Identification: Controlled experimental comparisons and ablation: the authors implement SIA and compare its performance against harness-only (scaffold iteration) and weight-only (test-time training) baselines across three benchmark domains, reporting metric improvements; no formal causal inference methods, randomization beyond standard ML training seeds, or external instruments are used.
Generalizability:
- Only three technical, domain-specific benchmarks were tested (legal Chinese text, GPU kernels, single-cell RNA), so results may not generalize to broad, open-ended tasks (e.g., conversational agents, enterprise workflows).
- Outcomes may depend on specific pretrained models, scaffolding designs, and compute budgets used; different models or limited compute could change results.
- Potential benchmark selection bias: chosen tasks may favor benefits from combined harness+weight updates.
- Unclear robustness to hyperparameter choices, random seeds, or adversarial/real-world noise.
- Economic implications (productivity, labor effects) are indirect: no direct measurement of firm-level productivity, costs, or labor outcomes.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people.	Other	negative	high	humans-as-bottleneck in AI development	0.06
Two largely disjoint research lines attack this bottleneck: the harness-update school (a meta-agent rewrites the scaffold while model weights are fixed) and the test-time training school (hand-written RL pipelines update model weights while the harness is fixed).	Other	null_result	high	classification of prior research approaches	0.12
These two silos (harness-update and test-time training) operate in isolation.	Other	null_result	high	degree of integration between research lines	0.06
We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent.	Other	null_result	high	capability of an agent to update both harness and weights	0.02
We evaluate SIA across three contrasting domains: Chinese legal charge classification (LawBench), low-level GPU kernel optimisation, and single-cell RNA denoising.	Other	null_result	high	domains/tasks used for evaluation	0.12
Combining both levers (harness updates and weight updates) outperforms scaffold iteration alone on all three benchmarks.	Other	positive	high	overall task performance relative to scaffold-only baseline	0.12
Combining both levers yields a 56.6% gain on LawBench (Chinese legal charge classification) over the initial baseline.	Other	positive	high	task performance on LawBench (unspecified metric in abstract)	56.6% 0.12
Combining both levers yields a 91.9% runtime reduction on GPU kernels over the initial baseline.	Task Completion Time	positive	high	runtime for GPU kernels	91.9% runtime reduction 0.12
Combining both levers yields a 502% improvement on single-cell RNA denoising over the initial baseline.	Other	positive	high	denoising performance for single-cell RNA data	502% 0.12
Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.	Other	positive	medium	mechanistic roles of harness updates vs weight updates	0.04