AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run. The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability. We also find that this boundary is not explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific rather than universal. These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control. We release the benchmark, harness, sweep configurations, and full run corpus.

Summary

Main Finding

AgentFloor defines a controlled six-tier capability ladder for agentic tool use and shows a clear, practical boundary: small and mid-sized open-weight models, which are substantially cheaper and faster, already handle most of the short-horizon, structured tool-use tasks that dominate production agent pipelines, while frontier models retain an advantage on long-horizon planning under persistent constraints. The best open-weight model in the corpus (gemma4:26b) is statistically equivalent to GPT-5 overall on the 30-task suite at the pre-registered ±10 pp margin, at far lower cost and latency; the remaining frontier advantage concentrates on the E (planning) tier and is intervention-sensitive.

Key Points

Benchmark design
- AgentFloor: 30 deterministic tasks, 6 tiers (A0, A, B, C, D, E), 5 tasks per tier; tiers add demands from instruction-following (A0) up to long-horizon planning with persistent constraints (E).
- Eight abstract deterministic tools operate over an in-memory fixture DB (no filesystem, no live APIs, no time-varying state, limited contamination risk).
- Step budgets per tier: A0/A/B/C/D/E = 1/2/4/6/8/10.
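The ladder and its step budgets can be written down as a small configuration. The sketch below is illustrative only: the `Episode` class and `take_step` method are hypothetical names, not the released harness's API, and it assumes one "step" is one model turn.

```python
from dataclasses import dataclass

# Step budgets per tier as listed above (A0/A/B/C/D/E = 1/2/4/6/8/10).
STEP_BUDGETS = {"A0": 1, "A": 2, "B": 4, "C": 6, "D": 8, "E": 10}
TASKS_PER_TIER = 5


@dataclass
class Episode:
    """Hypothetical per-run state; not the released harness's data model."""
    tier: str
    steps_used: int = 0

    def take_step(self) -> bool:
        """Consume one step; return False once the tier budget is exhausted."""
        if self.steps_used >= STEP_BUDGETS[self.tier]:
            return False
        self.steps_used += 1
        return True


# 5 tasks per tier across 6 tiers gives the 30-task suite.
total_tasks = TASKS_PER_TIER * len(STEP_BUDGETS)
```

A run that asks for a step after `take_step()` returns `False` would fall into the budget-exhausted failure category.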
Evaluation corpus
- 16 open-weight models (0.27B–32B parameters) served via Ollama + GPT-5 via OpenAI Responses.
- 16,542 scored runs total (12,000 small-language-model (SLM) runs in the main sweep + GPT-5 anchors + ablations).
- Scoring: binary Task Completion Rate (TCR) with strict pass conditions (final answer, submission validator, required tool sequence, no forbidden behaviors); some semantic checks judged by a cached deterministic GPT-5-nano judge.
- Statistical framework: paired analyses, bootstrap CIs (nboot=10,000), pre-registered TOST equivalence/non-inferiority at ±10 percentage-point margin, Holm–Bonferroni correction for families of tests.
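Because a run passes only when every pass condition holds, TCR is a mean over a strict conjunction. The following is a minimal sketch under that reading; `CheckResult` and its field names are hypothetical and stand in for whatever the released checkers return.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical container for the four pass conditions of one run."""
    final_answer_ok: bool
    submission_valid: bool
    tool_sequence_ok: bool
    no_forbidden_behavior: bool


def run_passes(check: CheckResult) -> bool:
    """A run counts as completed only if all four conditions hold."""
    return all(
        (
            check.final_answer_ok,
            check.submission_valid,
            check.tool_sequence_ok,
            check.no_forbidden_behavior,
        )
    )


def task_completion_rate(results: list[CheckResult]) -> float:
    """Fraction of runs that pass every condition (binary TCR)."""
    if not results:
        raise ValueError("TCR is undefined for an empty set of runs")
    return sum(run_passes(r) for r in results) / len(results)
```

Under this scheme a run with the right final answer but a missing required tool call still scores zero, which is what makes the metric strict.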
Main quantitative results
- Overall parity: gemma4:26b vs GPT-5 across all 30 tasks (n=270 paired observations) – paired Δ = +0.4 pp, 90% CI within ±10 pp → equivalence at pre-registered margin.
- Tier breakdown (high level):
  - A0 (no-tool instruction): open-weight models can exceed GPT-5 (gemma4:26b 100% vs GPT-5 80%).
  - A (single-tool): gemma4:26b equivalent to GPT-5 within ±10 pp.
  - B/C/D (2-tool chains, branching, multi-source synthesis): point estimates close (<9 pp difference), but per-tier sample sizes leave results inconclusive for formal equivalence at ±10 pp.
  - E (long-horizon planning): GPT-5 superior (GPT-5 ~10% TCR vs gemma4:26b 0%); neither side is reliably deployable in absolute terms.
- Smallest reliable models by tier (95% CI lower bound criterion):
  - A0: clears 80%+ at 4B (nemotron-3-nano:4B).
  - A: clears 80% at 3B (ministral-3:3B).
  - B: no model clears 80%; at a ~70% threshold, 2B suffices (qwen3.5:2B).
  - C/D/E: no model clears any threshold in the 60–90% range in this corpus.
- Failure taxonomy: prioritized cascade (F1 hallucinated-tool, F2 malformed call, F4 budget exhausted, F5 early resignation/plan-without-execute splits, F6 wrong tool, F7 partial completion, F3 residual).
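A prioritized cascade assigns each failed run the first matching label in a fixed order, so every failure gets exactly one code. The sketch below follows the order listed above (F1, F2, F4, F5, F6, F7, then F3 as residual); the trace keys and detector predicates are hypothetical placeholders, not the paper's actual detection rules.

```python
from typing import Callable

# Ordered (label, detector) pairs. Each detector is a hypothetical predicate
# over a run-trace dict; the real harness's criteria are not given here.
CASCADE: list[tuple[str, Callable[[dict], bool]]] = [
    ("F1_hallucinated_tool", lambda t: t.get("unknown_tool_called", False)),
    ("F2_malformed_call", lambda t: t.get("malformed_call", False)),
    ("F4_budget_exhausted", lambda t: t.get("budget_exhausted", False)),
    ("F5_early_resignation", lambda t: t.get("resigned_early", False)),
    ("F6_wrong_tool", lambda t: t.get("wrong_tool", False)),
    ("F7_partial_completion", lambda t: t.get("partial_completion", False)),
]


def classify_failure(trace: dict) -> str:
    """Return the first matching failure label, or the F3 residual bucket."""
    for label, detector in CASCADE:
        if detector(trace):
            return label
    return "F3_residual"
```

The ordering matters: a run that both emits a malformed call and picks the wrong tool is coded F2, because F2 sits higher in the cascade.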
Cost & latency
- Cost-per-passed-task (examples at posted/locked rates):
  - gemma4:26b on Mac: $0.0022 per passed task vs GPT-5 $0.0327 → ≈15× cheaper at matched aggregate accuracy (~60% TCR).
  - granite4:3B: $0.00046 per pass on Mac (≈71× cheaper than GPT-5) at lower accuracy (~40% TCR).
  - On cloud GPU ($2.50/hr) gemma4:26b ≈3× cheaper than GPT-5 per passed task at matched accuracy.
- Latency (per passed task): granite4:3B ≈3.3s; ministral-3:8B ≈7.1s; gemma4:26b ≈16.1s; GPT-5 ≈40.8s.
- Some configurations (e.g., qwen3:32B with reasoning enabled) can be both slower and less accurate than GPT-5.
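The "times cheaper" figures follow directly from the reported per-pass costs. The sketch below assumes cost per passed task means total spend divided by the number of passing runs, which the summary does not state explicitly; the function names are illustrative.

```python
def cost_per_passed_task(total_cost_usd: float, passed_runs: int) -> float:
    """Assumed definition: total spend divided by the number of passing runs."""
    if passed_runs <= 0:
        raise ValueError("no passing runs; cost per pass is undefined")
    return total_cost_usd / passed_runs


# Per-pass costs in USD as reported above (Mac figures for the open models).
REPORTED_COST_PER_PASS = {
    "gpt-5": 0.0327,
    "gemma4:26b": 0.0022,
    "granite4:3b": 0.00046,
}


def times_cheaper(model: str, baseline: str = "gpt-5") -> float:
    """Ratio of the baseline's per-pass cost to the given model's."""
    return REPORTED_COST_PER_PASS[baseline] / REPORTED_COST_PER_PASS[model]


if __name__ == "__main__":
    for name in ("gemma4:26b", "granite4:3b"):
        print(f"{name}: {times_cheaper(name):.1f}x cheaper per pass than GPT-5")
```

The ratios come out near 15 and 71, matching the rounded figures quoted in the bullets. Only the gemma4:26b comparison is at matched accuracy; the granite4:3B ratio buys a lower pass rate.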
Intervention sensitivity
- The E-tier gap is intervention-sensitive and model-specific: some targeted ablations help particular models; there is no universal intervention that closes the gap. A structured-decomposition prompt degraded performance across models in experiments.
Release
- Authors release the benchmark, harness, sweep configurations, and full run corpus.

Data & Methods

Benchmark structure
- 30 deterministic tasks, 6-tier ladder (A0–E), each tier adding a single cognitive demand.
- Tools: {search_records, lookup_record, get_attribute, list_options, check_constraint, compare_records, compute_value, submit_decision}.
- Fixtures parameterize tool behavior; no external APIs or time-based state.
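To make "abstract deterministic tools over an in-memory fixture" concrete, here is a minimal sketch. Only the eight tool names come from the list above; the `FixtureTools` class, the method signatures, and the record layout are assumptions for illustration, and only two tools are implemented.

```python
TOOL_NAMES = frozenset(
    {
        "search_records", "lookup_record", "get_attribute", "list_options",
        "check_constraint", "compare_records", "compute_value", "submit_decision",
    }
)


class FixtureTools:
    """Hypothetical tool layer backed by a plain dict; no I/O, no clock."""

    def __init__(self, records: dict[str, dict]):
        self._records = records

    def lookup_record(self, record_id: str) -> dict:
        # Return a copy so callers cannot mutate the fixture.
        return dict(self._records[record_id])

    def get_attribute(self, record_id: str, name: str):
        return self._records[record_id][name]

    def call(self, tool: str, **kwargs):
        """Dispatch a model-issued tool call by name."""
        if tool not in TOOL_NAMES:
            # A name outside the tool set corresponds to a hallucinated tool.
            raise ValueError(f"unknown tool: {tool}")
        handler = getattr(self, tool, None)
        if handler is None:
            raise NotImplementedError(f"{tool} is not implemented in this sketch")
        return handler(**kwargs)
```

Since the same fixture always yields the same tool outputs, repeated runs differ only in what the model does, which is what makes the tasks deterministic to score.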
Models and runs
- 16 open-weight models (0.27B–32B) + GPT-5.
- Main sweep: 12,000 SLM runs + 274 GPT-5 anchor runs; ablations increase total to 16,542 scored runs.
- Inference controlled by a single Python runner; native tool-calling only; temperature=0; identical system prompt across models except for specific ablations.
Scoring & statistics
- Binary TCR per run; four checker families must pass (final-answer equality, submission validator, trajectory/tool-sequence predicates, no forbidden behavior).
- Semantic checks routed to deterministic cached GPT-5-nano judge for a subset of predicates.
- Bootstrap CIs (10k resamples, seed 42); paired TOST equivalence/non-inferiority at ±10 pp; Holm–Bonferroni correction for multiple SLM comparisons.
- Failure modes coded by strict priority cascade.
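The statistical steps above can be sketched with the standard library alone. This is an approximation under stated assumptions, not the authors' code: it uses a percentile bootstrap on paired per-run differences, treats "the 90% CI lies inside ±10 pp" as the equivalence decision (the usual CI formulation of TOST at α = 0.05), and adds a generic Holm–Bonferroni step-down. All function names are illustrative.

```python
import random


def paired_bootstrap_ci(diffs, n_boot=10_000, alpha=0.10, seed=42):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return means[lo_idx], means[hi_idx]


def equivalent(diffs, margin=0.10, n_boot=10_000, seed=42):
    """Declare equivalence if the 90% CI lies strictly inside +/- margin."""
    lo, hi = paired_bootstrap_ci(diffs, n_boot=n_boot, alpha=0.10, seed=seed)
    return -margin < lo and hi < margin


def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure; returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject
```

Here each entry of `diffs` would be one paired observation, for example the open-weight pass indicator minus the GPT-5 pass indicator on the same task instance, so values are -1, 0, or 1. With only 45 pairs per tier the interval is wide, which is why mid-tier comparisons can stay inconclusive even when point estimates are close.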
Ablations and sensitivity
- Variant/paraphrase instances, instance-variation ablation on four tasks, structured-prompt and explicit-submission ablations, step-budget doubling on GPT-5, Qwen3 reasoning toggles.
Limitations called out by authors
- Deterministic abstract-tool environment intentionally excludes web/GUI grounding, API drift, and contamination risk; this narrows the benchmark's scope, so it does not cover open-world deployment complexity.
- Per-tier paired sample sizes (n=45 per tier for Frame A comparisons) limit power to declare equivalence on intermediate tiers.
- Some semantic adjudication depends on a cached deterministic judge (gpt-5-nano); judge imperfections may affect a small portion of semantic failure categories.

Implications for AI Economics

Routing and cost optimization
- Practical design principle: route short-horizon, structured tool calls (A0/A and many B) to small/mid-sized open-weight models and reserve frontier calls for long-horizon planning and complicated constraint-tracking (E-type) tasks.
- Large immediate cost savings are feasible: in this controlled benchmark, the best open-weight model matches GPT-5's aggregate accuracy at ≈3–15× lower cost per passed task depending on infrastructure, and smaller models are up to ≈71× cheaper per pass at lower accuracy. That implies material reductions in operating expenditures for high-throughput agent systems.
Product design and architecture
- Hybrid stacks become economically attractive: cheap open-weight models for the routine majority of calls, with a monitored escalation/failover path to frontier models for low-frequency, high-cognitive-demand cases.
- Latency benefits of local/smaller models can improve user experience for routine workflows, while frontier calls are reserved for use-cases where their advantage justifies higher latency and cost.
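A hybrid stack of this kind reduces to a routing rule plus an escalation path. The sketch below is one possible shape, not a design from the paper: the tier-to-model mapping (A0/A/B to the small model, everything else to the frontier model) is an assumption, and `call_small`, `call_frontier`, and `verify` are hypothetical callables supplied by the caller.

```python
from typing import Callable

# Assumption: tiers treated as routine enough for a small open-weight model.
SMALL_MODEL_TIERS = {"A0", "A", "B"}


def choose_model(tier: str) -> str:
    """Pick the first model to try for a task of the given tier."""
    return "small" if tier in SMALL_MODEL_TIERS else "frontier"


def run_with_escalation(
    task: str,
    tier: str,
    call_small: Callable[[str], str],
    call_frontier: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> tuple[str, str]:
    """Try the small model on routine tiers; escalate if verification fails."""
    if choose_model(tier) == "small":
        answer = call_small(task)
        if verify(task, answer):
            return "small", answer
    # Either the tier was routed to the frontier model up front,
    # or the small model's answer failed the lightweight check.
    return "frontier", call_frontier(task)
```

The returned model label makes it easy to log how often escalation fires, which is the quantity that determines whether the hybrid actually saves money.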
Market and procurement effects
- Demonstrable parity on many agentic tasks strengthens the value proposition for hosting and serving high-quality open-weight models (self-hosted or on lower-cost cloud instances), potentially shifting demand away from frontier-only deployment models for many enterprise agent workloads.
- Pricing and subscription models for frontier APIs may need to reflect the narrower set of tasks where their marginal utility is high (long-horizon planning, sustained constraint-tracking).
Risk management and monitoring
- Because failures and improvements are model-specific and interventions are not universally effective, economics of monitoring, calibration, and failover policies become critical. Investment in automated routing classifiers, lightweight verification layers, and policy-conditional escalation is warranted.
- The residual unreliability at the E tier implies a non-negligible cost of error (manual review, human-in-the-loop), which should factor into cost-benefit analyses for routing thresholds.
Research & investment priorities
- Investing in improving small-model reliability on longer-horizon planning (either via model improvements or targeted prompting/architectural scaffolding) could unlock larger economic returns than marginal frontier scaling.
- Standardized capability-and-cost maps (like AgentFloor) are valuable decision inputs for procurement and system-design tradeoffs; further work to extend to noisy real-world APIs, grounding, and latency/call-multiplicity effects will increase practical value.
Policy and ecosystem considerations
- If production workloads shift toward hybrid routing, demand for high-quality open-weight checkpoints, tooling for safe self-hosting, and commodified orchestration infrastructure will increase, with downstream effects on cloud GPU markets, hosting providers, and API pricing strategies.

Summary takeaway: AgentFloor provides controlled evidence that much of the routine, high-volume work in agent pipelines can be economically delegated to compact open-weight models without measurable loss on those tasks, while frontier models remain economically justifiable for an identifiable, narrower class of long-horizon planning problems. System designers and procurement/finance teams can use such capability-and-cost maps to reallocate compute spend, design hybrid routing, and prioritize investments in reducing the remaining long-horizon reliability gap.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides extensive empirical benchmarking (16,542 scored runs across 16 open-weight models and GPT-5) showing consistent patterns, which gives reasonably strong descriptive evidence about capability boundaries. However, the findings are about model capabilities on a synthetic, deterministic 30-task benchmark rather than causal impacts on real-world economic outcomes or production systems, and performance may not translate directly to in-the-wild deployments or other task distributions.

Methods Rigor: medium. Design is systematic: a structured six-tier capability ladder, a reasonably large and diverse model set (0.27B–32B open weights plus GPT-5), and many runs with deterministic tasks and scoring. Rigor is reduced by potential task selection bias, limited information about scoring rubrics and inter-rater reliability (if any), unknown comparability of GPT-5 evaluation, and the synthetic nature of tasks that may not capture real-world agent pipeline complexity or adversarial/failure modes.

Sample: AgentFloor benchmark: 30 deterministic tasks organized into six capability tiers (instruction following, tool use, multi-step coordination, long-horizon planning under persistent constraints). Evaluated 16 open-weight models ranging from 0.27B to 32B parameters, plus GPT-5, over 16,542 scored runs; includes sweep configurations, harness, and full run corpus released with the paper.

Themes: org_design, adoption, productivity, innovation

Generalizability
- Benchmark is synthetic/deterministic and may not reflect noisy, adversarial, or open-ended real-world production requests.
- Models evaluated are limited to open-weight models up to 32B and one frontier model (GPT-5); results may not generalize to other proprietary models or future architectures.
- Tooling, API latency, orchestration overheads, safety/sandboxing constraints, and cost structures in production deployments are not fully represented.
- Performance may vary by language, modality, or domain not covered by the 30 tasks.
- Some failures are model-specific and intervention-sensitive, so transferability of fixes across architectures is uncertain.

Claims (9)

Claim	Topic	Direction	Confidence	Outcome	Details
We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints.	Research Productivity	positive	high	benchmark construction (30 tasks, six-tier capability ladder)	n=30 0.3
We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs.	Research Productivity	positive	high	evaluation runs (model-by-task performance across 16,542 scored runs)	n=16542 16 open-weight models (0.27B to 32B) + GPT-5; 16,542 scored runs 0.3
Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines.	Task Allocation	positive	medium	ability to complete short-horizon, structured tool-use tasks on the AgentFloor benchmark	n=16542 0.11
In aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run.	Organizational Efficiency	null_result	medium	aggregate benchmark score (performance) and operational cost/latency	n=16542 matches GPT-5; "substantially cheaper and faster" (not numerically specified in abstract) 0.11
The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability.	Decision Quality	positive	high	performance on long-horizon planning tasks (ability to sustain coordination and track constraints over many steps)	n=16542 0.18
This boundary is not explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific rather than universal.	Task Allocation	mixed	medium	change in failure rate after targeted interventions (model-specific responsiveness)	0.11
These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control.	Task Allocation	positive	high	recommended task routing strategy for agentic systems (model assignment to task classes)	0.03
We release the benchmark, harness, sweep configurations, and full run corpus.	Research Productivity	positive	high	availability of released materials (benchmark and run corpus)	0.3
Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine.	Task Allocation	positive	medium	distribution of model-call types in production agentic systems (short/structured/routine vs. long-horizon/complex)	0.02