A lightweight router called Switchcraft routes tool-using agent queries to the cheapest model that still produces correct outputs. It reaches 82.9% accuracy, matching or exceeding the best single model, while cutting inference costs by roughly 84% and saving about $3,600 per million queries under the study's pricing assumptions.

Switchcraft: AI Model Router for Agentic Tool Calling

Sharad Agarwal, Pooria Namyar, Alec Wolman, Rahul Ambavat, Ankur Gupta, Qizheng Zhang · May 08, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Switchcraft is an inline, cost-aware model router for agentic tool calls. It reaches 82.9% correctness, matching or exceeding the best single model, while reducing inference cost by about 84%, which saves roughly $3,600 per million queries under the paper's pricing assumptions.

Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.

Summary

Main Finding

Switchcraft is a lightweight, agentic-aware model router that selects the cheapest LLM predicted to produce correct tool invocations. On a pool of eight candidate models evaluated across five function-calling benchmarks, the DistilBERT-based router matches or exceeds the best single-model accuracy (82.9% vs. 82.3% for GPT-5.3-chat) while cutting average inference cost by 84% (≈$3,630 saved per million queries). It closes ~37% of the accuracy gap to an oracle that always picks the cheapest correct model.
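
The headline cost figures follow from the average per-query costs reported for the router (6.8e-4 $/query) and for GPT-5.3-chat (4.31e-3 $/query). A minimal Python sketch of that arithmetic is shown here; the function and constant names are chosen for illustration and do not come from the paper.

```python
# Average per-query costs as reported in this summary (USD per query).
ROUTER_COST = 6.8e-4         # Switchcraft (DistilBERT)
BEST_SINGLE_COST = 4.31e-3   # GPT-5.3-chat, the best single model


def cost_reduction(router_cost, baseline_cost):
    """Fractional cost reduction of the router relative to a baseline model."""
    return 1.0 - router_cost / baseline_cost


def savings_per_million(router_cost, baseline_cost):
    """Dollar savings accumulated over one million queries."""
    return (baseline_cost - router_cost) * 1_000_000


reduction = cost_reduction(ROUTER_COST, BEST_SINGLE_COST)
savings = savings_per_million(ROUTER_COST, BEST_SINGLE_COST)
ratio = BEST_SINGLE_COST / ROUTER_COST

print(f"cost reduction: {reduction:.1%}")
print(f"savings per million queries: ${savings:,.0f}")
print(f"baseline / router cost ratio: {ratio:.1f}x")
```

The results round to an 84% reduction, about $3,630 per million queries, and a baseline that is roughly six times more expensive, in line with the reported figures.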

Key Points

Problem addressed: Existing model routers target chat completion and are ill-suited for agentic tool-calling, where precise multi-step tool invocations and parameter correctness are critical.
Router architecture: Two-stage pipeline:
- A DistilBERT (66M-parameter) multi-label classifier predicts which candidate models will be correct for a given agentic query/context.
- Cost-aware selection picks the cheapest model among those predicted correct, using profiled per-query input/output token costs.
Evaluation: Combined five function-calling benchmarks (BFCL v3, AWS ConFETTI, xLAM-60K, Glaive, Hermes), unified and normalized to a common schema; total ≈122k deduplicated examples; per-turn decomposition for multi-turn data.
AST-based scoring: Developed a robust AST comparator for tool-call correctness that handles:
- Order-insensitive set/list comparisons,
- Default parameters,
- String canonicalization and fuzzy matching (DistilBERT cosine threshold),
- Recursive nested-object comparison and schema-aware disambiguation.
This comparator fixed multiple biases in the BFCL checker and produced consistent correctness labels across datasets.
Performance:
- Switchcraft (DistilBERT) — 82.94% accuracy; avg cost 6.8e-4 $/query.
- Best single model — GPT-5.3-chat: 82.29% accuracy; avg cost 4.31e-3 $/query (≈6× more expensive).
- Oracle upper bound — 89.39% accuracy; avg cost 9.6e-4 $/query.
Latency: DistilBERT router adds minimal overhead (P99 3–17 ms on NVIDIA T4), enabling high throughput (~722 qps).
Error breakdown (validation): 40.9% cheapest-correct selected; 42.0% correct but not cheapest; 7.4% avoidable wrong-model selections; 9.7% cases where no model in pool is correct.
Important empirical findings:
- Costlier or newer models are not always more accurate on tool-calling (e.g., GPT-5.3-chat > GPT-5.4).
- Per-token pricing can be a poor proxy for per-query cost: "chattiness" (verbose output/long reasoning) increases total billed cost. A nominally cheaper model can end up more expensive if it emits many tokens.
- Open-weight models (e.g., Kimi-K2.5, Qwen-3.5-9B) underperform on structured function-calling due to format violations, hallucinated arguments, or refusal to call tools.
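
The AST-based scoring listed in the Key Points can be illustrated with a simplified comparator. The sketch below is not the paper's checker: it covers only order-insensitive list comparison, default parameters, basic string canonicalization, and recursive nested comparison. The DistilBERT-based fuzzy matching and schema-aware disambiguation are omitted, all lists are treated as unordered (a simplifying assumption), and every function name and the `get_weather` tool are hypothetical.

```python
def canonicalize(value):
    """Normalize a value so trivially different arguments compare equal."""
    if isinstance(value, str):
        # Case-fold and collapse whitespace; fuzzy matching is out of scope here.
        return " ".join(value.lower().split())
    if isinstance(value, dict):
        return {k: canonicalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [canonicalize(v) for v in value]
    return value


def values_match(a, b):
    """Recursively compare canonicalized values; lists are order-insensitive."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(values_match(a[k], b[k]) for k in a)
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return False
        remaining = list(b)
        for item in a:
            for i, candidate in enumerate(remaining):
                if values_match(item, candidate):
                    del remaining[i]  # each expected element may be used once
                    break
            else:
                return False
        return True
    return a == b


def with_defaults(arguments, defaults):
    """Fill in schema defaults for parameters the call left unspecified."""
    merged = dict(defaults)
    merged.update(arguments)
    return merged


def calls_match(predicted, expected, defaults_by_tool):
    """Compare two lists of tool calls, ignoring the order of the calls."""
    def normalize(call):
        defaults = defaults_by_tool.get(call["name"], {})
        arguments = with_defaults(call["arguments"], defaults)
        return {"name": call["name"], "arguments": canonicalize(arguments)}

    return values_match([normalize(c) for c in predicted],
                        [normalize(c) for c in expected])


# Hypothetical tool schema and calls, for illustration only.
defaults = {"get_weather": {"unit": "celsius"}}
expected = [{"name": "get_weather",
             "arguments": {"city": "Paris", "days": [1, 2], "unit": "celsius"}}]
predicted = [{"name": "get_weather",
              "arguments": {"city": " paris ", "days": [2, 1]}}]
print(calls_match(predicted, expected, defaults))
```

In this example the predicted call omits the default `unit`, reorders `days`, and spells the city with different casing and spacing, yet it is accepted as equivalent to the ground truth. A strict string or positional comparison would reject it, which is the kind of checker bias the summary describes.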

Data & Methods

Datasets: Aggregated five public function-calling benchmarks (total 157,101 entries before dedup; 122,267 after dedup), covering single- and multi-turn, parallel API calls, and a mix of synthetic and human data. Cleaned and normalized each dataset to a BFCL-like JSONL schema; corrected many entries (e.g., 27% of ConFETTI).
Candidate models: Eight LLMs spanning proprietary and open-weight families, including GPT-5.x variants, Qwen, Kimi, etc. All queries were executed across all candidate LLMs during fine-tuning data collection.
Labeling: Static AST-based comparison between model output tool-calls and ground truth (rather than executing external APIs). The AST checker incorporates schema parsing and canonicalization to accept semantically equivalent calls.
Router training: Fine-tuned DistilBERT as an eight-head multi-label classifier (one head per candidate model) on the query, metadata, tool signatures, and context, packed into 512 tokens. Input packing prioritized the latest user turn, then compact tool signatures, then recent turns, alongside three numeric metadata features (length, num_tools, num_turns).
Cost model: Profiled per-model per-query input/output token counts on training data, multiplied by list per-token prices to compute expected dollar cost per model per query; used to choose the cheapest predicted-correct model.
Baselines and ablations: Compared to single models, simple heuristics (num turns / length / num tools), other encoders (ModernBERT, DeBERTa-v3), naive truncation (worse), and an oracle upper bound.
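
The cost model and the cheapest-predicted-correct selection rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it uses average token counts per model rather than per-query profiles, the 0.5 decision threshold and the fallback behavior are assumptions, and the model names and prices are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    input_price_per_token: float   # USD per input token (list price)
    output_price_per_token: float  # USD per output token (list price)
    avg_input_tokens: float        # profiled on training data
    avg_output_tokens: float       # profiled; includes reasoning tokens

    def expected_cost(self) -> float:
        """Expected dollar cost of one query for this model."""
        return (self.avg_input_tokens * self.input_price_per_token
                + self.avg_output_tokens * self.output_price_per_token)


def select_model(profiles, correct_probs, threshold=0.5, fallback=None):
    """Return the name of the cheapest model predicted to be correct."""
    eligible = [p for p in profiles
                if correct_probs.get(p.name, 0.0) >= threshold]
    if not eligible:
        # Assumed policy: use the configured fallback, otherwise the model
        # with the highest predicted probability of being correct.
        if fallback is not None:
            return fallback
        best = max(profiles, key=lambda p: correct_probs.get(p.name, 0.0))
        return best.name
    return min(eligible, key=lambda p: p.expected_cost()).name


# Invented profiles: model-b is cheaper per token but emits far more tokens.
profiles = [
    ModelProfile("model-a", 1.0e-6, 4.0e-6, 1000, 100),
    ModelProfile("model-b", 0.5e-6, 2.0e-6, 1000, 2000),
]
probs = {"model-a": 0.9, "model-b": 0.8}

for p in profiles:
    print(p.name, f"{p.expected_cost():.4f}")
print("selected:", select_model(profiles, probs))
```

With these invented numbers, `model-b` has half the per-token price of `model-a` but costs about three times as much per query because of its output length, so the router selects `model-a`. This mirrors the finding that per-token pricing is a poor proxy for per-query cost.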

Implications for AI Economics

Substantial operational savings: Intelligent agentic routing can sharply reduce inference spending (84% reduction vs. a costly, high-accuracy model in this pool). For high-volume agentic applications, routers like Switchcraft materially lower marginal costs.
Procurement and pricing incentives:
- Per-token pricing can misalign incentives: verbose models can cost more per query even when their per-token price is lower. This suggests providers and customers may want pricing or SLAs that account for expected per-query behavior (e.g., per-call pricing tiers, caps, or incentives for concise structured output).
- Buyers should avoid defaulting to newest/largest models; task-specialized routing can yield better cost–accuracy trade-offs.
Product design and resource allocation:
- Service providers can increase GPU availability for high-value workloads by routing trivial agentic queries to smaller models, improving throughput and reducing queuing.
- Providers could offer managed routing services or router-as-a-service, or bundle model pools and routing policies as a product.
Market for model specialization and fine-tuning:
- There is value in models explicitly optimized for structured function-calling and concise tool invocation. Providers might offer cheaper, instruction-tuned variants targeted at agentic workloads.
Externalities and safety:
- Routing decisions change downstream risk profiles: misrouted agentic queries can lead to harmful irreversible actions (financial loss, system misconfiguration). Economic evaluations must include error-cost externalities; in high-risk domains one might set stricter thresholds (higher-cost fallbacks) or prefer multi-stage verification despite higher spend.
Limits & future economics considerations:
- The savings depend on the available model pool; as models evolve and new low-cost but high-quality models appear, the economics will shift. Switchcraft’s gains are pool-dependent.
- Router training relies on static AST scoring (no live tool execution), which may undercount failures that only appear when interacting with real APIs; this affects expected vs. realized downstream costs and liabilities.
- Pricing and contract design should account for chattiness and encourage concise, deterministic tool-calling behavior (e.g., billing adjustments, response length incentives).
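
One way to make the error-cost point above concrete is to route on expected total cost, that is, inference cost plus the expected cost of an incorrect tool call. The sketch below is an illustrative formulation and not something the paper proposes; the candidate names, costs, and probabilities are invented.

```python
def expected_total_cost(inference_cost, p_correct, error_cost):
    """Inference cost plus the expected cost of an incorrect tool call."""
    return inference_cost + (1.0 - p_correct) * error_cost


def route_with_error_cost(candidates, error_cost):
    """Pick the model with the lowest expected total cost.

    candidates maps model name -> (inference_cost_usd, predicted_p_correct).
    """
    return min(
        candidates,
        key=lambda name: expected_total_cost(*candidates[name], error_cost),
    )


# Invented candidates: a cheap model and a costlier, more reliable one.
candidates = {"small": (0.0005, 0.90), "large": (0.0040, 0.97)}

print(route_with_error_cost(candidates, error_cost=0.01))  # low-stakes domain
print(route_with_error_cost(candidates, error_cost=1.00))  # high-stakes domain
```

When a wrong call costs one cent, the cheaper model wins. When a wrong call costs one dollar, the costlier and more reliable model wins, which corresponds to the stricter thresholds and higher-cost fallbacks suggested for high-risk domains.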

Overall, Switchcraft demonstrates that a small, specialized router can capture substantial cost efficiencies for agentic AI workloads without sacrificing accuracy—an important lever for enterprises and cloud providers to control inference spend and improve resource utilization.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides clear empirical evaluation on five function-calling benchmarks and reports concrete accuracy and cost-savings metrics, which credibly demonstrate the technical claims within those datasets. However, the economic claim (cost savings at scale) rests on simulated/pricing assumptions and offline benchmarks rather than field deployment or causal analysis of firm-level outcomes, limiting external validity.
Methods Rigor: medium. The authors build an evaluation framework, train a DistilBERT-based classifier, and constrain operation under a latency budget, comparing against multiple base models; this is an appropriate experimental approach for a systems paper. Rigor is reduced by limited transparency in the abstract about dataset sizes, the exact benchmark construction, choices of baseline models and hyperparameters, and robustness checks, and by the lack of real-world deployment data or adversarial/longitudinal testing.
Sample: Evaluation uses five function-calling (agentic tool-use) benchmarks; a DistilBERT-based classifier is trained to route calls to different LLMs under a latency budget; comparisons include several base models (large and smaller/cheaper models); cost savings are estimated (e.g., $3,600 saved per million queries) under the paper's token/pricing assumptions. Exact benchmark names, dataset sizes, and model lists are not specified in the provided abstract.
Themes: adoption, productivity
Generalizability:
- Results are limited to the five chosen function-calling benchmarks and may not hold across other tool types or domains.
- Performance depends on the specific set of candidate models and their tokenization/cost characteristics; other model portfolios may change outcomes.
- Cost savings are sensitive to the cloud/pricing assumptions and the token/latency accounting used in simulations.
- Latency and infrastructure constraints in real deployments (networking, concurrency, cold starts) may alter router performance.
- Robustness to adversarial inputs, distribution shift, or changing tool APIs is unclear.
- The paper does not measure downstream user satisfaction, productivity, or firm-level adoption dynamics.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets.	Organizational Efficiency	negative	high	inference cost / developer tendency to use large models	0.09
Model routing can mitigate the cost of agentic tool use, but existing routers are designed for chat completion rather than tool use.	Organizational Efficiency	mixed	high	cost mitigation via model routing; applicability of existing routers to tool use	0.18
We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling.	Innovation Output	positive	high	availability of a router optimized for agentic tool calling	0.18
Switchcraft operates inline, selecting the lowest-cost model subject to correctness.	Organizational Efficiency	positive	high	model selection strategy (cost minimization constrained by correctness)	0.18
We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget.	Research Productivity	neutral	high	evaluation framework and classifier training/deployment	n=5 0.3
Switchcraft achieves 82.9% accuracy.	Output Quality	positive	high	accuracy	82.9% accuracy 0.18
Switchcraft's accuracy matches or exceeds the best individual model.	Output Quality	positive	high	relative accuracy compared to individual models	0.18
Switchcraft reduces inference cost by 84%.	Organizational Efficiency	positive	high	inference cost reduction	84% reduction 0.18
Switchcraft saves over $3,600 per million queries.	Organizational Efficiency	positive	high	monetary savings per million queries	$3,600 per million queries 0.18
Larger models do not consistently outperform smaller ones on tool-use tasks.	Output Quality	mixed	high	relative performance of larger vs smaller models on tool-use tasks	0.18
Nominally cheaper models can incur higher total cost due to token-intensive reasoning.	Organizational Efficiency	negative	high	total inference cost as a function of token usage and per-token price	0.18
Switchcraft enables cost-aware agentic AI deployment without sacrificing correctness.	Organizational Efficiency	positive	medium	trade-off between cost reduction and correctness (accuracy)	0.11