A lightweight router called Switchcraft routes tool-using agent queries to the cheapest model that still produces correct outputs, matching top-model accuracy (82.9%) while cutting inference costs by roughly 84% and saving about $3,600 per million queries under the study's pricing assumptions.
Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. Model routing can mitigate this, but existing routers are designed for chat completion rather than tool use. We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. Switchcraft operates inline, selecting the lowest-cost model subject to correctness. We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. Switchcraft achieves 82.9% accuracy -- matching or exceeding the best individual model -- while reducing inference cost by 84%, saving over $3,600 per million queries. We find that larger models do not consistently outperform smaller ones on tool-use tasks, and that nominally cheaper models can incur higher total cost due to token-intensive reasoning. Our work enables cost-aware agentic AI deployment without sacrificing correctness.
Summary
Main Finding
Switchcraft is a lightweight, agentic-aware model router that selects the cheapest LLM predicted to produce correct tool invocations. On a pooled set of eight candidate models and five function-calling benchmarks, a DistilBERT-based Switchcraft matches or exceeds the best single-model accuracy (82.9% vs. 82.3% for GPT-5.3-chat) while cutting average inference cost by 84% (≈$3,630 saved per million queries). It closes ~37% of the accuracy gap to an oracle that always picks the cheapest correct model.
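The headline savings follow from the average per-query costs reported under Key Points (6.8e-4 $/query for Switchcraft, 4.31e-3 $/query for GPT-5.3-chat). A quick check of that arithmetic, with variable names chosen here purely for illustration:

```python
# Average per-query costs as reported in this summary (USD per query).
switchcraft_cost = 6.8e-4
best_single_cost = 4.31e-3  # GPT-5.3-chat

# Relative cost reduction versus the best single model.
reduction = 1 - switchcraft_cost / best_single_cost

# Absolute savings scaled to one million queries.
savings_per_million = (best_single_cost - switchcraft_cost) * 1_000_000

print(f"cost reduction: {reduction:.1%}")
print(f"savings per million queries: ${savings_per_million:,.0f}")
```

This gives a reduction of roughly 84% and savings of about $3,630 per million queries, consistent with the figures quoted above.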
Key Points
- Problem addressed: Existing model routers target chat completion and are ill-suited for agentic tool-calling, where precise multi-step tool invocations and parameter correctness are critical.
- Router architecture: a two-stage pipeline:
  - A DistilBERT (66M parameters) multi-label classifier predicts which candidate models will be correct for a given agentic query/context.
  - Cost-aware selection picks the cheapest model among those predicted correct, using profiled per-query input/output token costs.
- Evaluation: Combined five function-calling benchmarks (BFCL v3, AWS ConFETTI, xLAM-60K, Glaive, Hermes), unified and normalized to a common schema; total ≈122k deduplicated examples; per-turn decomposition for multi-turn data.
- AST-based scoring: Developed a robust AST comparator for tool-call correctness, which fixed multiple biases in the BFCL checker and produced consistent correctness labels across datasets. It handles:
  - order-insensitive set/list comparisons,
  - default parameters,
  - string canonicalization and fuzzy matching (DistilBERT cosine-similarity threshold),
  - recursive nested-object comparison and schema-aware disambiguation.
- Performance:
  - Switchcraft (DistilBERT): 82.94% accuracy; average cost 6.8e-4 $/query.
  - Best single model (GPT-5.3-chat): 82.29% accuracy; average cost 4.31e-3 $/query (≈6× Switchcraft's cost).
  - Oracle upper bound: 89.39% accuracy; average cost 9.6e-4 $/query.
  - Latency: the DistilBERT router adds minimal overhead (P99 3–17 ms on NVIDIA T4), enabling high throughput (~722 qps).
  - Routing outcome breakdown (validation): 40.9% cheapest-correct model selected; 42.0% correct but not cheapest; 7.4% avoidable wrong-model selections; 9.7% cases where no model in the pool is correct.
- Important empirical findings:
  - Costlier or newer models are not always more accurate on tool calling (e.g., GPT-5.3-chat outperforms GPT-5.4).
  - Per-token pricing can be a poor proxy for per-query cost: "chattiness" (verbose output or long reasoning) increases total billed cost. A nominally cheaper model can end up more expensive if it emits many tokens.
  - Open-weight models (e.g., Kimi-K2.5, Qwen-3.5-9B) underperform on structured function calling due to format violations, hallucinated arguments, or refusal to call tools.
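The AST-based scoring described above can be illustrated with a small comparator. This is a minimal sketch, not the paper's checker: the call format (`{"name": ..., "arguments": {...}}`), the helper names `canon`, `with_defaults`, and `calls_match`, and the `search_flights` tool are all hypothetical. It treats every list as unordered and uses whitespace/case normalization as a stand-in for the embedding-based fuzzy matching, which is omitted here.

```python
from typing import Any


def canon(value: Any) -> Any:
    """Canonicalize a value so semantically equal arguments compare equal."""
    if isinstance(value, str):
        # Collapse whitespace and ignore case (simple stand-in for fuzzy matching).
        return " ".join(value.split()).lower()
    if isinstance(value, dict):
        # Recurse into nested objects; sort keys so repr() is deterministic.
        return {k: canon(v) for k, v in sorted(value.items())}
    if isinstance(value, (list, tuple)):
        # Sort canonical elements by repr so element order does not matter.
        return sorted((canon(v) for v in value), key=repr)
    return value


def with_defaults(args: dict, defaults: dict) -> dict:
    """Fill in schema defaults for parameters the call omitted."""
    return {**defaults, **args}


def calls_match(predicted: dict, expected: dict, defaults: dict) -> bool:
    """Compare two tool calls after applying defaults and canonicalization."""
    if predicted["name"] != expected["name"]:
        return False
    tool_defaults = defaults.get(expected["name"], {})
    pred_args = canon(with_defaults(predicted["arguments"], tool_defaults))
    exp_args = canon(with_defaults(expected["arguments"], tool_defaults))
    return pred_args == exp_args


# Hypothetical tool schema defaults and calls, for illustration only.
defaults = {"search_flights": {"cabin": "economy"}}
expected = {
    "name": "search_flights",
    "arguments": {"origin": "SFO", "stops": ["LAX", "DEN"], "cabin": "economy"},
}
predicted = {
    "name": "search_flights",
    "arguments": {"origin": " sfo ", "stops": ["DEN", "LAX"]},
}
print(calls_match(predicted, expected, defaults))
```

Here the predicted call omits a default parameter, reorders a list, and differs in string casing, yet is accepted as equivalent to the ground truth. A real checker would consult the tool schema to decide which lists are genuinely order-sensitive.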
Data & Methods
- Datasets: Aggregated five public function-calling benchmarks (total 157,101 entries before dedup; 122,267 after dedup), covering single- and multi-turn, parallel API calls, and a mix of synthetic and human data. Cleaned and normalized each dataset to a BFCL-like JSONL schema; corrected many entries (e.g., 27% of ConFETTI).
- Candidate models: Eight LLMs spanning proprietary and open-weight families, including GPT-5.x variants, Qwen, Kimi, etc. All queries were executed across all candidate LLMs during fine-tuning data collection.
- Labeling: Static AST-based comparison between model output tool-calls and ground truth (rather than executing external APIs). The AST checker incorporates schema parsing and canonicalization to accept semantically equivalent calls.
- Router training: Fine-tuned DistilBERT as an eight-head multi-label classifier on (query, metadata, tool signatures, context) packed into 512 tokens. Input packing prioritized the latest user turn, then compact tool signatures, then recent turns, plus three numeric metadata features (length, num_tools, num_turns).
- Cost model: Profiled per-model per-query input/output token counts on training data, multiplied by list per-token prices to compute expected dollar cost per model per query; used to choose the cheapest predicted-correct model.
- Baselines and ablations: Compared to single models, simple heuristics (num turns / length / num tools), other encoders (ModernBERT, DeBERTa-v3), naive truncation (worse), and an oracle upper bound.
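The cost model and selection rule described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the `ModelProfile` class, the `route` function, the 0.5 decision threshold, the fallback rule, and all model names, prices, and token counts are assumptions made up for the example. It also uses per-model average token counts as a simplification of the profiled per-query counts.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    input_price: float        # USD per input token (hypothetical list price)
    output_price: float       # USD per output token (hypothetical list price)
    avg_input_tokens: float   # assumed to be profiled on training data
    avg_output_tokens: float  # assumed to be profiled on training data

    @property
    def cost_per_query(self) -> float:
        """Expected dollar cost of one query: tokens times per-token price."""
        return (self.input_price * self.avg_input_tokens
                + self.output_price * self.avg_output_tokens)


def route(probs: dict, profiles: list, threshold: float = 0.5) -> str:
    """Pick the cheapest model whose predicted correctness clears the threshold."""
    candidates = [p for p in profiles if probs[p.name] >= threshold]
    if not candidates:
        # Fallback (an assumption, not from the source): most likely-correct model.
        return max(profiles, key=lambda p: probs[p.name]).name
    return min(candidates, key=lambda p: p.cost_per_query).name


# Hypothetical pool: "chatty-small" has a lower per-token price than
# "terse-large" but emits far more output tokens per query.
profiles = [
    ModelProfile("terse-large", 2e-6, 8e-6, 1200, 60),
    ModelProfile("chatty-small", 5e-7, 2e-6, 1200, 1500),
    ModelProfile("terse-small", 5e-7, 2e-6, 1200, 50),
]
# Hypothetical classifier outputs for one query.
probs = {"terse-large": 0.93, "chatty-small": 0.81, "terse-small": 0.32}

for p in profiles:
    print(f"{p.name}: ${p.cost_per_query:.5f} per query")
print("routed to:", route(probs, profiles))
```

With these made-up numbers the query goes to `terse-large`: `terse-small` is cheapest but not predicted correct, and `chatty-small` costs more per query than `terse-large` despite its lower per-token price, which is the chattiness effect noted in the findings.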
Implications for AI Economics
- Substantial operational savings: Intelligent agentic routing can sharply reduce inference spending (84% reduction vs. a costly, high-accuracy model in this pool). For high-volume agentic applications, routers like Switchcraft materially lower marginal costs.
- Procurement and pricing incentives:
  - Per-token pricing can misalign incentives: verbose models can cost more per query even when their per-token price is lower. This suggests providers and customers may want pricing or SLAs that account for expected per-query behavior (e.g., per-call pricing tiers, caps, or incentives for concise structured output).
  - Buyers should avoid defaulting to the newest or largest models; task-specialized routing can yield better cost–accuracy trade-offs.
- Product design and resource allocation:
  - Service providers can increase GPU availability for high-value workloads by routing trivial agentic queries to smaller models, improving throughput and reducing queuing.
  - Providers could offer managed routing services or router-as-a-service, or bundle model pools and routing policies as a product.
- Market for model specialization and fine-tuning:
  - There is value in models explicitly optimized for structured function calling and concise tool invocation. Providers might offer cheaper, instruction-tuned variants targeted at agentic workloads.
- Externalities and safety:
  - Routing decisions change downstream risk profiles: misrouted agentic queries can lead to harmful irreversible actions (financial loss, system misconfiguration). Economic evaluations must include error-cost externalities; in high-risk domains one might set stricter thresholds (higher-cost fallbacks) or prefer multi-stage verification despite higher spend.
- Limits and future economic considerations:
  - The savings depend on the available model pool; as models evolve and new low-cost but high-quality models appear, the economics will shift. Switchcraft's gains are pool-dependent.
  - Router training relies on static AST scoring (no live tool execution), which may undercount failures that only appear when interacting with real APIs; this affects expected vs. realized downstream costs and liabilities.
  - Pricing and contract design should account for chattiness and encourage concise, deterministic tool-calling behavior (e.g., billing adjustments, response-length incentives).
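One way to make the error-cost point concrete is to fold an assumed cost of a wrong tool call into the selection objective. The sketch below is illustrative only: the objective (inference cost plus probability of error times error cost), the function names, and every number are assumptions, not something the study reports.

```python
def expected_total_cost(inference_cost: float, p_correct: float, error_cost: float) -> float:
    """Inference cost plus the expected cost of an incorrect tool call."""
    return inference_cost + (1.0 - p_correct) * error_cost


def pick_model(models: list, error_cost: float) -> str:
    """Choose the model with the lowest expected total cost."""
    best = min(
        models,
        key=lambda m: expected_total_cost(m["cost"], m["p_correct"], error_cost),
    )
    return best["name"]


# Hypothetical two-model pool (USD per query, predicted probability of correctness).
models = [
    {"name": "small", "cost": 0.0007, "p_correct": 0.80},
    {"name": "large", "cost": 0.0043, "p_correct": 0.95},
]

print(pick_model(models, error_cost=0.001))  # low-stakes domain
print(pick_model(models, error_cost=1.0))    # high-stakes domain
```

Under these assumed numbers the small model wins when a mistake is nearly free, while a high error cost shifts the choice to the more expensive, more reliable model, mirroring the "stricter thresholds" suggestion above.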
Overall, Switchcraft demonstrates that a small, specialized router can capture substantial cost efficiencies for agentic AI workloads without sacrificing accuracy—an important lever for enterprises and cloud providers to control inference spend and improve resource utilization.
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Agentic AI systems that invoke external tools are powerful but costly, leading developers to default to large models and overspend inference budgets. | Organizational Efficiency | negative | high | inference cost / developer tendency to use large models | 0.09 |
| Model routing can mitigate the cost of agentic tool use, but existing routers are designed for chat completion rather than tool use. | Organizational Efficiency | mixed | high | cost mitigation via model routing; applicability of existing routers to tool use | 0.18 |
| We present Switchcraft, the first (to the best of our knowledge) model router optimized for agentic tool calling. | Innovation Output | positive | high | availability of a router optimized for agentic tool calling | 0.18 |
| Switchcraft operates inline, selecting the lowest-cost model subject to correctness. | Organizational Efficiency | positive | high | model selection strategy (cost minimization constrained by correctness) | 0.18 |
| We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget. | Research Productivity | neutral | high | evaluation framework and classifier training/deployment | n=5; 0.3 |
| Switchcraft achieves 82.9% accuracy. | Output Quality | positive | high | accuracy | 82.9% accuracy; 0.18 |
| Switchcraft's accuracy matches or exceeds the best individual model. | Output Quality | positive | high | relative accuracy compared to individual models | 0.18 |
| Switchcraft reduces inference cost by 84%. | Organizational Efficiency | positive | high | inference cost reduction | 84% reduction; 0.18 |
| Switchcraft saves over $3,600 per million queries. | Organizational Efficiency | positive | high | monetary savings per million queries | $3,600 per million queries; 0.18 |
| Larger models do not consistently outperform smaller ones on tool-use tasks. | Output Quality | mixed | high | relative performance of larger vs smaller models on tool-use tasks | 0.18 |
| Nominally cheaper models can incur higher total cost due to token-intensive reasoning. | Organizational Efficiency | negative | high | total inference cost as a function of token usage and per-token price | 0.18 |
| Switchcraft enables cost-aware agentic AI deployment without sacrificing correctness. | Organizational Efficiency | positive | medium | trade-off between cost reduction and correctness (accuracy) | 0.11 |