Per-query configuration selection slashes retrieval-agent serving costs without losing accuracy: BRANE picks pipeline settings at query time using LLM-derived features and lightweight predictors, matching top static configurations while reducing cost by up to 89% and outperforming routing and rule-based baselines.
Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
Summary
Main Finding
BRANE (Building Retrieval Agents via Natural language Expressions) shows that per-query selection of full retrieval-pipeline configurations substantially improves the cost–quality tradeoff for retrieval agents. It combines LLM-extracted, workload-specific binary query characteristics with lightweight per-configuration predictors. Across three benchmarks (MuSiQue, BrowseComp-Plus, FinanceBench), BRANE matches the accuracy of the best single static pipeline while cutting cost by up to ≈89%, with reported examples above 80%, and it pushes the overall cost–accuracy Pareto frontier beyond strong baselines (LLM routers, rule-based routers, and fine-tuned end-to-end models).
Key Points
- Problem (Query2Conf): given a natural-language query and either an accuracy target or a budget, choose a pipeline configuration (LLM, retriever, k, hops, synthesis strategy, etc.) from a catalog to minimize cost subject to the accuracy target (or maximize accuracy under a budget).
- Motivation:
  - Queries in the same workload vary widely in which configuration is optimal (per-query variance).
  - Most prior work focuses on LLM selection; sweeping the entire pipeline yields much larger cost–quality gains.
  - The useful query signals are workload-specific; generic labels collapse queries and give little discrimination.
- BRANE core ideas:
  - Use a frontier LLM to generate a small set of d workload-specific binary characteristics (e.g., requires_multi_hop, involves_regional_cuisine), then apply a cheaper LLM to label every sampled query on those characteristics, forming F(q) ∈ {0,1}^d.
  - For each candidate configuration c (after Pareto pruning), train a lightweight classifier p̂_c : {0,1}^d → [0,1] that predicts the probability that configuration c answers q correctly (target = profiled correctness).
  - At inference, choose π_λ(q) = argmax_c [p̂_c(F(q)) − λ · cost(c)], where λ ≥ 0 trades accuracy against cost. Sweeping λ traces the Pareto frontier, and calibration maps an accuracy or budget target to a value of λ.
- Engineering choices:
  - Offline profiling: run N sampled queries against every configuration to collect correctness and cost matrices; the experiments report sample sizes of 150–600 queries per workload.
  - Pareto pruning: keep only configurations that are not dominated on cost and accuracy, which reduces the number of per-configuration predictors to train.
  - Per-configuration predictors are small tabular models (logistic regression, trees, random forest, XGBoost, LightGBM), selected by cross-validated negative log-loss.
- Empirical highlights:
  - BRANE consistently dominates the static pipelines' Pareto frontier across MuSiQue, BrowseComp-Plus, and FinanceBench.
  - It matches the best fixed-pipeline accuracy at up to ~89% lower cost (MuSiQue); in one BrowseComp-Plus case reported in the paper, it reaches 0.7% higher accuracy at 81.7% lower cost.
  - It outperforms baselines including Carrot (LLM router), METIS (rule-based), Adaptive-RAG (T5 router), and fine-tuned Qwen3-4B/BERT end-to-end classifiers.
  - LLM-proposed binary characteristics outperform embeddings as predictor inputs, and the approach is robust to the choice of characterizer LLM and to the size of the characteristic set.
  - Practical costs: offline profiling can be expensive (for example, profiling 600 queries × 60 configurations cost ≈ $11k), but it is a one-time cost amortized across inference.
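The selection rule π_λ(q) = argmax_c [p̂_c(F(q)) − λ · cost(c)] can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `select_config`, the two configuration names, the toy predictors, and the cost figures are hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type: a predictor maps a binary feature vector F(q) to the
# estimated probability that its configuration answers the query correctly.
Predictor = Callable[[Sequence[int]], float]


def select_config(
    features: Sequence[int],
    predictors: Dict[str, Predictor],
    mean_cost: Dict[str, float],
    lam: float,
) -> str:
    """Return argmax_c [p_hat_c(F(q)) - lam * cost(c)] over the given configs."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    return max(
        predictors,
        key=lambda c: predictors[c](features) - lam * mean_cost[c],
    )


# Toy stand-ins for trained predictors; the numbers are illustrative only.
# The single feature plays the role of a "requires_multi_hop" characteristic.
predictors: Dict[str, Predictor] = {
    "cheap": lambda f: 0.9 if f[0] == 0 else 0.3,  # strong on single-hop only
    "costly": lambda f: 0.95,                      # strong everywhere
}
mean_cost = {"cheap": 0.01, "costly": 0.20}  # assumed dollars per query

single_hop = (0,)
multi_hop = (1,)

print(select_config(single_hop, predictors, mean_cost, lam=1.0))  # cheap
print(select_config(multi_hop, predictors, mean_cost, lam=1.0))   # costly
print(select_config(single_hop, predictors, mean_cost, lam=0.0))  # costly
```

With λ = 0 the rule ignores cost and picks the configuration with the highest predicted correctness; raising λ shifts easy queries to the cheaper configuration, and no predictor is retrained when λ changes.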
Data & Methods
- Benchmarks: MuSiQue, BrowseComp-Plus, FinanceBench. The authors profiled 150–600 queries per workload across multiple configurations (example spaces contain ≈60 configurations).
- Configuration space: varies the LLM, the retriever (dense/sparse), the number of retrieved documents k, the number of hops, and the synthesis strategy (LLM-only, per-chunk summaries, iterative/agent loops). The cost model uses the dollar cost per query measured during profiling; at inference, the mean profiled cost of configuration c stands in for the per-query cost.
- Offline pipeline:
  - Configuration profiling: run N queries × |C| configurations to collect correctness y_c(q) ∈ {0,1} and cost(q, c).
  - Characteristic discovery: a frontier LLM proposes d binary predicates from a small query sample; a cheaper LLM labels all profiled queries to produce F(q).
  - Per-configuration predictor training: for each Pareto configuration c, train p̂_c on {(F(q_i), y_c(q_i))}, using automated model selection among tabular learners.
  - Calibration: sweep λ on the profiling sample to map accuracy (or budget) targets to λ.
- Inference:
  - For each query, compute F(q) with one LLM call, evaluate p̂_c(F(q)) for each retained configuration, and select argmax_c [p̂_c(F(q)) − λ · mean_cost(c)].
- Ablations: compared against embeddings for query representation, different characterizer LLMs and sizes, and different predictor families.
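Two of the offline steps, Pareto pruning and λ calibration, can be sketched on a toy profiling sample. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the four configurations, the correctness and cost values, the predicted probabilities, and the rule "pick the λ with the lowest mean cost whose accuracy on the profiling sample meets the target" are all hypothetical.

```python
from statistics import fmean
from typing import Dict, List, Optional, Sequence, Tuple


def pareto_prune(mean_cost: Dict[str, float], accuracy: Dict[str, float]) -> List[str]:
    """Keep configs that no other config beats on both cost and accuracy."""
    kept = []
    for c in mean_cost:
        dominated = any(
            mean_cost[o] <= mean_cost[c]
            and accuracy[o] >= accuracy[c]
            and (mean_cost[o] < mean_cost[c] or accuracy[o] > accuracy[c])
            for o in mean_cost
            if o != c
        )
        if not dominated:
            kept.append(c)
    return kept


def calibrate_lambda(
    probs: Dict[str, Sequence[float]],   # probs[c][i] = p_hat_c(F(q_i))
    correct: Dict[str, Sequence[int]],   # correct[c][i] = y_c(q_i)
    mean_cost: Dict[str, float],
    target_accuracy: float,
    grid: Sequence[float],
) -> Optional[Tuple[float, float, float]]:
    """Return (lambda, accuracy, mean cost) of the cheapest policy on the
    grid that meets the accuracy target, or None if none does."""
    n = len(next(iter(probs.values())))
    best = None
    for lam in grid:
        hits, spend = 0, 0.0
        for i in range(n):
            c = max(probs, key=lambda k: probs[k][i] - lam * mean_cost[k])
            hits += correct[c][i]
            spend += mean_cost[c]
        acc, avg_cost = hits / n, spend / n
        if acc >= target_accuracy and (best is None or avg_cost < best[2]):
            best = (lam, acc, avg_cost)
    return best


# Toy profiling matrices: 4 queries x 4 configurations (illustrative only).
correct = {"a": [1, 0, 0, 1], "b": [1, 1, 0, 1], "c": [1, 1, 1, 1], "d": [1, 0, 0, 0]}
cost = {"a": [0.01] * 4, "b": [0.05] * 4, "c": [0.20] * 4, "d": [0.06] * 4}
mean_cost = {c: fmean(v) for c, v in cost.items()}
accuracy = {c: fmean(v) for c, v in correct.items()}

kept = pareto_prune(mean_cost, accuracy)  # "d" is dominated by "a" and "b"

# Assumed predictor outputs for the retained configurations.
probs = {"a": [0.9, 0.2, 0.1, 0.8], "b": [0.9, 0.8, 0.2, 0.9], "c": [0.95, 0.9, 0.9, 0.95]}
best = calibrate_lambda(probs, correct, mean_cost, 0.9, [0.0, 1.0, 3.0, 10.0])
print(kept, best)
```

On this toy data the sweep keeps three configurations and selects λ = 3 for a 0.9 accuracy target, because larger λ values on the grid drop below the target. In practice the calibration would be evaluated on held-out profiled queries rather than the queries used to fit the predictors; that choice is outside what this summary specifies.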
Implications for AI Economics
- Large operational cost savings per workload: BRANE demonstrates that tailoring the full pipeline per query (not just the LLM) unlocks substantial reductions in dollar cost while preserving or improving accuracy. For providers and applications with high query volume, the amortized savings can quickly outweigh the upfront profiling expense.
- ROI tradeoffs: adopting BRANE requires an upfront investment in profiling and predictor training (time, compute, and money). The paper quantifies one profiling example (≈ $11k). Decision-makers should weigh that one-time cost against expected per-query savings × projected query volume to estimate the payback period.
- Pricing and SLAs: BRANE exposes an explicit mapping from accuracy targets to expected cost (via λ calibration). This enables:
  - Tiered product offerings (e.g., accuracy SLAs at different price points).
  - Dynamic pricing: select cheaper pipelines for soft-SLA traffic and more expensive ones for premium traffic.
  - Better capacity planning because cost and expected accuracy are predictable per query class.
- Product design & provider incentives:
  - Cloud/LLM providers could offer configuration catalogs or expose per-knob pricing and performance telemetry to enable Query2Conf-style optimization.
  - Marketplaces could productize “config-as-a-service” where providers offer profiled pipelines and BRANE-like routing as a managed service.
- Operational risk and governance:
  - Distribution shift: workload shifts (new query types or corpus changes) require re-profiling or online adaptation; failure to do so risks misrouting and SLA violations.
  - Profiling cost and privacy: profiling may be expensive and potentially sensitive if the corpus/queries are private; organizations must balance privacy and cost.
  - Latency and throughput: BRANE optimizes dollar cost vs accuracy; system operators must incorporate latency/throughput constraints into the design (future work).
- Research & product directions that affect economics:
  - Lower-cost profiling methods (active sampling, transfer learning across similar workloads) would reduce upfront cost and improve adoption.
  - Joint optimization including latency, throughput, and monetary cost would better reflect production tradeoffs.
  - Mechanisms to sell or share profiled performance curves could enable cross-customer amortization and cheaper adoption for low-volume users.
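The payback comparison described under ROI tradeoffs (one-time profiling cost against per-query savings × query volume) is simple arithmetic, sketched below. Only the ≈ $11k profiling figure comes from the summary; the function names, the per-query costs, the characterization overhead, and the daily volume are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math
from typing import Optional


def payback_queries(
    upfront_cost: float,
    baseline_cost_per_query: float,
    routed_cost_per_query: float,
    overhead_per_query: float = 0.0,
) -> Optional[int]:
    """Number of queries needed before per-query savings cover the upfront cost.

    overhead_per_query models the extra characterization call made at
    inference time. Returns None when routing does not save money.
    """
    saving = baseline_cost_per_query - routed_cost_per_query - overhead_per_query
    if saving <= 0:
        return None
    return math.ceil(upfront_cost / saving)


def payback_days(queries_needed: Optional[int], queries_per_day: int) -> Optional[int]:
    """Convert a break-even query count into whole days at a given volume."""
    if queries_needed is None:
        return None
    return math.ceil(queries_needed / queries_per_day)


# Upfront profiling cost from the summary; all other inputs are assumptions.
n = payback_queries(
    upfront_cost=11_000,
    baseline_cost_per_query=0.50,   # assumed static-pipeline cost
    routed_cost_per_query=0.25,     # assumed mean cost under per-query routing
    overhead_per_query=0.125,       # assumed cost of the characterization call
)
print(n)                         # 88000
print(payback_days(n, 10_000))   # 9
```

The same function shows why low-volume users may not break even: with a small per-query saving or a high characterization overhead, the break-even query count grows quickly or the function returns `None`.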
Limitations to consider before deployment: BRANE requires workload-specific profiling and offline training, spends one LLM call per query at inference to compute the characteristics (which adds per-query cost), and uses the mean profiled cost as a proxy for per-query cost. In production settings, these practical tradeoffs must be weighed against the demonstrated per-query savings.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. [Other] | mixed | high | configuration choices' effect on answer quality and serving cost | 0.02 |
| Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. [Other] | neutral | medium | configuration tuning practice (workload-level vs per-query) | 0.01 |
| We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. [Other] | neutral | high | method (feature extraction and predictor training) | 0.02 |
| At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. [Other] | neutral | high | cost-quality tradeoff exposed by selection strategy | 0.02 |
| Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier. [Organizational Efficiency] | positive | high | cost-quality tradeoff / Pareto frontier position | 0.12 |
| BRANE matches the best fixed configuration's accuracy at up to 89% lower cost. [Organizational Efficiency] | positive | high | inference serving cost while maintaining accuracy | up to 89% lower cost; 0.12 |
| BRANE outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. [Output Quality] | positive | high | answer quality (accuracy) and/or cost-quality tradeoff relative to baselines | 0.12 |
| These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning. [Organizational Efficiency] | positive | high | practicality of per-query configuration vs static tuning | 0.12 |