The Commonplace

Per-query configuration selection slashes retrieval-agent serving costs without losing accuracy: BRANE picks pipeline settings at query time using LLM-derived features and lightweight predictors, matching top static configurations while reducing cost by up to 89% and outperforming routing and rule-based baselines.

Natural Language Query to Configuration for Retrieval Agents
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia · May 26, 2026
arXiv · Paper type: other · Evidence: medium · Relevance: 7/10
BRANE uses an LLM to extract query-level features and lightweight per-configuration predictors to select retrieval pipeline settings at inference time, achieving similar accuracy to the best fixed configuration at up to 89% lower serving cost across three QA benchmarks.

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.

Summary

Main Finding

BRANE (Building Retrieval Agents via Natural language Expressions) shows that selecting a full retrieval-pipeline configuration per query substantially improves the cost–quality tradeoff for retrieval agents. It does so with LLM-extracted, workload-specific binary query characteristics and lightweight per-configuration predictors. Across three benchmarks (MuSiQue, BrowseComp-Plus, FinanceBench), BRANE matches the accuracy of the best single static pipeline while cutting cost by up to ≈89% (more than 80% in other reported examples), and it pushes the cost–accuracy Pareto frontier beyond strong baselines (LLM routers, rule-based routers, and fine-tuned end-to-end models).

Key Points

  • Problem (Query2Conf): given a natural-language query and either an accuracy target or a budget, choose a pipeline configuration (LLM, retriever, k, hops, synthesis strategy, etc.) from a catalog to minimize cost subject to the accuracy target (or maximize accuracy under a budget).
  • Motivation:
    • Queries in the same workload vary widely in which configuration is optimal (per-query variance).
    • Most prior work focuses on LLM selection; sweeping the entire pipeline yields much larger cost–quality gains.
    • The useful query signals are workload-specific; generic labels collapse queries and give little discrimination.
  • BRANE core ideas:
    • Use a (frontier) LLM to generate a small set of workload-specific binary characteristics (e.g., requires_multi_hop, involves_regional_cuisine). Apply a cheaper LLM to label the whole sample on those d binary features to form F(q) ∈ {0,1}^d.
    • For each candidate configuration c (after Pareto pruning), train a lightweight classifier p̂_c : {0,1}^d → [0,1] to predict the probability configuration c answers q correctly (target = profiled correctness).
    • At inference choose π_λ(q) = argmax_c [p̂_c(F(q)) − λ · cost(c)] where λ ≥ 0 trades accuracy vs cost; sweep λ to trace the Pareto frontier and calibrate from accuracy/budget targets.
  • Engineering choices:
    • Offline profiling: run N sampled queries against every configuration to collect correctness and cost matrices; sample sizes reported 150–600 queries per workload in their experiments.
    • Pareto pruning: keep only configurations that are not Pareto-dominated on profiled cost and accuracy, which reduces the number of per-config predictors to train.
    • Per-config predictors are small tabular models (logistic regression, trees, random forest, XGBoost, LightGBM) selected by cross-validated negative log-loss.
  • Empirical highlights:
    • BRANE consistently dominates static pipelines’ Pareto frontier across MuSiQue, BrowseComp-Plus, FinanceBench.
    • Matches best fixed-pipeline accuracy at up to ~89% lower cost (MuSiQue); in one BrowseComp-Plus case reported in the paper, it reaches 0.7% higher accuracy at 81.7% lower cost.
    • Outperforms baselines including Carrot (LLM router), METIS (rule-based), Adaptive-RAG (T5 router), and fine-tuned Qwen3-4B/BERT end-to-end classifiers.
    • LLM-proposed binary characteristics outperform embeddings as predictor inputs; approach robust to choice of characterizer LLM and characteristic set size.
  • Practical costs: offline profiling can be expensive (example: profiling 600 queries × 60 configs cost ≈ $11k), but it is a one-time cost amortized across inference.
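
The selection rule π_λ(q) = argmax_c [p̂_c(F(q)) − λ · cost(c)] from the core-ideas bullets can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration: the two configuration names, the toy predictors (plain functions standing in for trained classifiers), and the cost figures are made up and are not taken from the paper.

```python
from typing import Callable, Dict, Sequence

# Hypothetical stand-in: a predictor maps binary features F(q) to an estimated
# probability that its configuration answers the query correctly.
Predictor = Callable[[Sequence[int]], float]


def select_config(
    features: Sequence[int],
    predictors: Dict[str, Predictor],
    mean_cost: Dict[str, float],
    lam: float,
) -> str:
    """Return argmax_c [p_hat_c(F(q)) - lam * cost(c)] over retained configs."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return max(
        predictors,
        key=lambda c: predictors[c](features) - lam * mean_cost[c],
    )


# Toy catalog with made-up numbers: a cheap single-hop pipeline and a costly
# multi-hop one. Feature 0 plays the role of `requires_multi_hop`.
predictors: Dict[str, Predictor] = {
    "cheap_single_hop": lambda f: 0.30 if f[0] else 0.90,
    "costly_multi_hop": lambda f: 0.85 if f[0] else 0.92,
}
mean_cost = {"cheap_single_hop": 0.01, "costly_multi_hop": 0.10}

easy_query = [0]  # requires_multi_hop = 0
hard_query = [1]  # requires_multi_hop = 1

choice_easy = select_config(easy_query, predictors, mean_cost, lam=1.0)
choice_hard = select_config(hard_query, predictors, mean_cost, lam=1.0)
choice_hard_frugal = select_config(hard_query, predictors, mean_cost, lam=10.0)
```

With these toy numbers, λ = 1 sends the easy query to the cheap pipeline and the hard query to the costly one, while λ = 10 pushes even the hard query to the cheap pipeline. That is the tunable tradeoff the summary describes: changing λ changes routing without retraining any predictor.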

Data & Methods

  • Benchmarks: MuSiQue, BrowseComp-Plus, FinanceBench. They profiled 150–600 queries per workload and multiple configurations (example spaces with ≈60 configurations).
  • Configuration space: varies LLMs, retrievers (dense/sparse), number of retrieved docs k, number of hops, and multiple synthesis strategies (LLM-only, per-chunk summaries, iterative/agent loops). Cost model uses dollar cost per query measured during profiling; inference substitutes mean profiled cost(c).
  • Offline pipeline:
    • Configuration profiling: run N queries × |C| configurations to collect correctness y_c(q) ∈ {0,1} and cost(q,c).
    • Characteristic discovery: a frontier LLM proposes d binary predicates from a small query sample; a cheaper LLM labels all profiled queries to produce F(q).
    • Per-configuration predictor training: for Pareto configurations, train p̂_c on {(F(q_i), y_c(q_i))}. Use automated model selection among tabular learners.
    • Calibration: sweep λ on the profiling sample to map accuracy (or budget) targets to λ.
  • Inference:
    • For each query, compute F(q) with one LLM call, evaluate p̂_c(F(q)) for each retained config, and select argmax_c [p̂_c(F(q)) − λ · mean_cost(c)].
  • Ablations: compared against embeddings for query representation, different characterizer LLMs and sizes, and different predictor families.
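
Two of the offline steps above, Pareto pruning and λ calibration, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the dominance test on (mean cost, accuracy), the rule of picking the cheapest λ on a grid that meets the accuracy target, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def pareto_prune(stats: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep configs not dominated on (mean_cost, accuracy), sorted by cost.

    A config is dominated if another one is no more expensive and no less
    accurate, and strictly better on at least one of the two.
    """
    kept = []
    for name, (cost, acc) in stats.items():
        dominated = any(
            o_cost <= cost and o_acc >= acc and (o_cost < cost or o_acc > acc)
            for other, (o_cost, o_acc) in stats.items()
            if other != name
        )
        if not dominated:
            kept.append(name)
    return sorted(kept, key=lambda n: stats[n][0])


def route(p_hat_q: Dict[str, float], mean_cost: Dict[str, float], lam: float) -> str:
    """Pick argmax_c [p_hat_c - lam * mean_cost(c)] for one query."""
    return max(p_hat_q, key=lambda c: p_hat_q[c] - lam * mean_cost[c])


def calibrate_lambda(
    p_hat: Sequence[Dict[str, float]],   # predicted correctness per query
    correct: Sequence[Dict[str, int]],   # profiled correctness y_c(q)
    mean_cost: Dict[str, float],
    lambdas: Sequence[float],
    target_acc: float,
) -> Optional[float]:
    """Assumed rule: cheapest lambda on the grid whose routed accuracy on the
    profiling sample meets target_acc; None if no grid value does."""
    best: Optional[Tuple[float, float]] = None  # (avg_cost, lam)
    for lam in lambdas:
        picks = [route(p, mean_cost, lam) for p in p_hat]
        acc = sum(y[c] for y, c in zip(correct, picks)) / len(picks)
        avg_cost = sum(mean_cost[c] for c in picks) / len(picks)
        if acc >= target_acc and (best is None or avg_cost < best[0]):
            best = (avg_cost, lam)
    return None if best is None else best[1]


# Made-up profiling summary: config -> (mean cost in dollars, accuracy).
stats = {"a": (0.01, 0.60), "b": (0.05, 0.55), "c": (0.05, 0.75),
         "d": (0.20, 0.80), "e": (0.25, 0.80)}
frontier = pareto_prune(stats)

# Made-up predictions and outcomes for four profiled queries, two configs.
mean_cost = {"cheap": 0.01, "costly": 0.10}
p_hat = [{"cheap": 0.9, "costly": 0.95}, {"cheap": 0.7, "costly": 0.9},
         {"cheap": 0.3, "costly": 0.9}, {"cheap": 0.2, "costly": 0.8}]
correct = [{"cheap": 1, "costly": 1}, {"cheap": 1, "costly": 1},
           {"cheap": 0, "costly": 1}, {"cheap": 0, "costly": 1}]
grid = [0.0, 1.0, 5.0, 10.0]
lam_strict = calibrate_lambda(p_hat, correct, mean_cost, grid, target_acc=0.95)
lam_loose = calibrate_lambda(p_hat, correct, mean_cost, grid, target_acc=0.40)
```

Routed accuracy is not guaranteed to be monotone in λ, which is why the sketch scans the whole grid and compares average cost rather than stopping at the first λ that meets the target.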

Implications for AI Economics

  • Large operational cost savings per workload: BRANE demonstrates that tailoring the full pipeline per query (not just LLM) unlocks substantial reductions in dollar cost while preserving or improving accuracy. For providers and applications with high query volume, the amortized savings can quickly outweigh the upfront profiling expense.
  • ROI tradeoffs: adopting BRANE requires an upfront profiling and predictor-training investment (time, compute, monetary). The paper quantifies a profiling example (∼$11k). Decision-makers should compare that one-time cost against expected per-query savings × projected query volume to compute payback period.
  • Pricing and SLAs: BRANE exposes an explicit mapping from accuracy targets to expected cost (via λ calibration). This enables:
    • Tiered product offerings (e.g., accuracy SLAs at different price points).
    • Dynamic pricing: select cheaper pipelines for soft-SLA traffic and more expensive ones for premium traffic.
    • Better capacity planning because cost and expected accuracy are predictable per query class.
  • Product design & provider incentives:
    • Cloud/LLM providers could offer configuration catalogs or expose per-knob pricing and performance telemetry to enable Query2Conf-style optimization.
    • Marketplaces could productize “config-as-a-service” where providers offer profiled pipelines and BRANE-like routing as a managed service.
  • Operational risk and governance:
    • Distribution shift: workload shifts (new query types or corpus changes) require re-profiling or online adaptation; failure to do so risks misrouting and SLA violations.
    • Profiling cost and privacy: profiling may be expensive and potentially sensitive if the corpus/queries are private; organizations must balance privacy and cost.
    • Latency and throughput: BRANE optimizes dollar cost vs accuracy; system operators must incorporate latency/throughput constraints into the design (future work).
  • Research & product directions that affect economics:
    • Lower-cost profiling methods (active sampling, transfer learning across similar workloads) would reduce upfront cost and improve adoption.
    • Joint optimization including latency, throughput, and monetary cost would better reflect production tradeoffs.
    • Mechanisms to sell or share profiled performance curves could enable cross-customer amortization and cheaper adoption for low-volume users.
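
The payback comparison described under ROI tradeoffs reduces to simple arithmetic, sketched below in Python. Only the ≈ $11k profiling figure comes from the summary; the static per-query cost, the assumed 80% reduction, the characterization overhead, and the daily volume are hypothetical placeholders to be replaced with measured values.

```python
import math


def payback_queries(
    upfront_cost: float,
    static_cost_per_query: float,
    reduction: float,
    characterization_cost_per_query: float = 0.0,
) -> float:
    """Number of queries needed for per-query savings to cover the upfront cost.

    `reduction` is the fractional cost cut versus the static pipeline. The
    optional overhead term models the extra LLM call used to compute F(q);
    leave it at 0.0 if the measured reduction already includes that call.
    """
    saving = static_cost_per_query * reduction - characterization_cost_per_query
    if saving <= 0:
        return math.inf  # the upfront cost is never recovered
    return upfront_cost / saving


# $11k profiling cost is from the summary; everything else is hypothetical.
n_queries = payback_queries(
    upfront_cost=11_000.0,
    static_cost_per_query=0.05,
    reduction=0.80,
    characterization_cost_per_query=0.001,
)
daily_volume = 10_000  # hypothetical traffic
payback_days = n_queries / daily_volume
```

Under these placeholder inputs the break-even point is roughly 282,000 queries, or about four weeks at 10,000 queries per day. A low-volume workload with the same per-query saving would take proportionally longer to recover the profiling cost.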

Limitations to consider before deployment: BRANE requires workload-specific profiling and offline training, uses a single LLM call per inference to obtain characterization (adds per-query cost), and uses mean profiled cost as a proxy for per-query cost. These are practical tradeoffs that must be weighed against the demonstrated per-query savings in production settings.

Assessment

Paper Type: other
Evidence Strength: medium. The paper reports controlled experimental comparisons across three standard retrieval-QA benchmarks (MuSiQue, BrowseComp-Plus, FinanceBench) and multiple baseline approaches, showing consistent Pareto improvements and large cost reductions; however, results are limited to these benchmarks and to the authors' cost model and pipeline catalog, leaving external validity and production robustness uncertain.
Methods Rigor: medium. The approach is evaluated against reasonable baselines (LLM-routing, rule-based, fine-tuned model) on multiple datasets and reports cost-quality frontiers, but the paper likely relies on dataset-specific labeled correctness, specific cost assumptions, and a fixed pipeline catalog; details on hyperparameter sensitivity, ablations, and real-world latency/throughput tradeoffs appear limited, which constrains assessment of robustness and reproducibility.
Sample: Experimental evaluations use three retrieval-augmented QA benchmarks: MuSiQue, BrowseComp-Plus, and FinanceBench; configurations come from a predefined catalog varying LLM, retriever, number of retrieved documents, number of hops, and synthesis strategy; BRANE trains lightweight per-configuration predictors using query features produced by an LLM and labeled correctness outcomes from those benchmarks.
Themes: productivity, adoption
Generalizability:
  • Evaluated only on retrieval-augmented QA benchmarks; may not generalize to other task types (e.g., summarization, dialog, code generation).
  • Performance depends on the specific pipeline catalog (models, retrievers, knobs) used in experiments; different catalogs could change conclusions.
  • Cost model and measured 'cost' likely reflect token/compute proxies and may not capture real-world pricing, latency, or infrastructure constraints.
  • Requires labeled correctness or sufficient calibration data per workload to train predictors, which may be costly in new domains.
  • Uses an LLM to extract query features; feature quality and resulting performance may vary with the choice or version of backbone LLM.

Claims (8)

Columns: Claim · Direction · Confidence · Outcome · Details

  • Claim: Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. [Other]
    • Direction: mixed · Confidence: high
    • Outcome: configuration choices' effect on answer quality and serving cost
    • Details: 0.02
  • Claim: Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. [Other]
    • Direction: neutral · Confidence: medium
    • Outcome: configuration tuning practice (workload-level vs per-query)
    • Details: 0.01
  • Claim: We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. [Other]
    • Direction: neutral · Confidence: high
    • Outcome: method (feature extraction and predictor training)
    • Details: 0.02
  • Claim: At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. [Other]
    • Direction: neutral · Confidence: high
    • Outcome: cost-quality tradeoff exposed by selection strategy
    • Details: 0.02
  • Claim: Across MuSiQue, BrowseComp-Plus, and FinanceBench, BRANE consistently pushes the cost-quality Pareto frontier. [Organizational Efficiency]
    • Direction: positive · Confidence: high
    • Outcome: cost-quality tradeoff / Pareto frontier position
    • Details: 0.12
  • Claim: BRANE matches the best fixed configuration's accuracy at up to 89% lower cost. [Organizational Efficiency]
    • Direction: positive · Confidence: high
    • Outcome: inference serving cost while maintaining accuracy
    • Details: up to 89% lower cost; 0.12
  • Claim: BRANE outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. [Output Quality]
    • Direction: positive · Confidence: high
    • Outcome: answer quality (accuracy) and/or cost-quality tradeoff relative to baselines
    • Details: 0.12
  • Claim: These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning. [Organizational Efficiency]
    • Direction: positive · Confidence: high
    • Outcome: practicality of per-query configuration vs static tuning
    • Details: 0.12

Notes