How you apply PRF matters more than where it comes from: across 13 low-resource retrieval tasks, the feedback model drives retrieval gains more than feedback source; LLM-generated feedback is the cheapest high-return option unless a high-quality first-stage retriever supplies strong candidate documents, in which case corpus-derived feedback outperforms.

A Systematic Study of Pseudo-Relevance Feedback with LLMs

Nour Jedidi, Jimmy Lin · March 11, 2026

arXiv · Paper type: other · Evidence strength: medium · Relevance: 7/10

In LLM-based pseudo-relevance feedback, how feedback is used (the feedback model) often matters more for retrieval quality than where it comes from, and while LLM-generated feedback is the most cost-effective default, corpus-derived feedback yields extra gains when paired with a strong first-stage retriever.

Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impacts PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.

Summary

Main Finding

When using LLM-based pseudo-relevance feedback (PRF), the choice of feedback model (how feedback is applied) critically affects retrieval effectiveness, and feedback source matters differently depending on context: LLM-generated feedback is the most cost-effective overall, while corpus-derived feedback helps most when candidate documents come from a strong first-stage retriever.

Key Points

PRF design decomposes into two independent dimensions:
- Feedback source: where the feedback text comes from (e.g., LLM-generated text vs. text drawn from the corpus).
- Feedback model: how that feedback text is used to refine the query representation.
Prior work often conflates these two dimensions; this study isolates them through controlled experiments.
Across 13 low-resource BEIR tasks and five LLM PRF methods:
- Feedback model choice can have a larger impact on retrieval quality than feedback source.
- Purely LLM-generated feedback yields the best cost-effectiveness (good performance for lower cost).
- Corpus-derived feedback becomes most useful only when the retrieval pipeline already supplies strong candidate documents from a high-quality first-stage retriever.
The results clarify which elements of the PRF design space are most important to prioritize in practice.
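
To make the two dimensions concrete, the sketch below keeps the feedback source and the feedback model as separate, swappable functions. It is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `generate` callable standing in for an LLM call, and the two toy feedback models (plain concatenation and a simplified term-weight interpolation) are all hypothetical and are not claimed to match any of the five methods evaluated.

```python
from collections import Counter
from typing import Callable, Dict, List


# --- Feedback sources: where the feedback text comes from -----------------
def corpus_source(ranked_docs: List[str], k: int = 2) -> List[str]:
    """Corpus-derived feedback: top-k candidates from a first-stage retriever."""
    return ranked_docs[:k]


def llm_source(query: str, generate: Callable[[str], str]) -> List[str]:
    """LLM-generated feedback. `generate` is a hypothetical stand-in for an LLM call."""
    return [generate(query)]


# --- Feedback models: how the feedback text refines the query -------------
def concat_model(query: str, feedback: List[str]) -> str:
    """Toy feedback model 1: append the feedback text to the query string."""
    return " ".join([query] + feedback)


def term_weight_model(
    query: str, feedback: List[str], n_terms: int = 3, alpha: float = 0.5
) -> Dict[str, float]:
    """Toy feedback model 2: interpolate query-term and feedback-term weights.

    Simplified for illustration: whitespace tokenization, raw term counts,
    and only the n_terms most frequent feedback terms are kept.
    """
    q_counts = Counter(query.lower().split())
    f_counts = Counter(" ".join(feedback).lower().split())
    top = dict(f_counts.most_common(n_terms))
    q_total = sum(q_counts.values()) or 1
    top_total = sum(top.values()) or 1
    weights = {t: alpha * c / q_total for t, c in q_counts.items()}
    for term, count in top.items():
        weights[term] = weights.get(term, 0.0) + (1 - alpha) * count / top_total
    return weights
```

Because each source returns a list of passages and each model accepts one, any source can be paired with any model, which is the property a controlled comparison of the two dimensions relies on.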

Data & Methods

Tasks: 13 low-resource retrieval tasks from the BEIR benchmark suite.
Methods: Evaluation of five LLM-based PRF methods, systematically varying:
- Feedback source (LLM-generated text vs. corpus-derived text).
- Feedback model (the mechanism that incorporates feedback into query refinement).
Experimental design: Controlled experiments that disentangle the independent effects of source and model, and that examine performance under differing strengths of the first-stage retriever.
Metrics and costs: Effectiveness measured by standard retrieval metrics (as typical in BEIR studies); cost-effectiveness assessed by considering the tradeoff between LLM invocation cost and retrieval gains.
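
The controlled design described above can be sketched as a full factorial grid over the varied factors. This is an assumed illustration of the idea rather than the authors' experiment code: the level names (`model_a`, `weaker_retriever`, and so on) are placeholders, and `marginal_means` is a hypothetical helper that averages a score over all other factors to expose the effect of one.

```python
from itertools import product
from typing import Dict, List, Tuple

# Placeholder factor levels; the real methods and retrievers are not named here.
FEEDBACK_SOURCES = ["llm_generated", "corpus_derived"]
FEEDBACK_MODELS = ["model_a", "model_b"]
FIRST_STAGE = ["weaker_retriever", "stronger_retriever"]

Config = Dict[str, str]


def build_grid(
    sources: List[str], models: List[str], retrievers: List[str]
) -> List[Config]:
    """Every combination of source x model x retriever (full factorial)."""
    return [
        {"source": s, "model": m, "retriever": r}
        for s, m, r in product(sources, models, retrievers)
    ]


def marginal_means(
    results: List[Tuple[Config, float]], factor: str
) -> Dict[str, float]:
    """Mean score per level of one factor, averaged over the other factors."""
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for config, score in results:
        level = config[factor]
        sums[level] = sums.get(level, 0.0) + score
        counts[level] = counts.get(level, 0) + 1
    return {level: sums[level] / counts[level] for level in sums}
```

Comparing `marginal_means(results, "model")` with `marginal_means(results, "source")` gives a rough view of which factor moves the score more, while grouping by `"retriever"` first shows whether the source effect depends on first-stage strength.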

Implications for AI Economics

Cost allocation: Organizations should consider LLM-generated feedback as a high-return, lower-cost PRF option for low-resource retrieval tasks, which can reduce expenses tied to corpus annotation or expensive retrieval pipelines.
Investment priorities: Greater ROI may come from investing in better feedback models (how feedback is used) than from collecting richer feedback sources, since improving the feedback-model component can yield the larger performance gains.
System design trade-offs: If investing in a strong first-stage retriever is feasible, augmenting it with corpus-derived feedback can further improve outcomes; otherwise, LLM-generated feedback is the more economical default.
Adoption strategy: Firms and platforms deploying retrieval-augmented systems should compare the marginal benefit per dollar of a stronger retriever, a more sophisticated feedback model, and additional LLM calls when designing retrieval stacks.
Policy and accessibility: Cost-effective LLM-generated PRF lowers the barrier to building competitive retrieval systems in low-resource domains, which can democratize access to advanced search tools across smaller organizations and research groups.
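
The "marginal benefit per dollar" comparison mentioned above can be written down as a small calculation. The sketch below is one assumed way to frame it, not a procedure taken from the paper: `Option`, `gain_per_dollar`, and `rank_options` are hypothetical names, and the effectiveness scores and per-query costs must come from your own measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Option:
    name: str
    score: float           # measured retrieval effectiveness for this setup
    cost_per_query: float  # measured cost in dollars per query


def gain_per_dollar(baseline: Option, option: Option) -> float:
    """Effectiveness gained over the baseline per extra dollar spent per query."""
    extra_cost = option.cost_per_query - baseline.cost_per_query
    if extra_cost <= 0:
        # An option that costs no more than the baseline needs no ratio:
        # compare its score with the baseline directly.
        raise ValueError("option must cost more than the baseline")
    return (option.score - baseline.score) / extra_cost


def rank_options(baseline: Option, options: List[Option]) -> List[Option]:
    """Order candidate upgrades by gain per dollar, best first."""
    return sorted(
        options, key=lambda o: gain_per_dollar(baseline, o), reverse=True
    )
```

A ratio like this only ranks upgrades against a shared baseline; it does not capture latency, engineering effort, or whether the absolute gain is large enough to matter.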

Assessment

Paper Type: other
Evidence Strength: medium. The study uses systematic, replicated experiments across 13 benchmark tasks and multiple PRF methods, giving credible within-sample evidence about relative method performance and cost-effectiveness; however, external validity is limited by the benchmark tasks, specific LLMs/pricing assumptions, and the finite set of PRF models evaluated, so causal claims about broader real-world deployment are somewhat conditional.
Methods Rigor: high. The authors isolate two orthogonal design factors in a controlled factorial setup, evaluate across many tasks (BEIR suite), test five PRF methods, and examine sensitivity to first-stage retriever strength and LLM costs, which reflects careful experimental design, systematic comparisons, and attention to robustness and cost trade-offs.
Sample: Empirical evaluation on 13 low-resource retrieval tasks from the BEIR benchmark suite; five distinct LLM-based pseudo-relevance feedback (PRF) methods were implemented; experiments vary feedback source (LLM-generated vs. corpus-derived), feedback model, and first-stage retriever strength; effectiveness measured by standard retrieval metrics and cost-effectiveness estimated using LLM invocation costs.
Themes: adoption, org_design, productivity
Identification: Controlled computational experiments that factorially vary two design dimensions, feedback source (LLM-generated vs. corpus-derived) and feedback model (mechanism for incorporating feedback), while holding other pipeline components constant; performance is compared across 13 BEIR low-resource retrieval tasks and under varying first-stage retriever strengths, with cost-effectiveness evaluated by trading LLM invocation cost against retrieval gains.
Generalizability:
- Results are specific to the 13 low-resource BEIR tasks and may not hold for large-scale web search or high-resource domains.
- Findings depend on the particular LLM(s) and pricing assumptions used to estimate cost-effectiveness.
- Only five PRF methods were tested; other feedback models or hybrid approaches might perform differently.
- Languages, domain-specific corpora, and user-interaction/latency considerations were likely not fully represented.
- Performance interaction with very different first-stage retriever architectures or large multi-stage pipelines may differ.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
PRF design decomposes into two independent dimensions: feedback source (where feedback text comes from) and feedback model (how that feedback is used to refine the query).	Other	positive	high	PRF design components (feedback source vs. feedback model)	0.12
Prior work often conflates feedback source and feedback model; this study isolates them through controlled experiments.	Other	negative	medium	Degree to which prior studies separate PRF design dimensions (methodological assessment)	0.07
Feedback model choice can have a larger impact on retrieval quality than feedback source.	Output Quality	positive	medium	Retrieval effectiveness (standard BEIR retrieval metrics)	n=13 0.07
Purely LLM-generated feedback yields the best cost-effectiveness overall (best performance per unit LLM invocation cost) for low-resource retrieval tasks.	Organizational Efficiency	positive	medium	Cost-effectiveness (retrieval gains per LLM invocation cost)	n=13 0.07
Corpus-derived feedback becomes most useful only when the retrieval pipeline already supplies strong candidate documents from a high-quality first-stage retriever.	Output Quality	mixed	medium	Retrieval effectiveness conditional on first-stage retriever quality	n=13 0.07
Across 13 low-resource BEIR tasks and five LLM PRF methods, the choice of feedback model (how feedback is applied) critically affects retrieval effectiveness.	Output Quality	positive	medium	Retrieval effectiveness (standard BEIR metrics)	n=13 0.07
The study's results clarify which elements of the PRF design space are most important to prioritize in practice (i.e., prioritize feedback-model improvements over source collection in many low-resource settings).	Output Quality	positive	medium	Relative impact on retrieval performance and cost-effectiveness	n=13 0.07
Organizations should consider LLM-generated feedback as a high-return, lower-cost PRF option for low-resource retrieval tasks to reduce expenses tied to corpus annotation or expensive retrieval pipelines.	Organizational Efficiency	positive	low	Economic metric: return (retrieval gains) per dollar spent on LLM invocations or corpus annotation	n=13 0.04
Greater ROI may come from investing in better feedback models (how to use feedback) than solely collecting richer feedback sources.	Organizational Efficiency	positive	medium	Return on investment (performance improvement per resource invested in model vs. source)	n=13 0.07
If investing in a strong first-stage retriever is feasible, augmenting it with corpus-derived feedback can further improve outcomes; otherwise, LLM-generated feedback is the more economical default.	Output Quality	mixed	medium	Retrieval effectiveness and cost-effectiveness conditional on first-stage retriever strength	n=13 0.07