Retrieval-augmented chat assistants boost human task accuracy across small and large LLMs, but users do not rate larger models as more usable or satisfying, revealing a gap between technical performance gains and user perception.

Seeking Information with RAG-Assistants: Does Model Size Matter in Human-AI Collaborations?

Lennard C. Froma, Tom Kouwenhoven, Maaike H. T. de Boer, Catholijn M. Jonker, Max J. van Duijn · May 01, 2026

arXiv · RCT · medium evidence · relevance 7/10

RAG-augmented chat assistants significantly improve human performance in multi-turn, workplace-style information-seeking tasks across model sizes, while perceived usability and satisfaction do not increase with larger models.

Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.

Summary

Main Finding

Human-AI collaboration using a retrieval-augmented generation (RAG) assistant significantly improves task accuracy over model-only baselines, and this improvement holds across generator model sizes (3B, 8B, 70B). However, participants’ perceived usability and satisfaction did not differ meaningfully by model size: in this multi-turn, document-driven workflow, larger models yielded little incremental gain in user perception.

Key Points

Experiment: realistic multi-turn information-seeking task (answering 9 detailed questions about a 406-page flight manual). Participants could use the PDF, a RAG chatbot, or both.
Comparison conditions: LLM-only (no retrieval), LLM+RAG single-shot baseline (standard benchmark-style), and human-AI multi-turn interaction with the RAG-assistant.
Models tested: three instruction-tuned Llama3-family sizes as the generator: 3B (Llama3.2), 8B (Llama3.1), 70B (Llama3.3). All share the same data cutoff (Dec 2023) and were run at temperature = 1.
Main quantitative result: human-AI (multi-turn with RAG) substantially outperformed the model-only baselines irrespective of model size.
Subjective measures: perceived usability and satisfaction across participants showed little or no meaningful differences between the three model sizes.
Practical implication: RAG + human-in-the-loop interaction can close performance gaps between small and large generators for document-grounded tasks; larger parameter counts offer limited gains in user satisfaction in this setting.

Data & Methods

Participants: recruited via Prolific; initial N = 120, reduced to N = 112 after excluding participants who used external tools. Allocations: 37 (3B), 37 (8B), 38 (70B). Demographics: mean age ≈ 39.4 (SD 10.3), balanced gender split reported.
Task details: 9 questions varying in difficulty and dispersion of relevant facts across the flight manual; all answers are present in the manual. Participants were pre-screened to exclude domain experts.
RAG pipeline:
- Source doc chunked (chunk size 1024) → 2,497 chunks.
- Embeddings: intfloat/multilingual-e5-large.
- Retrieval: top-5 chunks by cosine similarity returned to generator.
- Generator receives system prompt + retrieved chunks; responses shown in chat and returned passages were visible to participants.
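The chunking and top-5 retrieval steps above can be illustrated with a minimal sketch. It is not the study's implementation: the function names are hypothetical, the chunk size is assumed to be counted in characters (the summary does not state the unit), and the toy 2-dimensional vectors stand in for real intfloat/multilingual-e5-large embeddings.

```python
import math

def chunk_text(text, chunk_size=1024):
    """Split text into consecutive pieces of at most chunk_size characters.

    Assumption: the unit (characters vs. tokens) is not given in the summary.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by index
    return [i for _, i in scored[:k]]

# Toy vectors (hypothetical); a real pipeline would embed the chunks and the query.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.1]
print(top_k_chunks(query_vec, chunk_vecs, k=2))
```

In a full pipeline the text of the selected chunks would be appended to the system prompt before the generator is called, and shown to the participant alongside the answer.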
Baselines and controls:
- LLM-only baseline: generator queried without retrieval; models generally failed to retrieve correct answers from internal memory alone.
- LLM+RAG baseline: single-shot input (no human reformulation / multi-turn).
- Human-AI: participants could iterate, reformulate prompts, consult RAG outputs, and consult PDF.
Evaluation:
- Answers graded by two independent raters as correct (1), partially correct (0.5 where applicable), or incorrect (0).
- Accuracy computed as total score divided by possible points.
- Statistical analysis: one-sample t-test comparing baseline to human-AI performance; linear mixed-effects models to test model-size effects with random intercepts for participant and question ID; Tukey adjustment for multiple comparisons across three sizes.
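As a rough illustration of the scoring and the one-sample comparison described above, the sketch below computes accuracy from per-question scores and a t statistic against a fixed baseline. The function names and example values are hypothetical, and the summary does not say how the two raters' grades were reconciled, so per-question scores are taken as given.

```python
import math
import statistics

def accuracy(scores, max_points_per_question=1.0):
    """Total score divided by the total points available."""
    return sum(scores) / (len(scores) * max_points_per_question)

def one_sample_t(values, baseline):
    """t statistic for H0: mean(values) == baseline (needs at least 2 values)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - baseline) / (sd / math.sqrt(n))

# Hypothetical grades for one participant on 9 questions: 1, 0.5, or 0 each.
participant_scores = [1, 1, 0.5, 0, 1, 1, 0.5, 1, 0]
print(accuracy(participant_scores))

# Hypothetical participant accuracies compared against a made-up baseline of 0.5.
human_ai_accuracies = [0.6, 0.8, 0.7, 0.9]
print(one_sample_t(human_ai_accuracies, baseline=0.5))
```

The mixed-effects analysis with random intercepts for participant and question is not shown here; it would require a statistics package rather than the standard library.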
Other implementation notes:
- Generator temperature fixed at 1.
- Retriever model selection informed by a pre-study; intfloat/multilingual-e5-large was chosen for its representational capacity, although a smaller retriever performed similarly.
- Ethics approval obtained; average participant pay ~£12.88 including bonuses.

Implications for AI Economics

Cost-effectiveness and deployment:
- Smaller generators (3B, 8B) augmented with a good retrieval pipeline plus human interaction can deliver large practical gains in accuracy, reducing the case for always deploying very large (70B+) models in document-grounded workplace workflows.
- Because 70B models have much higher compute, memory, and energy costs, organizations may realize better cost-performance trade-offs by investing in retrieval infrastructure, prompt/workflow design, and human-in-the-loop processes rather than only scaling parameter count.
Privacy, regulation, and local deployment:
- Smaller, locally deployable models combined with RAG support privacy-sensitive and compliance-driven use cases (e.g., EU AI Act), enabling on-prem or edge deployment with controlled data flows—an economically attractive alternative to relying on remote/closed commercial models.
Productivity and labor implications:
- The measured accuracy gains from hybrid human-AI workflows suggest augmentation (not full automation) is the nearer-term economic effect: tasks can be completed more accurately and possibly faster when humans and RAG-assistants collaborate. Economic models of labor substitution should therefore account for enhanced productivity and changed task composition (more verification / decision oversight rather than pure retrieval).
Procurement and investment priorities:
- Procurement decisions should consider total system costs: model size, inference hardware, embedding/retrieval costs, developer effort for integration, and human time. This study implies disproportionate returns from investing in retrieval quality, UX for multi-turn interaction, and human workflows.
Environmental and policy considerations:
- Smaller models reduce energy and carbon footprint per inference; combined with RAG they can achieve similar application-level outcomes as larger models, supporting sustainability goals and potentially lowering regulatory or reputational risks tied to high-power model use.
Recommendations for economic evaluation of AI projects:
- Evaluate AI systems in multi-turn, human-in-the-loop settings and include user-centric metrics (usability, satisfaction) and operational costs—not just benchmark accuracy. Cost-benefit analyses should explicitly model retrieval infrastructure and ongoing human effort required for verification.
Caveats for economic interpretation:
- The study is a single-domain experiment (a flight manual) with open-weight Llama3 variants and specific retriever settings; transferability to other domain types, larger multi-document corpora, or different user populations requires confirmation.
- Exact compute and cost figures (inference latency, GPU/CPU requirements, energy use) were not reported; economic decisions should supplement these behavioral results with infrastructure cost estimates.

Suggested next steps for managers and economists:
- Run pilot cost-models comparing (a) small-model+RAG+human workflow vs (b) large-model-only deployments, including hardware, energy, maintenance, and personnel costs.
- Prioritize investment in retrieval quality, UX for multi-turn interactions, and human verification workflows where data is sensitive or regulatory compliance matters.
- Expand trials to other domains and measure throughput, time-to-answer, and real-world error costs to refine ROI estimates.
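A pilot cost model of the kind suggested here could start from something as simple as cost per correct answer. The sketch below is one possible framing, not a method from the study: the class, the field names, and every number are placeholders, since the paper reports no cost figures.

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    """Per-query cost inputs for one deployment option (all values hypothetical)."""
    name: str
    inference_cost_per_query: float   # currency units
    retrieval_cost_per_query: float   # currency units, 0.0 if no retrieval
    human_minutes_per_query: float    # time spent prompting and verifying
    accuracy: float                   # fraction of correct answers, in (0, 1]

def cost_per_correct_answer(d, hourly_wage):
    """Total per-query cost divided by accuracy."""
    if not 0 < d.accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    human_cost = d.human_minutes_per_query / 60.0 * hourly_wage
    total = d.inference_cost_per_query + d.retrieval_cost_per_query + human_cost
    return total / d.accuracy

# Placeholder inputs only; replace with measured values from a pilot.
small_rag_human = Deployment("small+RAG+human", 0.01, 0.01, 6.0, 0.8)
print(small_rag_human.name, cost_per_correct_answer(small_rag_human, hourly_wage=30.0))
```

Dividing by accuracy is a deliberate simplification: it ignores the downstream cost of wrong answers, which the last suggested step would measure separately.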

Assessment

Paper Type: rct
Evidence Strength: medium. Randomized assignment supports causal claims about the effect of assistance type on task performance, but the sample is modest (N=112) and outcomes are measured in a simulated/experimental information-seeking task rather than real workplace productivity or firm-level outcomes, limiting external validity.
Methods Rigor: medium. Design includes multi-turn interactions and multiple model sizes and compares objective performance and subjective usability, but the description lacks detail about participant recruitment, blinding, pre-registration, power calculations, and robustness checks; single-task domain and potential learning or demand effects reduce methodological robustness.
Sample: 112 human participants interacting with chatbot-style assistants in realistic multi-turn information-seeking tasks inspired by workplace scenarios involving legal compliance and sensitive data handling; conditions compared RAG-assisted systems to LLM-only and LLM+RAG baselines across three model sizes (3B, 8B, 70B); measured objective task performance and self-reported usability/satisfaction.
Themes: human_ai_collab, productivity, adoption
Identification: Randomized experiment assigning human participants (N=112) to assistive conditions (RAG-assisted chatbot vs LLM-only vs LLM+RAG baseline) and comparing task accuracy and survey measures across groups and model sizes (3B, 8B, 70B). Causal inference rests on random assignment to conditions and direct measurement of outcomes in multi-turn interactions.
Generalizability:
- Modest, non-representative sample (N=112); likely convenience sampling (e.g., online panel) limits population inference
- Experimental tasks simulate but do not fully reproduce real workplace complexity, stakes, or long-term workflows
- Only one task domain (information-seeking/compliance); results may not transfer to creative, coding, or operational tasks
- Specific RAG implementation and model families tested; results may not hold for different retrieval systems or model architectures
- Short-term interactions; no evidence on long-run learning, adaptation, or productivity changes over time
- Language, cultural, and regulatory context may restrict applicability across regions

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key.	Other	null_result	high	other	n=112 1.0
We examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines.	Other	null_result	high	other	n=112 1.0
The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size.	Output Quality	positive	high	task accuracy / performance	n=112 0.6
Perceived usability and satisfaction among participants showed little difference across model sizes.	Worker Satisfaction	null_result	high	usability and satisfaction	n=112 0.6
Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios.	Output Quality	positive	high	task performance in information-seeking	n=112 0.6
Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only.	Other	positive	high	evaluation methodology value (usability, satisfaction, accuracy)	n=112 0.1