Retrieval-augmented chat assistants boost human task accuracy across small and large LLMs, but users do not rate larger models as more usable or satisfying, revealing a gap between technical performance gains and user perception.
Much research on LLMs has focused on increasing benchmark performance. However, the evaluation of such models in real-world collaborative human-AI workflows has stayed behind. This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. Specifically, we examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. In this setting, we investigate how underlying model size (3B, 8B, and 70B) shapes the human-AI collaborative dynamic and how it influences perceived usability and satisfaction. Results show that the performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size, suggesting that hybrid systems are beneficial in information-seeking scenarios. Interestingly, however, perceived usability and satisfaction among participants showed little difference across model sizes. This demonstrates a nuanced trade-off between model size, performance, and user perception. Our work highlights the added value of evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, rather than focusing on benchmark performance only.
Summary
Main Finding
Human-AI collaboration using a retrieval-augmented generation (RAG) assistant significantly improves task accuracy over model-only baselines, and this improvement holds across generator model sizes (3B, 8B, 70B). However, participants’ perceived usability and satisfaction did not differ meaningfully by model size, which points to a trade-off: larger models yield limited incremental gains in user perception within a multi-turn, document-driven workflow.
Key Points
- Experiment: realistic multi-turn information-seeking task (answering 9 detailed questions about a 406-page flight manual). Participants could use the PDF, a RAG chatbot, or both.
- Comparison conditions: LLM-only (no retrieval), LLM+RAG single-shot baseline (standard benchmark-style), and human-AI multi-turn interaction with the RAG-assistant.
- Models tested: three instruction-tuned Llama3-family generators: 3B (Llama3.2), 8B (Llama3.1), and 70B (Llama3.3). All share the same data cutoff (Dec 2023) and were run at temperature = 1.
- Main quantitative result: human-AI (multi-turn with RAG) substantially outperformed the model-only baselines irrespective of model size.
- Subjective measures: perceived usability and satisfaction across participants showed little or no meaningful differences between the three model sizes.
- Practical implication: RAG + human-in-the-loop interaction can close performance gaps between small and large generators for document-grounded tasks; larger parameter counts offer limited gains in user satisfaction in this setting.
Data & Methods
- Participants: recruited via Prolific; initial N = 120, reduced to N = 112 after excluding participants who used external tools. Allocations: 37 (3B), 37 (8B), 38 (70B). Demographics: mean age ≈ 39.4 (SD 10.3), balanced gender split reported.
- Task details: 9 questions varying in difficulty and dispersion of relevant facts across the flight manual; all answers are present in the manual. Participants were pre-screened to exclude domain experts.
- RAG pipeline:
  - Source document split into chunks (chunk size 1024), yielding 2,497 chunks.
  - Embeddings: intfloat/multilingual-e5-large.
  - Retrieval: the top-5 chunks by cosine similarity are passed to the generator.
  - The generator receives the system prompt plus the retrieved chunks; responses were shown in the chat, and the retrieved passages were visible to participants.
- Baselines and controls:
  - LLM-only baseline: generator queried without retrieval; models generally failed to recall correct answers from internal memory alone.
  - LLM+RAG baseline: single-shot input (no human reformulation / multi-turn).
  - Human-AI: participants could iterate, reformulate prompts, consult RAG outputs, and consult the PDF.
- Evaluation:
  - Answers graded by two independent raters as correct (1), partially correct (0.5 where applicable), or incorrect (0).
  - Accuracy computed as total score divided by possible points.
  - Statistical analysis: one-sample t-test comparing baseline to human-AI performance; linear mixed-effects models to test model-size effects with random intercepts for participant and question ID; Tukey adjustment for multiple comparisons across three sizes.
- Other implementation notes:
  - Generator temperature fixed at 1.
  - Retriever model selection informed by a pre-study; intfloat/multilingual-e5-large was chosen for its representational capacity, although a smaller retriever performed similarly.
  - Ethics approval obtained; average participant pay ~£12.88 including bonuses.
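As an illustration of the retrieval, scoring, and testing steps listed above, the following minimal Python sketch shows top-5 retrieval by cosine similarity, accuracy as total score over possible points, and the t statistic of a one-sample t-test against a fixed baseline. It is not the study's code: the toy 3-dimensional vectors stand in for real embeddings (which the study obtained from intfloat/multilingual-e5-large), the function names are hypothetical, every score is a made-up example, and one point per question is an assumption.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return indices of the k chunks most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


def accuracy(item_scores, points_per_item=1.0):
    """Total score divided by possible points; each item is 0, 0.5, or 1."""
    if not item_scores:
        raise ValueError("no graded items")
    return sum(item_scores) / (points_per_item * len(item_scores))


def one_sample_t(values, baseline):
    """t statistic for testing whether the mean of values differs from baseline."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)  # sample SD (n - 1)
    return (statistics.mean(values) - baseline) / standard_error


# Toy vectors standing in for chunk embeddings (hypothetical data).
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.1, 1.0, 0.0],
          [0.0, 0.0, 1.0], [0.7, 0.7, 0.0], [0.5, 0.0, 0.8]]
query = [1.0, 0.0, 0.0]
retrieved = top_k_chunks(query, chunks, k=5)

# Nine hypothetical graded answers for one participant.
participant_accuracy = accuracy([1, 1, 0.5, 0, 1, 1, 0.5, 1, 0])

# Hypothetical per-participant accuracies compared with a hypothetical baseline.
t_stat = one_sample_t([0.8, 0.7, 0.9, 0.6], baseline=0.5)
```

In practice a library routine such as SciPy's `ttest_1samp` would also return the p-value. The mixed-effects analysis of model size is not sketched here.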
Implications for AI Economics
- Cost-effectiveness and deployment:
  - Smaller generators (3B, 8B) augmented with a good retrieval pipeline plus human interaction can deliver large practical gains in accuracy, reducing the case for always deploying very large (70B+) models in document-grounded workplace workflows.
  - Because 70B models have much higher compute, memory, and energy costs, organizations may realize better cost-performance trade-offs by investing in retrieval infrastructure, prompt/workflow design, and human-in-the-loop processes rather than only scaling parameter count.
- Privacy, regulation, and local deployment:
  - Smaller, locally deployable models combined with RAG support privacy-sensitive and compliance-driven use cases (e.g., EU AI Act), enabling on-prem or edge deployment with controlled data flows, an economically attractive alternative to relying on remote/closed commercial models.
- Productivity and labor implications:
  - The measured accuracy gains from hybrid human-AI workflows suggest augmentation (not full automation) is the nearer-term economic effect: tasks can be completed more accurately and possibly faster when humans and RAG-assistants collaborate. Economic models of labor substitution should therefore account for enhanced productivity and changed task composition (more verification / decision oversight rather than pure retrieval).
- Procurement and investment priorities:
  - Procurement decisions should consider total system costs: model size, inference hardware, embedding/retrieval costs, developer effort for integration, and human time. This study implies disproportionate returns from investing in retrieval quality, UX for multi-turn interaction, and human workflows.
- Environmental and policy considerations:
  - Smaller models reduce energy and carbon footprint per inference; combined with RAG they can achieve similar application-level outcomes as larger models, supporting sustainability goals and potentially lowering regulatory or reputational risks tied to high-power model use.
- Recommendations for economic evaluation of AI projects:
  - Evaluate AI systems in multi-turn, human-in-the-loop settings and include user-centric metrics (usability, satisfaction) and operational costs, not just benchmark accuracy. Cost-benefit analyses should explicitly model retrieval infrastructure and ongoing human effort required for verification.
- Caveats for economic interpretation:
  - The study is a single-domain experiment (a flight manual) with open-weight Llama3 variants and specific retriever settings; transferability to other domain types, larger multi-document corpora, or different user populations requires confirmation.
  - Exact compute and cost figures (inference latency, GPU/CPU requirements, energy use) were not reported; economic decisions should supplement these behavioral results with infrastructure cost estimates.
Suggested next steps for managers and economists:
- Run pilot cost models comparing (a) small-model+RAG+human workflow vs (b) large-model-only deployments, including hardware, energy, maintenance, and personnel costs.
- Prioritize investment in retrieval quality, UX for multi-turn interactions, and human verification workflows where data is sensitive or regulatory compliance matters.
- Expand trials to other domains and measure throughput, time-to-answer, and real-world error costs to refine ROI estimates.
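One way to set up the pilot cost model suggested above is to compare workflows by cost per correct answer. The Python sketch below is a hypothetical framing, not something the study reports: the class, its field names, and every number are placeholders to be replaced with measured values.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """Hypothetical cost description of one deployment option."""
    name: str
    inference_cost_per_query: float  # currency units per model call
    queries_per_task: float          # average model calls per task
    human_minutes_per_task: float    # human time spent per task
    hourly_wage: float               # currency units per hour
    accuracy: float                  # fraction of tasks answered correctly

    def cost_per_task(self) -> float:
        """Compute cost plus labor cost for a single task."""
        compute = self.inference_cost_per_query * self.queries_per_task
        labor = self.human_minutes_per_task / 60.0 * self.hourly_wage
        return compute + labor

    def cost_per_correct_answer(self) -> float:
        """Cost per task scaled by how often the task is answered correctly."""
        if self.accuracy <= 0.0:
            raise ValueError("accuracy must be positive")
        return self.cost_per_task() / self.accuracy


# Placeholder inputs only; none of these figures come from the study.
small_rag_human = Workflow("small model + RAG + human", 0.01, 3, 6, 30.0, 0.8)
large_only = Workflow("large model only", 0.10, 1, 0, 30.0, 0.5)

for wf in (small_rag_human, large_only):
    print(f"{wf.name}: {wf.cost_per_correct_answer():.2f} per correct answer")
```

A fuller model would also include hardware amortization, energy, maintenance, and the cost of wrong answers, so the comparison printed by this sketch should not be read as a result.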
Assessment
Claims (6)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| This work evaluates a chatbot-style assistant based on Retrieval-Augmented Generation (RAG) in a realistic multi-turn information-seeking scenario inspired by workplace settings where compliance with local legislation and secure handling of sensitive data are often key. [Other] | null_result | high | other | n=112; 1.0 |
| We examine the performance of humans (N=112) assisted by RAG-assistants compared to LLM-only or LLM+RAG baselines. [Other] | null_result | high | other | n=112; 1.0 |
| The performance gain of human-AI collaboration over the model-only baselines is significant, irrespective of model size. [Output Quality] | positive | high | task accuracy / performance | n=112; 0.6 |
| Perceived usability and satisfaction among participants showed little difference across model sizes. [Worker Satisfaction] | null_result | high | usability and satisfaction | n=112; 0.6 |
| Hybrid systems (human + RAG assistant) are beneficial in information-seeking scenarios. [Output Quality] | positive | high | task performance in information-seeking | n=112; 0.6 |
| Evaluating AI applications in actual multi-turn interactions with human users, looking at usability and satisfaction besides accuracy, provides added value compared to focusing on benchmark performance only. [Other] | positive | high | evaluation methodology value (usability, satisfaction, accuracy) | n=112; 0.1 |