Teams building LLM features use everything from informal 'vibe checks' to formal governance, but many cannot turn evaluation findings into concrete fixes: a persistent 'results-actionability gap' that slows value capture from LLMs. Firms that invest in instrumentation and clear remediation pathways reap larger productivity gains.
How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
Summary
Main Finding
Practitioners evaluating LLM-powered products rely on a mix of ad-hoc, interpretive practices (e.g., “vibe checks”) and manual testing because existing metrics and benchmarks do not map well to product needs. The authors identify a novel "results-actionability gap": teams can collect evaluation data but cannot translate findings into concrete fixes because they cannot reliably attribute failures to prompts, retrieval, or the model. The paper argues these interpretive practices are necessary adaptations to LLM properties and recommends organizational/process changes (not new metrics) to make evaluation actionable.
Key Points
- Sample & scope: 19 semi-structured interviews (Feb–May 2025) with practitioners (designers, engineers, researchers, data scientists, marketing) across startups to Fortune 500s and sectors such as healthcare, education, legal, enterprise software. Most use foundation models via APIs and evaluate product-level systems (model + UI + retrieval).
- Ten evaluation practices were observed, ranging from informal to organizational, e.g.:
  - “Vibe checks” / ad-hoc manual inspection
  - Manual test suites and exploratory testing
  - Mix of qualitative & quantitative metrics
  - LLM-as-judge / automated human-like scoring
  - Health checks and monitoring in production
  - A/B tests and targeted user studies
  - Expert review and domain-specific checks
  - Retrieval/QA-specific tests
  - Prompt-level experiments
  - Organizational meta-work (governance, roles, processes)
- Confirms four well-documented challenges from prior work:
  - Non-determinism leading to inconsistent test results
  - Component entanglement / inability to isolate root causes
  - Mismatch between standardized benchmarks/metrics and the product context
  - Human evaluation that is costly, inconsistent, and hard to scale
- Novel contribution: the results-actionability gap — evaluation yields signals (bugs, low scores, user complaints) but teams struggle to map signals to actionable interventions because failures can arise anywhere in the pipeline (prompt, retrieval/database, model behavior, UI).
- Interpretive practices (e.g., vibe checks) are not merely poor practice or transitional stopgaps; they are adaptive responses to probabilistic, context-sensitive LLM outputs and therefore should be supported and systematized rather than dismissed.
- Patterns from successful teams point to organizational and process strategies (rather than inventing new metrics) to make evaluation more actionable and systematic.
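Several of the practices above (notably LLM-as-judge and manual test suites) lend themselves to lightweight automation. As an illustration only, not a method from the paper, a minimal LLM-as-judge harness might look like the following; the injected `judge` callable, the grading prompt, and the 0.7 threshold are all assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class JudgeResult:
    passed: bool
    score: float
    rationale: str

def llm_as_judge(output: str, rubric: str,
                 judge: Callable[[str], str],
                 threshold: float = 0.7) -> JudgeResult:
    """Grade one product output against a rubric using a judge model.

    `judge` is any callable mapping a grading prompt to the judge's raw
    text reply; injecting it keeps the harness testable without network
    access to a real model API.
    """
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Output to grade:\n{output}\n\n"
        "Reply with a single score between 0 and 1."
    )
    score = float(judge(prompt).strip())  # parse the judge's numeric reply
    return JudgeResult(
        passed=score >= threshold,
        score=score,
        rationale=f"judge score {score:.2f} vs threshold {threshold}",
    )

# Usage with a stub judge standing in for a real model API call:
stub_judge = lambda prompt: "0.85"
result = llm_as_judge(
    "The capital of France is Paris.",
    "Answer must be factually correct and concise.",
    stub_judge,
)
```

Injecting the judge as a plain callable mirrors the interviewees' broader pattern of wrapping foundation models behind APIs: the same harness can score outputs in CI with a stub and in production with a live model.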
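The results-actionability gap stems from failures that could sit anywhere in the pipeline. One hypothetical way to systematize attribution (a sketch, not something the paper prescribes; the component names and record fields are assumptions) is to tag every evaluation signal with suspected components and fan it out into per-owner work queues:

```python
from dataclasses import dataclass
from enum import Enum

class Component(Enum):
    PROMPT = "prompt"
    RETRIEVAL = "retrieval"
    MODEL = "model"
    UI = "ui"

@dataclass
class FailureRecord:
    signal: str                  # what evaluation surfaced (bug, low score, complaint)
    suspected: list[Component]   # candidate root-cause components
    remediation: str             # concrete next action, owned by a team

def triage(records: list[FailureRecord]) -> dict[Component, list[FailureRecord]]:
    """Fan failure records out into per-component queues so each owner
    (prompt author, retrieval engineer, model team, UI team) receives an
    actionable list instead of an undifferentiated evaluation report."""
    queues: dict[Component, list[FailureRecord]] = {c: [] for c in Component}
    for record in records:
        for component in record.suspected:
            queues[component].append(record)
    return queues

# Usage:
records = [
    FailureRecord("hallucinated citation", [Component.RETRIEVAL, Component.MODEL],
                  "audit retrieved passages; tighten grounding instructions"),
    FailureRecord("answer truncated in sidebar", [Component.UI],
                  "raise the display character limit"),
]
queues = triage(records)
```

The point of the sketch is the data shape, not the code: once every signal carries suspected components and a named remediation, evaluation output maps directly onto engineering work.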
Data & Methods
- Participants: 19 practitioners (5 female, 14 male) with roles spanning research, design, engineering, data science, and marketing; diverse sectors and organization sizes.
- Recruitment: professional networks, LinkedIn/BlueSky, conferences.
- Procedure: 45–60 minute semi-structured Zoom interviews; audio-recorded and transcribed (Amberscript); participants received early access to outcomes; ethics approval obtained.
- Analysis: Reflexive thematic analysis (Braun & Clarke). Initial independent coding by two authors, iterative codebook development (four iterations), team discussions to consolidate and refine themes; interviews continued until thematic saturation.
- Focus: product-level evaluation (systems in production) rather than model-benchmarking in research settings.
Implications for AI Economics
- Evaluation as a hidden cost: The heavy reliance on manual, interpretive evaluation and the results-actionability gap mean substantial labor and coordination costs (engineering time, expert review, user testing). This raises the effective cost of deploying LLM features beyond API model fees.
- Returns to investment in evaluation infrastructure: Organizations that invest in instrumentation, modular testing, and cross-functional processes may gain faster iteration cycles, lower failure rates, and higher product-market fit — implying increasing returns to organizational capability and potentially widening gaps between well-resourced incumbents and resource-constrained entrants.
- Risk, liability, and valuation: Lack of actionable evaluation can increase product risk (regulatory, reputational, legal), affecting firm valuations, insurance costs, and due diligence. Economists and investors should account for evaluation maturity when assessing AI-enabled ventures.
- Productivity and diffusion: The paper suggests that benefits from LLMs are mediated by the ability to evaluate and improve systems; without actionable evaluation, expected productivity gains may not materialize. This modulates macro-level expectations about AI-driven productivity boosts.
- Market structure and specialization: Because evaluation requires cross-functional coordination and process changes more than new measurement science, there may be demand for specialized service providers (evaluation tooling, governance consulting) — creating new markets and complementarities around LLM deployment.
- Policy and regulation implications: Regulators aiming to require testing or transparency should consider the results-actionability gap; mandates to “test models” without guidance on attribution and remediation risk imposing costs without improving safety. Policy design should account for organizational capabilities and incentivize investments in modular testing and root-cause analysis.
- Research priorities for economics of AI: Study the cost–benefit of different evaluation investments (e.g., lightweight instrumentation, modular pipelines, governance roles), measure how evaluation maturity affects adoption rates, and model how evaluation friction influences competition and diffusion across firm sizes.
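The "evaluation as a hidden cost" point above can be made concrete with back-of-the-envelope arithmetic; every figure below is a hypothetical assumption, not data from the paper:

```python
# Illustrative monthly cost of one LLM feature (all figures hypothetical).
api_fees = 2_000                    # model/API usage (USD per month)
eng_hours = 60                      # manual review, test-suite upkeep
expert_hours = 20                   # domain-expert spot checks
eng_rate, expert_rate = 90, 150     # fully loaded USD/hour

eval_labor = eng_hours * eng_rate + expert_hours * expert_rate
total = api_fees + eval_labor
print(f"API fees:         ${api_fees:,}")
print(f"Evaluation labor: ${eval_labor:,}")           # $8,400
print(f"Labor share:      {eval_labor / total:.0%}")  # 81% of total spend
```

Under these (invented) numbers, interpretive evaluation labor dwarfs the API bill, which is the economic substance of the hidden-cost claim.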
Assessment
Claims (15)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Product teams evaluating LLM-powered features rely on a spectrum of practices—from informal “vibe checks” to organizational meta-work—to cope with LLMs’ unpredictability. | Team Performance | mixed | medium-high | types of evaluation practices used by product teams | n=19; 0.01 |
| The authors identify ten evaluation practices that teams use, ranging from lightweight interpretive checks to formal organizational processes (examples: qualitative user reviews, red-team testing, A/B experiments, telemetry/log analysis, structured annotation, governance/meta-evaluation). | Team Performance | mixed | high | taxonomy/count and description of evaluation practices | n=19; 10 practices; 0.09 |
| The study confirms several previously documented evaluation challenges with LLMs: model unpredictability, metric mismatch, high human-evaluation costs, and difficulty reproducing failures. | Error Rate | negative | medium-high | presence and prevalence of known evaluation challenges | n=19; 0.01 |
| Teams often produce evaluation outputs (tests, metrics, user feedback) but lack mechanisms, processes, or technical levers to convert those outputs into actionable engineering or product changes—a novel “results-actionability gap.” | Organizational Efficiency | negative | medium-high | ability to translate evaluation outputs into concrete product/engineering changes | n=19; 0.01 |
| Successful teams close the results-actionability gap by systematizing interpretive practices and creating clearer pathways from evaluation signals to product changes. | Organizational Efficiency | positive | medium | degree to which evaluation leads to implemented product changes | n=19; 0.05 |
| Interpretive, ad-hoc human-centered evaluation practices (e.g., “vibe checks”, team sense-making) are rational adaptations to LLM behavior rather than merely sloppy or inferior methodological choices. | Team Performance | neutral | medium | characterization of interpretive evaluation practices (rational adaptation vs. methodological failure) | n=19; 0.05 |
| Effective teams tend to evolve from ad-hoc interpretive methods toward systematic evaluation by (a) formalizing prompts/tests, (b) instrumenting outputs, (c) mapping failure modes to remediation paths, and (d) creating organizational decision rules. | Organizational Efficiency | positive | medium | process maturity in evaluation practices (ad-hoc to systematic) | n=19; 0.05 |
| The study method consisted of semi-structured qualitative interviews with 19 practitioners across multiple industries and roles, analyzed via thematic coding. | Research Productivity | null_result | high | study design and sample size | n=19; 0.09 |
| The paper produces as primary outcomes a taxonomy of ten evaluation practices, the articulation of the results-actionability gap, and recommended strategies observed among successful teams. | Research Productivity | null_result | high | reported study outputs (taxonomy, articulated gap, recommended strategies) | n=19; 0.09 |
| Measurement friction from the results-actionability gap creates a hidden cost: teams can detect problems but cannot cheaply translate findings into improvements, reducing the speed and ROI of LLM investments. | Firm Productivity | negative | low | inferred effect on ROI and speed of product improvement | n=19; 0.03 |
| Firms that invest in instrumentation, cross-functional processes, and remediation levers capture more value from LLMs; organizations with better evaluation-to-action pipelines will obtain higher productivity gains and market edge. | Firm Productivity | positive | low | relative productivity/value capture tied to evaluation-to-action capability (inferred) | n=19; 0.03 |
| The persistence of interpretive, human-in-the-loop evaluation implies ongoing labor requirements (annotation, sense-making, governance roles), affecting forecasts of automation and labor substitution in sectors adopting LLMs. | Automation Exposure | negative | medium | continued human labor requirements for evaluation | n=19; 0.05 |
| There is demand for tooling that bridges evaluation outputs to actionable fixes (e.g., failure-mode libraries, standardized remediation templates, evaluation-to-priority mapping), signaling economic opportunities for third-party tools and consulting services. | Adoption Rate | positive | low | inferred market demand for evaluation-to-action tooling/services | n=19; 0.03 |
| Regulators and standard-setters who value transparency and auditability will need to account for the gap between evaluation results and actionable fixes; firms may require incentives or rules to ensure evaluation leads to remediation, not just documentation. | Governance And Regulation | neutral | low | policy/regulatory effectiveness regarding evaluation leading to remediation (speculative) | n=19; 0.03 |
| The authors propose research priorities for economists: quantify productivity gains from closing the actionability gap; estimate firm-level heterogeneity in evaluation capability and its effect on adoption; and model investment trade-offs between building evaluation-to-action pipelines versus accepting reduced LLM performance. | Research Productivity | null_result | high | recommended research agenda topics | 0.09 |