
Teams building LLM features use everything from informal 'vibe checks' to formal governance, but many cannot turn evaluation findings into concrete fixes — a persistent 'results-actionability gap' that slows LLM value capture; firms that invest in instrumentation and clear remediation pathways reap larger productivity gains.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Willem van der Maden (ORCID 0000-0003-0245-1633), Malak Sadek, Ziang Xiao, Aske Mottelson (ORCID 0000-0003-1827-8513), Q. Vera Liao, Jichen Zhu · April 13, 2026 · IT University of Copenhagen
OpenAlex · Paper type: descriptive · Evidence strength: low · Relevance: 7/10 · Links: DOI · Source · PDF
Practitioners use a spectrum of informal to formal evaluation practices for LLMs, but many teams face a 'results-actionability gap' where evaluation outputs do not translate into implementable product changes, and teams that systematize interpretive practices and remediation pipelines capture more value.

How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, their unpredictable nature makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices spanning informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we introduce a novel fifth we call the results-actionability gap, in which practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents a research opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.

Summary

Main Finding

Practitioners evaluating LLM-powered products rely on a mix of ad-hoc, interpretive practices (e.g., “vibe checks”) and manual testing because existing metrics and benchmarks do not map well to product needs. The authors identify a novel "results-actionability gap": teams can collect evaluation data but cannot translate findings into concrete fixes because they cannot reliably attribute failures to prompts, retrieval, or the model. The paper argues these interpretive practices are necessary adaptations to LLM properties and recommends organizational/process changes (not new metrics) to make evaluation actionable.

Key Points

  • Sample & scope: 19 semi-structured interviews (Feb–May 2025) with practitioners (designers, engineers, researchers, data scientists, marketing) across startups to Fortune 500s and sectors such as healthcare, education, legal, enterprise software. Most use foundation models via APIs and evaluate product-level systems (model + UI + retrieval).
  • Ten evaluation practices were observed, ranging from informal to organizational, e.g.:
    • “Vibe checks” / ad-hoc manual inspection
    • Manual test suites and exploratory testing
    • Mix of qualitative & quantitative metrics
    • LLM-as-judge / automated human-like scoring
    • Health checks and monitoring in production
    • A/B tests and targeted user studies
    • Expert review and domain-specific checks
    • Retrieval/QA-specific tests
    • Prompt-level experiments
    • Organizational meta-work (governance, roles, processes)
  • Confirms four well-documented challenges from prior work:
    • Non-determinism leading to inconsistent test results
    • Component entanglement / inability to isolate root causes
    • Benchmarks and standardized metrics mismatch product context
    • Human evaluation is costly, inconsistent, and hard to scale
  • Novel contribution: the results-actionability gap — evaluation yields signals (bugs, low scores, user complaints) but teams struggle to map signals to actionable interventions because failures can arise anywhere in the pipeline (prompt, retrieval/database, model behavior, UI); a hypothetical sketch of routing such signals to remediation owners appears after this list.
  • Interpretive practices (e.g., vibe checks) are not merely poor practice or transitional stopgaps; they are adaptive responses to probabilistic, context-sensitive LLM outputs and therefore should be supported and systematized rather than dismissed.
  • Patterns from successful teams point to organizational and process strategies (rather than inventing new metrics) to make evaluation more actionable and systematic.
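
To make the "LLM-as-judge" and failure-attribution points above concrete, the sketch below is a minimal, hypothetical Python example (not taken from the paper): it scores an output against a judge rubric and tags each failing case with the pipeline component a reviewer suspects, so evaluation signals can be routed to a remediation owner. The function names, rubric text, component labels, and 0.7 pass threshold are all illustrative assumptions.

```python
# Hypothetical sketch (not from the paper): an LLM-as-judge style check that
# also records which pipeline component a failure is attributed to, so that
# evaluation signals can be routed to a concrete remediation owner.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Component(Enum):
    PROMPT = "prompt"        # system/user prompt templates
    RETRIEVAL = "retrieval"  # RAG index, chunking, ranking
    MODEL = "model"          # base model behavior / version
    UI = "ui"                # how the answer is rendered to users

@dataclass
class EvalResult:
    case_id: str
    score: float                            # judge score in [0, 1]
    passed: bool
    suspected_component: Component | None   # None if the case passed

JUDGE_RUBRIC = (
    "Rate the answer from 0 to 1 for grounding in the retrieved context "
    "and adherence to the product's tone guidelines."
)

def run_case(case_id: str, question: str,
             generate: Callable[[str], tuple[str, str]],
             judge: Callable[[str, str, str], float],
             triage: Callable[[str, str, str], Component]) -> EvalResult:
    """Generate an answer, score it with a judge, and triage any failure."""
    answer, retrieved_context = generate(question)
    score = judge(JUDGE_RUBRIC, question, answer)
    passed = score >= 0.7  # threshold is a product decision, not a standard
    suspected = None if passed else triage(question, retrieved_context, answer)
    return EvalResult(case_id, score, passed, suspected)

def remediation_queue(results: list[EvalResult]) -> dict[Component, list[str]]:
    """Group failing cases by suspected component so each owner gets a worklist."""
    queue: dict[Component, list[str]] = {}
    for r in results:
        if not r.passed and r.suspected_component is not None:
            queue.setdefault(r.suspected_component, []).append(r.case_id)
    return queue
```

The `generate`, `judge`, and `triage` callables are deliberately left abstract: they stand in for whatever model API, judge prompt, or human review step a team actually uses. The point is only that each failing signal ends up attached to a component and an owner rather than sitting in a report.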

Data & Methods

  • Participants: 19 practitioners (5 female, 14 male) with roles spanning research, design, engineering, data science, and marketing; diverse sectors and organization sizes.
  • Recruitment: professional networks, LinkedIn/BlueSky, conferences.
  • Procedure: 45–60 minute semi-structured Zoom interviews; audio-recorded and transcribed (Amberscript); participants received early access to outcomes; ethics approval obtained.
  • Analysis: Reflexive thematic analysis (Braun & Clarke). Initial independent coding by two authors, iterative codebook development (four iterations), team discussions to consolidate and refine themes; interviews continued until thematic saturation.
  • Focus: product-level evaluation (systems in production) rather than model-benchmarking in research settings.

Implications for AI Economics

  • Evaluation as a hidden cost: The heavy reliance on manual, interpretive evaluation and the results-actionability gap mean substantial labor and coordination costs (engineering time, expert review, user testing). This raises the effective cost of deploying LLM features beyond API model fees.
  • Returns to investment in evaluation infrastructure: Organizations that invest in instrumentation, modular testing, and cross-functional processes may gain faster iteration cycles, lower failure rates, and higher product-market fit — implying increasing returns to organizational capability and potentially widening gaps between well-resourced incumbents and resource-constrained entrants.
  • Risk, liability, and valuation: Lack of actionable evaluation can increase product risk (regulatory, reputational, legal), affecting firm valuations, insurance costs, and due diligence. Economists and investors should account for evaluation maturity when assessing AI-enabled ventures.
  • Productivity and diffusion: The paper suggests that benefits from LLMs are mediated by the ability to evaluate and improve systems; without actionable evaluation, expected productivity gains may not materialize. This modulates macro-level expectations about AI-driven productivity boosts.
  • Market structure and specialization: Because evaluation requires cross-functional coordination and process changes more than new measurement science, there may be demand for specialized service providers (evaluation tooling, governance consulting) — creating new markets and complementarities around LLM deployment.
  • Policy and regulation implications: Regulators aiming to require testing or transparency should consider the results-actionability gap; mandates to “test models” without guidance on attribution and remediation risk imposing costs without improving safety. Policy design should account for organizational capabilities and incentivize investments in modular testing and root-cause analysis.
  • Research priorities for economics of AI: Study the cost–benefit of different evaluation investments (e.g., lightweight instrumentation, modular pipelines, governance roles), measure how evaluation maturity affects adoption rates, and model how evaluation friction influences competition and diffusion across firm sizes.
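
As a purely illustrative aid for the cost–benefit question in the last bullet (not a model from the paper), the toy calculation below compares cumulative product improvement when evaluation findings reach a fix quickly versus when the results-actionability gap stretches the loop. All parameter values are placeholder assumptions.

```python
# Toy, illustrative calculation (all numbers are placeholder assumptions, not
# data from the paper): how an evaluation-to-remediation lag slows the
# compounding of product improvements over a fixed horizon.

def cumulative_improvement(weeks: int, cycle_weeks: float, gain_per_cycle: float) -> float:
    """Quality improvement compounded once per completed evaluate-then-fix cycle."""
    cycles = int(weeks // cycle_weeks)
    return (1 + gain_per_cycle) ** cycles - 1

HORIZON_WEEKS = 52
GAIN_PER_CYCLE = 0.02  # assumed 2% quality gain each time a finding is acted on

# Team A: evaluation findings reach a shipped fix within ~2 weeks.
fast = cumulative_improvement(HORIZON_WEEKS, cycle_weeks=2, gain_per_cycle=GAIN_PER_CYCLE)
# Team B: the results-actionability gap stretches the same loop to ~8 weeks.
slow = cumulative_improvement(HORIZON_WEEKS, cycle_weeks=8, gain_per_cycle=GAIN_PER_CYCLE)

print(f"Fast loop: +{fast:.1%} over a year")  # roughly +67%
print(f"Slow loop: +{slow:.1%} over a year")  # roughly +13%
```

Under these made-up parameters, the same per-cycle gain compounds to a several-fold difference in annual improvement, which is the mechanism behind the claim that evaluation friction erodes the ROI of LLM investments.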


Assessment

  • Paper type: descriptive
  • Evidence strength: low — Findings are based on qualitative interviews with 19 practitioners and thematic coding; the study is exploratory and hypothesis-generating rather than causal or generalizable, relies on self-reported practices, and does not quantify prevalence or effect sizes.
  • Methods rigor: medium — The study uses standard qualitative methods (semi-structured interviews and thematic coding) and appears to sample practitioners across roles and sectors, enabling rich process insights; however, the small sample, unspecified recruitment strategy, limited geographic/firm-size detail, and lack of triangulation with observational/quantitative data constrain rigor.
  • Sample: Semi-structured interviews with 19 practitioners from multiple industries and organizational roles who build or evaluate LLM-enabled products; analysis via thematic coding of interview transcripts; details on recruitment, geographic spread, and firm sizes are not fully reported.
  • Themes: org_design, productivity, human_ai_collab, adoption, governance
  • Generalizability limits:
    • Small sample (n=19) limits representativeness and ability to estimate prevalence of practices
    • Likely self-selection toward more engaged or resourced teams (selection bias)
    • Unclear geographic and firm-size coverage — may overrepresent tech firms or US/Western contexts
    • Relies on practitioner self-report rather than systematic observational or outcome data
    • Findings identify patterns and mechanisms but do not establish causal links to productivity or economic outcomes

Claims (15)

  • Product teams evaluating LLM-powered features rely on a spectrum of practices—from informal “vibe checks” to organizational meta-work—to cope with LLMs’ unpredictability.
    Direction: mixed · Confidence: medium-high · Outcome: Team Performance (types of evaluation practices used by product teams) · Details: n=19; 0.01
  • The authors identify ten evaluation practices that teams use, ranging from lightweight interpretive checks to formal organizational processes (examples: qualitative user reviews, red-team testing, A/B experiments, telemetry/log analysis, structured annotation, governance/meta-evaluation).
    Direction: mixed · Confidence: high · Outcome: Team Performance (taxonomy/count and description of evaluation practices) · Details: n=19; 10 practices; 0.09
  • The study confirms several previously documented evaluation challenges with LLMs: model unpredictability, metric mismatch, high human-evaluation costs, and difficulty reproducing failures.
    Direction: negative · Confidence: medium-high · Outcome: Error Rate (presence and prevalence of known evaluation challenges) · Details: n=19; 0.01
  • Teams often produce evaluation outputs (tests, metrics, user feedback) but lack mechanisms, processes, or technical levers to convert those outputs into actionable engineering or product changes—a novel “results-actionability gap.”
    Direction: negative · Confidence: medium-high · Outcome: Organizational Efficiency (ability to translate evaluation outputs into concrete product/engineering changes) · Details: n=19; 0.01
  • Successful teams close the results-actionability gap by systematizing interpretive practices and creating clearer pathways from evaluation signals to product changes.
    Direction: positive · Confidence: medium · Outcome: Organizational Efficiency (degree to which evaluation leads to implemented product changes) · Details: n=19; 0.05
  • Interpretive, ad-hoc human-centered evaluation practices (e.g., “vibe checks”, team sense-making) are rational adaptations to LLM behavior rather than merely sloppy or inferior methodological choices.
    Direction: neutral · Confidence: medium · Outcome: Team Performance (characterization of interpretive evaluation practices: rational adaptation vs. methodological failure) · Details: n=19; 0.05
  • Effective teams tend to evolve from ad-hoc interpretive methods toward systematic evaluation by (a) formalizing prompts/tests, (b) instrumenting outputs, (c) mapping failure modes to remediation paths, and (d) creating organizational decision rules.
    Direction: positive · Confidence: medium · Outcome: Organizational Efficiency (process maturity in evaluation practices, ad-hoc to systematic) · Details: n=19; 0.05
  • The study method consisted of semi-structured qualitative interviews with 19 practitioners across multiple industries and roles, analyzed via thematic coding.
    Direction: null_result · Confidence: high · Outcome: Research Productivity (study design and sample size) · Details: n=19; 0.09
  • The paper produces as primary outcomes a taxonomy of ten evaluation practices, the articulation of the results-actionability gap, and recommended strategies observed among successful teams.
    Direction: null_result · Confidence: high · Outcome: Research Productivity (reported study outputs: taxonomy, articulated gap, recommended strategies) · Details: n=19; 0.09
  • Measurement friction from the results-actionability gap creates a hidden cost: teams can detect problems but cannot cheaply translate findings into improvements, reducing the speed and ROI of LLM investments.
    Direction: negative · Confidence: low · Outcome: Firm Productivity (inferred effect on ROI and speed of product improvement) · Details: n=19; 0.03
  • Firms that invest in instrumentation, cross-functional processes, and remediation levers capture more value from LLMs; organizations with better evaluation-to-action pipelines will obtain higher productivity gains and market edge.
    Direction: positive · Confidence: low · Outcome: Firm Productivity (relative productivity/value capture tied to evaluation-to-action capability, inferred) · Details: n=19; 0.03
  • The persistence of interpretive, human-in-the-loop evaluation implies ongoing labor requirements (annotation, sense-making, governance roles), affecting forecasts of automation and labor substitution in sectors adopting LLMs.
    Direction: negative · Confidence: medium · Outcome: Automation Exposure (continued human labor requirements for evaluation) · Details: n=19; 0.05
  • There is demand for tooling that bridges evaluation outputs to actionable fixes (e.g., failure-mode libraries, standardized remediation templates, evaluation-to-priority mapping), signaling economic opportunities for third-party tools and consulting services.
    Direction: positive · Confidence: low · Outcome: Adoption Rate (inferred market demand for evaluation-to-action tooling/services) · Details: n=19; 0.03
  • Regulators and standard-setters who value transparency and auditability will need to account for the gap between evaluation results and actionable fixes; firms may require incentives or rules to ensure evaluation leads to remediation, not just documentation.
    Direction: neutral · Confidence: low · Outcome: Governance And Regulation (policy/regulatory effectiveness regarding evaluation leading to remediation; speculative) · Details: n=19; 0.03
  • The authors propose research priorities for economists: quantify productivity gains from closing the actionability gap; estimate firm-level heterogeneity in evaluation capability and its effect on adoption; and model investment trade-offs between building evaluation-to-action pipelines versus accepting reduced LLM performance.
    Direction: null_result · Confidence: high · Outcome: Research Productivity (recommended research agenda topics) · Details: 0.09
