A specialist virtual sales host, VerbalValue, fine-tuned on product knowledge and annotated live-commerce dialogs, outperforms leading LLMs—raising informativeness by 23% and factual accuracy by 18%—and produces more tactful, engaging responses. The results are promising for automated live commerce but rest on a small annotated dataset and proxy engagement metrics rather than measured purchase conversion.

VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce

Yuyan Chen · May 14, 2026

arXiv · quasi-experimental · medium evidence · 7/10 relevance

Fine-tuning an LLM with a product knowledge base, sales terminology, and 1,475 annotated live-commerce interactions (VerbalValue) improves informativeness, factual correctness, tactfulness, and viewer engagement relative to top LLM baselines.

A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

Summary

Main Finding

VerbalValue is a purpose-built virtual host for live-commerce that combines a product-anchored knowledge base, intent-conditioned fine-tuning, and a dual-channel real-time architecture to operate as a sales agent rather than a generic conversationalist. On a Chinese beauty live-commerce test set it improves catalogue-grounded informativeness by ~23% and factual correctness by ~18% (relative to the strongest baseline), while also raising tactful, intent-appropriate responses and engagement.

Key Points

Objective: Reframe live-stream hosting as sales conversion requiring emotional intelligence, factual grounding, and session-level pacing.
Three core challenges addressed:
- Continuous narration vs. interrupt-driven responses (solved via a dual-channel architecture: idle pitch + interruptible interactive channel).
- Sales expertise vs. broadcast fluency (solved via a structured product KB + ingredient glossary to constrain hallucination).
- Socially intelligent tactics vs. efficient training (solved via intent-conditioned supervision with four discourse strategies).
Discrete output schema: each response produces four supervised fields—spoken broadcast (≤2 sentences), a short display slogan (8–12 Chinese characters), a follow-up engagement question, and a CTA—so the model jointly learns persuasion, captioning, and conversion cues.
Quantitative gains (human-judged, 1–5 Likert; Krippendorff’s α > 0.7):
- Informativeness: VerbalValue 4.32 vs. best baseline 3.51 → +0.81 points absolute, +23.08% relative to the best baseline.
- Correctness (no out-of-catalogue claims): 0.86 vs best baseline 0.73 → +17.8% relative.
- Tactfulness: +4.22% over the best baseline.
- Fluency: 5.9% lower than Claude Sonnet 4.6, reflecting a trade-off between expressiveness and factual grounding.
Ablations show product-context injection and intent tags are primary drivers of correctness and tactfulness respectively.
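
The headline percentages are consistent with relative gains computed from the reported scores. The sketch below reproduces that arithmetic; the function names are illustrative and do not come from the paper.

```python
def absolute_gain(model: float, baseline: float) -> float:
    """Difference in raw score (same units as the metric)."""
    return model - baseline


def relative_gain(model: float, baseline: float) -> float:
    """Improvement over the baseline, expressed as a percentage of the baseline."""
    return (model - baseline) / baseline * 100.0


# Scores as reported in the summary above.
info_abs = absolute_gain(4.32, 3.51)   # informativeness, Likert points
info_rel = relative_gain(4.32, 3.51)   # informativeness, percent
corr_rel = relative_gain(0.86, 0.73)   # correctness, percent

print(f"informativeness: +{info_abs:.2f} points, +{info_rel:.2f}% relative")
print(f"correctness: +{corr_rel:.1f}% relative")
```

The distinction matters when comparing across metrics: a 0.81-point move on a 1–5 scale and a 0.13 move on a 0–1 rate are not directly comparable in absolute terms, which is presumably why relative figures are reported.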

Data & Methods

Domain and scope:
- Vertical: Chinese beauty live-commerce (China is a mature live-commerce market and serves as the testbed).
- Product catalogue: 12 skincare items (cleansers, serums, moisturizers, sunscreens).
- Ingredient glossary: 23 ingredient names with neutral, non-pharmacological descriptions to bound permissible claims.
- Pitch scripts: 180–240 Chinese characters per product following hook→explain→guide→close arc.
Fine-tuning dataset:
- Size: 1,475 annotated instances (50% real comments from public broadcasts, 50% style-matched synthetic comments generated by GPT-5.2).
- Intent label distribution: Inquiry (40%), Scepticism (20%), Appreciation (20%), Antagonism (20%).
- Each instance: system prompt + intent-tagged comment + target response in four-field schema.
- Rigorous cleaning and human verification: three annotators confirm intent labels; mean naturalness >4.5 required; duplicate/PII/incoherent filters applied.
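
One way to picture a training instance and its four-field target is the sketch below. It is an assumption-laden illustration, not the authors' data format: the class names, field names, sentence-splitting rule, and the example texts are all hypothetical; only the four intents, the two-sentence limit, and the 8–12 character slogan length come from the summary.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Intent labels listed in the summary.
INTENTS = {"inquiry", "scepticism", "appreciation", "antagonism"}


@dataclass
class HostResponse:
    broadcast: str   # spoken line, at most 2 sentences
    slogan: str      # on-screen slogan, 8-12 Chinese characters
    follow_up: str   # engagement question
    cta: str         # call to action


@dataclass
class TrainingInstance:
    system_prompt: str
    intent: str
    comment: str
    target: HostResponse


def count_sentences(text: str) -> int:
    # Hypothetical rule: split on Chinese and Latin sentence-ending punctuation.
    return sum(1 for part in re.split(r"[。！？!?.]+", text) if part.strip())


def count_cjk(text: str) -> int:
    # Count characters in the basic CJK Unified Ideographs block.
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def validate(instance: TrainingInstance) -> list[str]:
    """Return a list of schema violations (empty if the instance is valid)."""
    problems = []
    if instance.intent not in INTENTS:
        problems.append("unknown intent")
    if count_sentences(instance.target.broadcast) > 2:
        problems.append("broadcast longer than 2 sentences")
    if not 8 <= count_cjk(instance.target.slogan) <= 12:
        problems.append("slogan not 8-12 Chinese characters")
    if not instance.target.follow_up.strip():
        problems.append("missing follow-up question")
    if not instance.target.cta.strip():
        problems.append("missing CTA")
    return problems


# Made-up example; the product claims are placeholders, not catalogue facts.
example = TrainingInstance(
    system_prompt="You are a live-commerce host for a skincare catalogue.",
    intent="inquiry",
    comment="Is this cleanser OK for sensitive skin?",
    target=HostResponse(
        broadcast="This cleanser is gentle on sensitive skin. It rinses clean without tightness.",
        slogan="温和清洁肌肤不紧绷",
        follow_up="What is your skin type?",
        cta="Tap the first link to order.",
    ),
)
print(validate(example))
```

A validator of this kind could sit in the cleaning pipeline alongside the duplicate, PII, and coherence filters, rejecting targets that break the output schema before they reach fine-tuning.
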
Model and training:
- Backbone: Qwen2.5-32B-Instruct.
- Adaptation: LoRA applied to linear layers (rank 8, scaling 32).
- Training: 20 epochs, batch size 32, learning rate 1e-4, bfloat16 precision, maximum sequence length 2048.
- Inference: nucleus sampling (temperature 0.9, top-p 0.92), repetition penalty 1.12; generate 6 candidates → rerank by product alignment, anti-repetition, and style freshness → pick the top candidate.
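
The generate-then-rerank step can be sketched as follows. The summary names the three reranking criteria but not how they are scored, so the lexical heuristics, the equal weighting, and every function name here are assumptions made purely for illustration.

```python
import re


def words(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def product_alignment(candidate: str, product_terms: list[str]) -> float:
    """Assumed proxy: share of product terms the candidate mentions."""
    if not product_terms:
        return 0.0
    present = set(words(candidate))
    return sum(1 for term in product_terms if term.lower() in present) / len(product_terms)


def anti_repetition(candidate: str, history: list[str]) -> float:
    """Assumed proxy: 1 minus the highest Jaccard overlap with recent replies."""
    cand = set(words(candidate))
    if not cand or not history:
        return 1.0
    overlaps = [len(cand & set(words(h))) / len(cand | set(words(h))) for h in history]
    return 1.0 - max(overlaps)


def style_freshness(candidate: str, history: list[str]) -> float:
    """Assumed proxy: reward an opening word not used by recent replies."""
    cand = words(candidate)
    if not cand:
        return 0.0
    recent_openers = {words(h)[0] for h in history if words(h)}
    return 0.0 if cand[0] in recent_openers else 1.0


def rerank(candidates: list[str], product_terms: list[str], history: list[str]) -> str:
    """Pick the candidate with the highest (equally weighted) total score."""
    def score(c: str) -> float:
        return (product_alignment(c, product_terms)
                + anti_repetition(c, history)
                + style_freshness(c, history))
    return max(candidates, key=score)


# Made-up candidates standing in for sampled generations.
history = ["This serum is great for daily use."]
candidates = [
    "This serum is great for daily use.",
    "Our niacinamide serum layers well under sunscreen.",
    "Thanks for watching, everyone!",
]
best = rerank(candidates, ["serum", "niacinamide"], history)
print(best)
```

In a real deployment the alignment score would more plausibly check claims against the product knowledge base rather than count keywords; the point of the sketch is only the shape of the pipeline: sample several candidates, score each on independent criteria, keep the best.
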
Architecture:
- Dual-channel dialogue service with a five-stage session state machine that arbitrates a single audio resource: idle channel cycles pitch scripts; interactive channel preempts to answer comments and resumes from sentence boundary.
- Media and TTS decoupled from generation for latency isolation; sentence-level TTS for interruptible playback.
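
The preempt-and-resume behaviour of the two channels can be illustrated with a small arbiter. This is a simplified sketch under stated assumptions: it does not reproduce the paper's five-stage state machine (whose stages the summary does not name), and the class and method names are hypothetical. It models only the core idea that the idle channel cycles pitch sentences, the interactive channel takes priority when a reply is queued, and the pitch resumes at the next sentence boundary.

```python
from collections import deque


class DualChannelHost:
    """Arbitrates one audio slot between an idle pitch and queued replies."""

    def __init__(self, pitch_sentences):
        self.pitch = list(pitch_sentences)
        if not self.pitch:
            raise ValueError("pitch script must contain at least one sentence")
        self.cursor = 0          # index of the next pitch sentence to speak
        self.pending = deque()   # replies waiting on the interactive channel

    def on_comment(self, reply_sentences):
        """Queue a generated reply (already split into sentences)."""
        reply = list(reply_sentences)
        if reply:
            self.pending.append(reply)

    def next_utterance(self):
        """Return (channel, sentence) for the next sentence-level TTS slot."""
        if self.pending:
            # Interactive channel preempts: speak the oldest queued reply first.
            reply = self.pending[0]
            sentence = reply.pop(0)
            if not reply:
                self.pending.popleft()
            return ("interactive", sentence)
        # Idle channel: continue the pitch from where it was interrupted.
        sentence = self.pitch[self.cursor]
        self.cursor = (self.cursor + 1) % len(self.pitch)
        return ("idle", sentence)


host = DualChannelHost(["hook", "explain", "guide", "close"])
print(host.next_utterance())
host.on_comment(["answer part 1", "answer part 2"])
print(host.next_utterance())
print(host.next_utterance())
print(host.next_utterance())
```

Because switching happens only between sentences, sentence-level TTS is what makes the interruption clean: the pitch is never cut mid-sentence, and the cursor records exactly where to resume.
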
Baselines & evaluations:
- Baselines: GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, plus Qwen2.5-32B without fine-tuning.
- Human evaluation dimensions: Informativeness, Relevance, Fluency, Tactfulness, Correctness.
- LLM-as-judge (Qwen2.5-72B) for Creativity and Engagement.

Implications for AI Economics

Value capture & revenue effects:
- Higher catalogue-grounded informativeness and correctness plausibly translate to higher conversion rates and lower post-sale friction (returns, complaints). If live commerce converts at a multiple of conventional e-commerce rates, even a modest lift could have large revenue effects per stream.
- Structured CTA + engagement hooks aim to increase viewer retention and click-throughs; measurable monetizable KPIs (conversion rate, average order value, session length) are the right economic metrics to evaluate ROI.
Labor market and task reallocation:
- Automation potential: systems like VerbalValue can substitute part of routine hosting tasks—continuous narration, FAQ answering, and standard persuasion—reducing demand for entry-level or scripted-host labor.
- Complementarity: human hosts may shift to high-variance tasks where creativity, negotiation, and complex trust-building matter (celebrity endorsements, conflict resolution, regulatory-sensitive claims), increasing the skill premium for high-touch roles (creative director, compliance overseer).
- New roles: content curators, product-KB maintainers, intent-label designers, and safety/compliance auditors to keep catalogue and glossaries up to date.
Fixed vs. variable costs and scaling:
- Upfront costs: building product KBs, curated scripts, and intent-labeled datasets is resource-intensive per vertical and product family, creating barriers to rapid horizontal scaling across categories.
- Marginal costs: once a KB and fine-tuned model exist, marginal streaming costs are mostly inference, media hosting, and TTS—enabling high operating leverage for platforms with many sessions.
- Firms that internalize large, high-quality product KBs and fine-tuning pipelines may enjoy economies of scale and scope (a potential competitive moat).
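
The operating-leverage argument above reduces to a simple average-cost formula: a fixed build cost spread over the number of sessions, plus a constant marginal cost per session. The sketch below works through it with purely hypothetical numbers in arbitrary cost units; none of the figures come from the paper.

```python
def average_cost_per_session(fixed_cost: float, marginal_cost: float, sessions: int) -> float:
    """Fixed cost amortized over sessions, plus per-session marginal cost."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return fixed_cost / sessions + marginal_cost


# Hypothetical inputs: KB + fine-tuning build cost, and inference/TTS/hosting per session.
FIXED = 50_000.0
MARGINAL = 2.0

for n in (100, 1_000, 10_000):
    print(n, average_cost_per_session(FIXED, MARGINAL, n))
```

As the session count grows, the average cost approaches the marginal cost, which is the sense in which platforms running many sessions can amortize per-vertical setup costs that a single small seller cannot.
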
Market structure and competition:
- Potential concentration: proprietary KBs + fine-tuning expertise could concentrate power among a few technology/platform providers and large retailers who can amortize upfront costs.
- Platform strategies: marketplaces might offer AI-host-as-a-service for SMB sellers, changing how discovery and promotion are priced and monetized.
Measurement and metric alignment:
- The paper highlights misalignment of standard NLP metrics (fluency, surface-similarity) with commercial objectives (conversion, trust). AI economics evaluations should prioritize downstream economic outcomes over proxy linguistic metrics.
Regulatory, trust, and compliance economics:
- Grounding generation in the product catalogue and a controlled ingredient glossary reduces legal/regulatory risk (false claims, health misadvice), lowering expected compliance costs.
- Disclosure of AI-host identity (recommended by authors) affects consumer trust and may alter conversion elasticity; regulators may mandate disclosure, affecting adoption dynamics.
Externalities & risks:
- Hallucination risk in unconstrained LLMs imposes negative externalities (consumer harm, reputational loss). Grounding and glossary constraints mitigate but require ongoing maintenance; failure to maintain accurate KBs could create liabilities.
- Use of synthetic data (GPT-generated comments) in training accelerates dataset construction but can bias style and risk overfitting to synthetic patterns—potentially affecting real-world efficacy and long-run consumer response.
Generalizability & limits:
- Results are demonstrated for a small (12-product) beauty catalogue in Chinese live-commerce. Economic effects may differ in other verticals (electronics, fashion) with different knowledge complexity, return rates, and regulatory regimes.
- Market maturity matters: China’s high live-commerce penetration likely makes adoption effects larger there than in markets where live commerce is nascent.
Evaluation recommendations for economists and practitioners:
- Measure conversion lift, repeat purchase, viewer retention, and complaint/return rates in A/B or randomized trials rather than relying solely on human/LLM proxy metrics.
- Account for maintenance costs of KBs and legal/regulatory compliance when projecting ROI.
- Monitor distributional impacts on host labor and consider transitional policies (retraining, certification for AI-augmented hosts).
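
A minimal sketch of the first recommendation, assuming a randomized A/B test in which viewers are assigned to a human-hosted (control) or AI-hosted (treatment) stream and purchases are counted. It uses a standard pooled two-proportion z-test; the function name and the counts are hypothetical and are not results from the paper.

```python
import math


def conversion_lift_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> dict:
    """Two-sided pooled two-proportion z-test of arm B against arm A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0 or p_a == 0.0:
        raise ValueError("test undefined for degenerate conversion rates")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return {"lift": (p_b - p_a) / p_a, "z": z, "p_value": p_value}


# Hypothetical counts: 200 of 10,000 control viewers buy vs. 250 of 10,000 treated.
result = conversion_lift_test(200, 10_000, 250, 10_000)
print(result)
```

The same structure applies to return and complaint rates. Repeat purchase and retention need longer observation windows, and session-level randomization calls for standard errors clustered by session rather than by viewer.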

Limitations noted by the authors (relevant for economic interpretation): small curated catalogue, synthetic augmentation bias, language/market specificity (Chinese beauty), trade-off between fluency and factual grounding, and the need for disclosure/ethical deployment.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper reports quantitative gains (23% informativeness, 18% factual correctness) from controlled model comparisons against multiple strong baselines and uses a curated annotated dataset; however, the dataset is small (1,475 interactions); evaluation appears to rely on proxy metrics and human ratings rather than hard commercial outcomes (e.g., conversion/revenue); details on randomization, statistical significance, and potential annotation bias are limited; and comparisons involve proprietary baselines that may be hard to replicate.
Methods Rigor: medium. Strengths include construction of a domain knowledge base, a curated terminology lexicon, and a labeled live-commerce dataset, plus direct fine-tuning of an LLM with multi-dimensional evaluation; weaknesses are modest sample size and limited transparency about annotation procedures, rater training, inter-rater reliability, evaluation protocol, statistical tests, ablations, and external validity checks (e.g., field A/B tests or purchase-conversion measures).
Sample: A domain-specific product knowledge base and curated sales terminology lexicon plus an annotated dataset of 1,475 live-commerce interactions covering diverse viewer intents; evaluated against several proprietary LLM baselines (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, etc.) using metrics such as informativeness, factual correctness, tactfulness, and viewer engagement (likely via human raters and automated measures).
Themes: human_ai_collab, adoption
Generalizability:
- Small and potentially non-representative dataset (1,475 interactions) limits statistical power and coverage of product categories.
- Unclear diversity of languages, cultural contexts, and viewer demographics; may not generalize across regions.
- Evaluations use proxy metrics (informativeness, factual correctness, engagement) rather than measured sales/conversion or revenue impact.
- Performance compared to specific proprietary model versions; results may change with updated baselines or deployment constraints.
- Potential annotation/rater bias and lack of reported inter-rater reliability reduce confidence in generalizability.
- Live-commerce format may not transfer to other sales channels (e.g., text chat, email, brick-and-mortar).

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure.	Firm Revenue	positive	high	purchase intent / sales conversion	0.08
No existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade.	Output Quality	negative	high	quality of recommendations / engagement and persuasion	0.24
We construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise.	Other	positive	high	availability of domain knowledge and sales lexicon (artifact creation)	0.48
We collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents.	Other	positive	high	size of annotated dataset	n=1475 0.8
We fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection.	Other	positive	high	ability to produce empathetic, commercially oriented responses	0.48
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness.	Output Quality	positive	high	informativeness	23% on informativeness 0.48
Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 18% on factual correctness.	Output Quality	positive	high	factual correctness	18% on factual correctness 0.48
Experiments show consistent advantages in tactfulness.	Output Quality	positive	high	tactfulness	0.48
Experiments show consistent advantages in viewer engagement.	Adoption Rate	positive	high	viewer engagement	0.48