Structured Intent as a Protocol-Like Communication Layer: Cross-Model Robustness, Framework Comparison, and the Weak-Model Compensation Effect

How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.

Summary

Main Finding

Structured intent representations (5W3H/PPS and other structured prompting frameworks) act like a protocol-layer for human→AI communication: they substantially and robustly improve goal alignment across models and languages, partially compensate for weaker model capability, and materially reduce user interaction costs when AI assists in expanding intent.

Key Points

Experimental scope: 3 frontier models (Anthropic Claude, OpenAI GPT-4o, Google Gemini 2.5 Pro), 3 languages (ZH/EN/JA), 6 prompt conditions (A: simple, B: raw JSON, C: manual 5W3H, D: AI-expanded 5W3H, E: CO-STAR, F: RISEN), and 60 tasks (20 in each of 3 domains), for 3,240 outputs in this study; the combined PPS-Bench dataset holds 5,400 records in total.
Primary quantitative outcomes:
- Structured prompts (C/D/E/F) raise goal-alignment (GA, 1–5) relative to unstructured baselines (A/B). Grand means: A=4.463, B=4.141, C=4.683, D=4.930, E=4.978, F=4.983.
- Cross-language score variation shrinks dramatically under structured conditions: the all-model cross-language σ (standard deviation) falls from 0.470 for unstructured prompts to about 0.020 under the strongest structured conditions, a reduction of up to ~24×.
- Weak-model compensation: the lowest-baseline model (Gemini) gains +1.006 GA from structured prompting (D vs. A), compared with +0.217 for the strongest model (Claude). Gains vary significantly by model.
- Framework comparison: 5W3H (PPS), CO-STAR, and RISEN achieve statistically equivalent high GA at the current evaluation resolution (D=4.930, E=4.978, F=4.983; equivalence margin ±0.2).
- Encoding overhead: in at least one cell of the design (GPT-4o, Japanese, condition D), high-dimensional structured encoding can reduce performance below the unstructured baseline, which suggests an execution-capacity boundary.
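The headline ratios in the bullets above can be cross-checked with simple arithmetic. The sketch below uses only the figures reported in this summary; the grouping of conditions into unstructured (A/B) and structured (C/D/E/F) follows the bullets, and taking an unweighted mean across conditions assumes the factorial design is balanced, which the reported design implies but the summary does not state explicitly.

```python
# Figures copied from the summary bullets above; nothing here is new data.
grand_means = {"A": 4.463, "B": 4.141, "C": 4.683, "D": 4.930, "E": 4.978, "F": 4.983}

unstructured = ("A", "B")
structured = ("C", "D", "E", "F")


def mean_of(conditions):
    """Unweighted mean of the reported grand means (assumes a balanced design)."""
    return sum(grand_means[c] for c in conditions) / len(conditions)


unstructured_mean = mean_of(unstructured)
structured_mean = mean_of(structured)

# Cross-language sigma: standard deviation of scores across languages.
sigma_unstructured = 0.470
sigma_structured = 0.020
sigma_ratio = sigma_unstructured / sigma_structured  # the "~24x" figure

# Weak-model compensation: reported D-minus-A gains for two models.
gain_gemini = 1.006
gain_claude = 0.217
gain_ratio = gain_gemini / gain_claude

print(f"mean GA, unstructured (A/B):   {unstructured_mean:.3f}")
print(f"mean GA, structured (C/D/E/F): {structured_mean:.3f}")
print(f"cross-language sigma ratio:    {sigma_ratio:.1f}x")
print(f"Gemini gain / Claude gain:     {gain_ratio:.2f}x")
```

Note that the ~24× figure is a ratio of standard deviations; the corresponding ratio of variances would be the square of that number.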
User study (N=50, within-subject, real tasks):
- AI-expanded 5W3H reduced the typical number of interaction rounds (reported as median/mean) from 4.05 to 1.62, a reduction of about 60% (1 − 1.62/4.05 = 0.60).
- User satisfaction rose from 3.16 → 4.04 (on 5-point scale).
- 82% of users needed to adjust at most two of the eight 5W3H dimensions after AI expansion.
Practical notes:
- Raw JSON (structure without readable rendering) performed worst overall (B mean = 4.141), indicating readability/format matters.
- PPS includes protocol-like metadata (version, timestamps, Instruction ID, SHA-256 fingerprint) to enable traceability and reproducibility.
- All outputs were judged on goal alignment by an independent judge model (DeepSeek-V3); evaluations used GA only (1–5).
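To make the protocol-like metadata and the readability point concrete, here is a minimal sketch of what an intent record with a SHA-256 fingerprint and a readable rendering might look like. It is not the PPS specification: the class name `IntentRecord`, the field names, the version string, and the eight dimension keys (one common reading of 5W3H) are all assumptions made for illustration, since the summary does not list them.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical dimension keys: one common reading of 5W3H. The source does
# not list the eight dimensions, so treat these names as placeholders.
DIMENSIONS = ("what", "why", "who", "when", "where", "how", "how_much", "how_feel")


def fingerprint(dimensions: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(dimensions, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class IntentRecord:
    dimensions: dict
    version: str = "0.1"  # placeholder version string, not from the source
    instruction_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    @property
    def sha256(self) -> str:
        return fingerprint(self.dimensions)

    def render(self) -> str:
        """Readable rendering: one labelled line per filled dimension."""
        lines = [f"{name.replace('_', ' ').title()}: {self.dimensions[name]}"
                 for name in DIMENSIONS if self.dimensions.get(name)]
        return "\n".join(lines)


record = IntentRecord({
    "what": "Plan a 3-day trip to Kyoto",
    "who": "Two adults",
    "how_much": "Budget under 800 USD",
})
print(record.render())
print(record.sha256)
```

Hashing a canonical encoding (sorted keys, fixed separators) is one way to make the fingerprint reproducible across tools; rendering the same record as labelled lines rather than raw JSON reflects the observation that condition B scored lowest.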

Data & Methods

Tasks & domains: 60 tasks (20 Travel, 20 Business, 20 Technical) designed for substantive generation (500–3,000 words).
Conditions:
- A: Single-sentence prompt (baseline).
- B: Raw JSON PPS (machine-structured, not natural-language).
- C: Manual 5W3H (expert authored).
- D: AI-expanded 5W3H (user What → AI generates remaining 7 dimensions via lateni.com; base expansion model = Qwen-Max).
- E: CO-STAR formatted prompts.
- F: RISEN formatted prompts.
Models: Claude-sonnet-4, GPT-4o, Gemini 2.5 Pro; temperature = 0.0 for all runs.
Languages: Chinese, English, Japanese (parallel tasks).
Evaluation: Single independent judge model (DeepSeek-V3) scored goal alignment (GA) 1–5 and provided reasoning; judge is architecturally independent from test models and expansion model.
Scale: This study produced 3,240 evaluated outputs (3 models × 3 languages × 6 conditions × 3 domains × 20 tasks); PPS-Bench, which also includes records from the earlier papers, totals 5,400 records.
Statistical checks: TOST equivalence tests (δ=0.2) used for framework comparisons; Kruskal-Wallis for model-group differences.
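The design and the equivalence check described above can be sketched in a few lines. The first part enumerates the full factorial grid to confirm the reported counts. The second part is a TOST (two one-sided tests) using a normal approximation on summary statistics; it is only an illustration of the technique, not the authors' analysis. The function name `tost_z` is hypothetical, and `ASSUMED_SD` is a placeholder because the summary reports condition means but no standard deviations.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

models = ["Claude", "GPT-4o", "Gemini 2.5 Pro"]
languages = ["ZH", "EN", "JA"]
conditions = ["A", "B", "C", "D", "E", "F"]
domains = ["Travel", "Business", "Technical"]
tasks_per_domain = range(1, 21)  # 20 tasks in each domain

# One tuple per evaluated output: (model, language, condition, domain, task).
grid = list(product(models, languages, conditions, domains, tasks_per_domain))
n_total = len(grid)
n_per_condition = sum(1 for cell in grid if cell[2] == "D")
n_per_model = sum(1 for cell in grid if cell[0] == "Claude")


def tost_z(mean_a, mean_b, sd_a, sd_b, n_a, n_b, delta):
    """Two one-sided z-tests for equivalence within +/- delta.

    Returns the larger of the two one-sided p-values; equivalence is
    claimed at level alpha when the returned value is below alpha.
    """
    diff = mean_a - mean_b
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    p_lower = 1.0 - NormalDist().cdf((diff + delta) / se)  # H0: diff <= -delta
    p_upper = NormalDist().cdf((diff - delta) / se)        # H0: diff >= +delta
    return max(p_lower, p_upper)


ASSUMED_SD = 0.3  # placeholder, NOT a reported value
p_equiv = tost_z(4.930, 4.983, ASSUMED_SD, ASSUMED_SD,
                 n_per_condition, n_per_condition, delta=0.2)

print(n_total, n_per_condition, n_per_model)
print(f"TOST p-value for D vs F under the assumed SD: {p_equiv:.3g}")
```

Because GA is an ordinal 1–5 score, a z-based TOST on means is a simplification; the conclusion of any real equivalence test depends on the actual score distributions, which are not available in this summary.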

Implications for AI Economics

Standardization & interoperability
- Structured intent representations behave like application-layer protocols that reduce output variability across heterogeneous models and languages. This lowers switching costs and frictions when users move between model providers, potentially increasing price competition and commoditization of some output services.
- A standardized intent layer can enable marketplaces for intent-processing tools (intent expanders, validators, converters) that sit above model providers.
Productization & specialization
- Because structured intent reduces reliance on the strongest models for many tasks, firms may shift to cheaper/“weaker” models plus standardized intent tooling. This supports business models that combine modest LLM compute with high-value intent engineering services.
- Conversely, providers of top-tier models may need to emphasize capabilities beyond basic intent-following (e.g., deeper reasoning, grounded knowledge, safety guarantees) to sustain premium pricing.
Labor and productivity effects
- The measured ~60% reduction in interaction rounds and the increase in satisfaction suggest direct productivity gains for knowledge workers using AI. Time saved translates into lower labor costs or higher output per worker, which influences firm-level labor demand and pricing for AI-assisted services.
- The availability of AI-assisted intent expansion lowers the skill/knowledge barrier for effective prompt engineering, broadening adoption among non-expert users and expanding market size for AI tools.
Market structure & complementary goods
- Structured-intent tools (formatters, expanders, intent repositories, verification services) become valuable complements; firms can monetize template libraries, intent-standards compliance, or intent-translation services.
- Certification/benchmarking ecosystems (gold-intent benchmarks, finer-grained alignment metrics) will be needed; ownership of these benchmarks may confer informational rents.
Competition & model valuation
- The weak-model compensation effect implies diminishing marginal value of raw model capability for tasks where structured intent suffices. This could compress willingness-to-pay differences across model tiers for many grounded content-generation tasks.
- However, the observed encoding overhead boundary cautions that high-dimensional intent may still demand sufficient execution capacity; tasks combining complex structure, non-primary languages, or heavy reasoning will preserve a premium for higher-capability models.
Regulatory and standardization implications
- Protocol-like intent standards (with metadata and fingerprints) improve auditability and reproducibility, facilitating regulatory compliance, provenance tracking, and liability assignment — all economically relevant for enterprise adoption.
Research & measurement market
- The paper highlights the need for external gold-intent benchmarks and finer-grained evaluation; the creation and control of such benchmarks will be an economic locus (governing market trust and benchmarking rents).

Overall, structured intent representation is likely to shift some value away from raw model capability toward intent-layer tooling, benchmarks, and protocolization — reshaping product offerings, competitive positioning, and the economics of AI-enabled knowledge work.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. Large synthetic experiment (3,240 model outputs) and a separate user study provide substantial empirical coverage and consistent patterns (reduced cross-language variance, interaction gains), and an independent scoring system improves objectivity; however, dependence on a single automated judge (DeepSeek-V3), limited domain/task set, potential metric tuning to structured formats, unclear randomization and statistical controls, and a modest N=50 user sample limit causal certainty and external validity.
Methods Rigor: medium. Systematic factorial design across models, languages, and prompting conditions with an independent evaluator and a complementary user study shows methodological care; but rigor is weakened by reliance on an automated evaluator without additional human rater validation reported, limited transparency about task selection and statistical testing, and insufficient detail about user-study design (randomization, participant recruitment, pre-registration).
Sample: 3,240 model outputs: 3 languages (Chinese, English, Japanese) × 6 prompting conditions (including 5W3H, CO-STAR, RISEN, and unstructured baselines) × 3 models (Claude, GPT-4o, Gemini 2.5 Pro) × 3 domains × 20 tasks; evaluations by DeepSeek-V3 automated judge; user study with N=50 participants performing AI-assisted intent expansion in ecologically valid tasks measuring interaction rounds and satisfaction.
Themes: human_ai_collab, productivity, adoption
Identification: Controlled within-task comparisons across prompting conditions (structured vs unstructured and alternative structured frameworks) crossed with language (Chinese, English, Japanese), model (Claude, GPT-4o, Gemini 2.5 Pro), and domain; outcomes scored by an independent automated judge (DeepSeek-V3); user-facing effects measured in an N=50 user study comparing AI-expanded 5W3H prompts to baseline interaction (rounds and satisfaction). No clear randomized assignment or instrumental variables reported; inference rests on factorial control and repeated observations per task.
Generalizability:
- Results may not generalize beyond the three proprietary model versions tested (model behavior can change with updates).
- Only three languages and three domains were evaluated; other languages, dialects, or task types may behave differently.
- Reliance on a single automated judge (DeepSeek-V3) risks metric-specific artifacts and may not reflect diverse human judgments.
- User study (N=50) is modest and may not represent broader populations or workplace contexts.
- Prompt engineering implementation details (formatting, priming, instruction tuning) could affect replicability across platforms and toolchains.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese.	Output Quality	positive	high	goal alignment (language generalization)	0.48
The study evaluated 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks) using an independent judge (DeepSeek-V3).	Other	null_result	high	number of model outputs evaluated / evaluation procedure	n=3240 0.48
Structured prompting substantially reduces cross-language score variance relative to unstructured baselines.	Output Quality	positive	high	cross-language score variance (sigma)	n=3240 cross-language sigma reduced from 0.470 to about 0.020 0.48
The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020.	Output Quality	positive	high	cross-language sigma (standard deviation of scores across languages)	n=3240 0.470 -> ~0.020 0.48
A weak-model compensation pattern was observed: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217).	Output Quality	positive	high	D-A gain (improvement in goal-alignment score from structured prompting)	n=1080 +1.006 (Gemini) vs +0.217 (Claude) 0.48
Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient.	Output Quality	null_result	high	goal-alignment scores	n=3240 0.48
The user study had N=50 participants.	Other	null_result	high	user study sample size	n=50 0.48
In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent.	Task Completion Time	positive	high	interaction rounds (number of back-and-forth interactions to reach goal)	n=50 60 percent reduction 0.48
In the user study, AI-expanded 5W3H prompts increase user satisfaction from 3.16 to 4.04.	Worker Satisfaction	positive	high	user satisfaction (rating scale)	n=50 3.16 -> 4.04 0.48
These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.	Organizational Efficiency	positive	medium	practical utility / robustness of structured intent representations	n=3240 0.29