Structured intent prompts (5W3H/CO-STAR/RISEN) make AI understanding far more consistent across languages and models and speed up human–AI interactions, with the biggest improvements on weaker models; users needed 60% fewer rounds and reported much higher satisfaction when AI-expanded 5W3H prompts were used.
How reliably can structured intent representations preserve user goals across different AI models, languages, and prompting frameworks? Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. This paper extends that line of inquiry in three directions: cross-model robustness across Claude, GPT-4o, and Gemini 2.5 Pro; controlled comparison with CO-STAR and RISEN; and a user study (N=50) of AI-assisted intent expansion in ecologically valid settings. Across 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks), evaluated by an independent judge (DeepSeek-V3), we find that structured prompting substantially reduces cross-language score variance relative to unstructured baselines. The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. We also observe a weak-model compensation pattern: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent and increase user satisfaction from 3.16 to 4.04. These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction.
Summary
Main Finding
Structured intent representations (5W3H/PPS and other structured prompting frameworks) act like a protocol layer for human-to-AI communication: they substantially and robustly improve goal alignment across models and languages, partially compensate for weaker model capability, and materially reduce user interaction costs when AI assists in expanding intent.
Key Points
- Experimental scope: 3 frontier models (Anthropic Claude, OpenAI GPT-4o, Google Gemini 2.5 Pro), 3 languages (ZH/EN/JA), 6 prompt conditions (A: simple, B: raw JSON, C: manual 5W3H, D: AI-expanded 5W3H, E: CO-STAR, F: RISEN), 3 domains × 20 tasks (60 tasks in total), 3,240 outputs in this study; combined PPS-Bench dataset = 5,400 records total.
- Primary quantitative outcomes:
  - Structured prompts (C/D/E/F) raise goal-alignment (GA, 1–5) relative to unstructured baselines (A/B). Grand means: A=4.463, B=4.141, C=4.683, D=4.930, E=4.978, F=4.983.
  - Cross-language score variance shrinks dramatically under structured conditions (up to ~24× reduction; all-model σ from 0.470 for unstructured → 0.020 under structured).
  - Weak-model compensation: weaker baseline model (Gemini) gains +1.006 GA from structured prompting (D vs A) vs. +0.217 for the strongest model (Claude). Gains vary significantly by model.
  - Framework comparison: 5W3H (PPS), CO-STAR, and RISEN achieve statistically equivalent high GA at the current evaluation resolution (D=4.930, E=4.978, F=4.983; equivalence margin ±0.2).
  - Encoding overhead observed: in at least one condition (GPT-4o, Japanese, D), high-dimensional structured encoding can reduce performance below the unstructured baseline, suggesting an execution-capacity boundary.
- User study (N=50, within-subject, real tasks):
  - AI-expanded 5W3H reduced median/mean interaction rounds from 4.05 → 1.62 (~60% reduction).
  - User satisfaction rose from 3.16 → 4.04 (on 5-point scale).
  - 82% of users needed to adjust at most two of the eight 5W3H dimensions after AI expansion.
- Practical notes:
  - Raw JSON (structure without readable rendering) performed worst overall (B mean = 4.141), indicating readability/format matters.
  - PPS includes protocol-like metadata (version, timestamps, Instruction ID, SHA-256 fingerprint) to enable traceability and reproducibility.
  - All outputs were judged on goal alignment by an independent judge model (DeepSeek-V3); evaluations used GA only (1–5).
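
The headline ratios in the key points can be re-derived from the reported figures. The following is a minimal sketch that only repeats the arithmetic; the variable names are illustrative, and every number is copied from the summary above rather than recomputed from raw data.

```python
from itertools import product

# Design factors as listed in the summary.
models = ["Claude", "GPT-4o", "Gemini 2.5 Pro"]
languages = ["ZH", "EN", "JA"]
conditions = ["A", "B", "C", "D", "E", "F"]
domains = ["Travel", "Business", "Technical"]
tasks_per_domain = 20

# One output per (model, language, condition, domain, task) cell.
grid = product(models, languages, conditions, domains, range(tasks_per_domain))
n_outputs = sum(1 for _ in grid)

# Reported figures, copied from the summary (not recomputed from raw scores).
sigma_unstructured, sigma_structured = 0.470, 0.020
rounds_before, rounds_after = 4.05, 1.62
gain_gemini, gain_claude = 1.006, 0.217

sigma_ratio = sigma_unstructured / sigma_structured
round_reduction = (rounds_before - rounds_after) / rounds_before
gain_ratio = gain_gemini / gain_claude

print(n_outputs)                  # 3240
print(round(sigma_ratio, 1))      # 23.5, i.e. the "~24x" figure
print(round(round_reduction, 2))  # 0.6, i.e. the "~60%" figure
print(round(gain_ratio, 1))       # 4.6
```

The last line shows that the reported D-A gain for Gemini is roughly 4.6 times the gain for Claude, which is the size of the weak-model compensation pattern in these figures.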
Data & Methods
- Tasks & domains: 60 tasks (20 Travel, 20 Business, 20 Technical) designed for substantive generation (500–3,000 words).
- Conditions:
  - A: Single-sentence prompt (baseline).
  - B: Raw JSON PPS (machine-structured, not natural-language).
  - C: Manual 5W3H (expert authored).
  - D: AI-expanded 5W3H (the user supplies only the What dimension and the AI generates the remaining 7 dimensions via lateni.com; base expansion model: Qwen-Max).
  - E: CO-STAR formatted prompts.
  - F: RISEN formatted prompts.
- Models: Claude-sonnet-4, GPT-4o, Gemini 2.5 Pro; temperature = 0.0 for all runs.
- Languages: Chinese, English, Japanese (parallel tasks).
- Evaluation: Single independent judge model (DeepSeek-V3) scored goal alignment (GA) 1–5 and provided reasoning; judge is architecturally independent from test models and expansion model.
- Scale: This study produced 3,240 evaluated outputs (3 models × 3 languages × 6 conditions × 3 domains × 20 tasks); together with records from the earlier papers, PPS-Bench totals 5,400 records.
- Statistical checks: TOST equivalence tests (δ=0.2) used for framework comparisons; Kruskal-Wallis for model-group differences.
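
To make the equivalence check concrete, here is a rough sketch of a TOST (two one-sided tests) procedure with the δ=0.2 margin mentioned above. It assumes independent samples and a large-sample normal approximation; the summary does not state which test statistic the paper used, so `tost_equivalent` is a hypothetical helper and the scores below are synthetic, not study data.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def tost_equivalent(x, y, delta=0.2, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two means.

    Assumption: independent samples and a normal (z) approximation.
    Returns (p_value, is_equivalent).
    """
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0:
        # Degenerate case: no variation in either sample.
        return (0.0, True) if abs(diff) < delta else (1.0, False)
    z = NormalDist()
    # H0a: diff <= -delta; rejected when diff lies clearly above -delta.
    p_lower = 1.0 - z.cdf((diff + delta) / se)
    # H0b: diff >= +delta; rejected when diff lies clearly below +delta.
    p_upper = z.cdf((diff - delta) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected.
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Synthetic 1-5 scores for illustration only (not data from the study).
scores_e = [5] * 95 + [4] * 5    # mean 4.95
scores_f = [5] * 97 + [4] * 3    # mean 4.97
scores_a = [5] * 50 + [4] * 50   # mean 4.50

_, e_vs_f = tost_equivalent(scores_e, scores_f)
_, e_vs_a = tost_equivalent(scores_e, scores_a)
print(e_vs_f, e_vs_a)  # True False
```

Unlike a conventional difference test, TOST treats "not equivalent" as the null hypothesis, so a small p-value supports equivalence within ±δ rather than merely failing to detect a difference.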
Implications for AI Economics
- Standardization & interoperability
  - Structured intent representations behave like application-layer protocols that reduce output variability across heterogeneous models and languages. This lowers switching costs and frictions when users move between model providers, potentially increasing price competition and commoditization of some output services.
  - A standardized intent layer can enable marketplaces for intent-processing tools (intent expanders, validators, converters) that sit above model providers.
- Productization & specialization
  - Because structured intent reduces reliance on the strongest models for many tasks, firms may shift to cheaper/“weaker” models plus standardized intent tooling. This supports business models that combine modest LLM compute with high-value intent engineering services.
  - Conversely, providers of top-tier models may need to emphasize capabilities beyond basic intent-following (e.g., deeper reasoning, grounded knowledge, safety guarantees) to sustain premium pricing.
- Labor and productivity effects
  - The measured ~60% reduction in interaction rounds and the increase in satisfaction suggest direct productivity gains for knowledge workers using AI. Time saved translates to labor cost reductions or higher output per worker, influencing firm-level labor demand and pricing for AI-assisted services.
  - The availability of AI-assisted intent expansion lowers the skill/knowledge barrier for effective prompt engineering, broadening adoption among non-expert users and expanding market size for AI tools.
- Market structure & complementary goods
  - Structured-intent tools (formatters, expanders, intent repositories, verification services) become valuable complements; firms can monetize template libraries, intent-standards compliance, or intent-translation services.
  - Certification/benchmarking ecosystems (gold-intent benchmarks, finer-grained alignment metrics) will be needed; ownership of these benchmarks may confer informational rents.
- Competition & model valuation
  - The weak-model compensation effect implies diminishing marginal value of raw model capability for tasks where structured intent suffices. This could compress willingness-to-pay differences across model tiers for many grounded content generation tasks.
  - However, the observed encoding overhead boundary cautions that high-dimensional intent may still demand sufficient execution capacity; tasks combining complex structure, non-primary languages, or heavy reasoning will preserve a premium for higher-capability models.
- Regulatory and standardization implications
  - Protocol-like intent standards (with metadata and fingerprints) improve auditability and reproducibility, facilitating regulatory compliance, provenance tracking, and liability assignment, all of which are economically relevant for enterprise adoption.
- Research & measurement market
  - The paper highlights the need for external gold-intent benchmarks and finer-grained evaluation; the creation and control of such benchmarks will be an economic locus (governing market trust and benchmarking rents).
Overall, structured intent representation is likely to shift some value away from raw model capability toward intent-layer tooling, benchmarks, and protocolization — reshaping product offerings, competitive positioning, and the economics of AI-enabled knowledge work.
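
The auditability argument rests on the protocol-like metadata mentioned in the key points (version, timestamps, Instruction ID, SHA-256 fingerprint). As a hedged illustration of how such an envelope might work, the sketch below wraps a 5W3H intent and fingerprints it. The field names, the dimension labels, and the canonical-JSON step are all assumptions made for this example; the PPS specification may define them differently.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

# Hypothetical labels for the eight 5W3H dimensions (not taken from the paper).
DIMENSIONS = ("what", "why", "who", "when", "where", "how", "how_much", "how_many")


def _fingerprint(intent):
    # Canonical JSON (sorted keys, fixed separators) so equal intents hash equally.
    canonical = json.dumps(intent, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_envelope(intent, version="0.1"):
    """Wrap a 5W3H intent in protocol-like metadata (illustrative layout)."""
    missing = [d for d in DIMENSIONS if d not in intent]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return {
        "version": version,
        "instruction_id": str(uuid.uuid4()),
        "created_at": datetime.now(timezone.utc).isoformat(),
        "fingerprint": _fingerprint(intent),
        "intent": intent,
    }


def verify(envelope):
    """Recompute the fingerprint to detect edits to the intent body."""
    return _fingerprint(envelope["intent"]) == envelope["fingerprint"]


# Made-up travel intent, for illustration only.
intent = {
    "what": "Plan a three-day trip to Kyoto",
    "why": "Family holiday",
    "who": "Two adults, one child",
    "when": "Early April",
    "where": "Kyoto, Japan",
    "how": "Day-by-day itinerary",
    "how_much": "Mid-range budget",
    "how_many": "Three days",
}
envelope = build_envelope(intent)
print(verify(envelope))             # True
envelope["intent"]["when"] = "October"
print(verify(envelope))             # False
```

Hashing a canonical serialization rather than the raw text means that two semantically identical intents yield the same fingerprint regardless of key order, which is what makes the fingerprint usable for provenance checks.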
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Prior work showed that PPS (Prompt Protocol Specification), a 5W3H-based structured intent framework, improves goal alignment in Chinese and generalizes to English and Japanese. | Output Quality | positive | high | goal alignment (language generalization) | 0.48 |
| The study evaluated 3,240 model outputs (3 languages x 6 conditions x 3 models x 3 domains x 20 tasks) using an independent judge (DeepSeek-V3). | Other | null_result | high | number of model outputs evaluated / evaluation procedure | n=3240; 0.48 |
| Structured prompting substantially reduces cross-language score variance relative to unstructured baselines. | Output Quality | positive | high | cross-language score variance (sigma) | n=3240; cross-language sigma reduced from 0.470 to about 0.020; 0.48 |
| The strongest structured conditions reduce cross-language sigma from 0.470 to about 0.020. | Output Quality | positive | high | cross-language sigma (standard deviation of scores across languages) | n=3240; 0.470 -> ~0.020; 0.48 |
| A weak-model compensation pattern was observed: the lowest-baseline model (Gemini) shows a much larger D-A gain (+1.006) than the strongest model (Claude, +0.217). | Output Quality | positive | high | D-A gain (improvement in goal-alignment score from structured prompting) | n=1080; +1.006 (Gemini) vs +0.217 (Claude); 0.48 |
| Under the current evaluation resolution, 5W3H, CO-STAR, and RISEN achieve similarly high goal-alignment scores, suggesting that dimensional decomposition itself is an important active ingredient. | Output Quality | null_result | high | goal-alignment scores | n=3240; 0.48 |
| The user study had N=50 participants. | Other | null_result | high | user study sample size | n=50; 0.48 |
| In the user study, AI-expanded 5W3H prompts reduce interaction rounds by 60 percent. | Task Completion Time | positive | high | interaction rounds (number of back-and-forth interactions to reach goal) | n=50; 60 percent reduction; 0.48 |
| In the user study, AI-expanded 5W3H prompts increase user satisfaction from 3.16 to 4.04. | Worker Satisfaction | positive | high | user satisfaction (rating scale) | n=50; 3.16 -> 4.04; 0.48 |
| These findings support the practical value of structured intent representation as a robust, protocol-like communication layer for human-AI interaction. | Organizational Efficiency | positive | medium | practical utility / robustness of structured intent representations | n=3240; 0.29 |