Letting users mark constraints as hard rules or soft preferences — and verifying each with a matching technique — makes LLM-generated plans more reliable and usable; in lab studies U-Define improved task success, perceived usefulness, and satisfaction compared with prior approaches.

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

Christine P Lee, Xinyu Jessica Wang, Aws Albarghouthi, David Porfirio, Bilge Mutlu · May 04, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

Labeling user constraints as hard rules or soft preferences and verifying them with complementary methods (formal model checking for hard constraints, LLM-as-judge for soft ones) improves task performance, perceived usefulness, and user satisfaction in technical and user-study evaluations.

LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users. We investigate how interaction workflows can better support users in applying constraints to guide LLM-generated plans, examining whether abstracting strictness into high-level types (i.e., hard and soft) paired with distinct verification mechanisms helps users more reliably express and align intent. We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. U-Define verifies these types through complementary methods: formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. Through a technical evaluation and user studies with general and expert participants, we find that user-defined constraint types improve perceived usefulness, performance, and satisfaction while maintaining usability. These findings provide insights for designing flexible yet reliable constraint-based workflows.

Summary

Main Finding

U-Define is a workflow and prototype system that lets end users author natural-language constraints and label them as hard (must never be violated) or soft (preferences that may be relaxed). By verifying hard constraints with formal model checking and soft constraints via LLM-as-judge evaluation, U-Define combines deterministic guarantees with flexible LLM planning. Across a technical evaluation and user studies (general users and domain experts), user-defined constraint types improved perceived usefulness, plan quality, and satisfaction while maintaining usability.

Key Points

Problem addressed: LLMs are accessible planners but unreliable and opaque; prior verification approaches are either rigid (hard-only) or confusing (numeric weights).
Design idea: expose a simple, high-level strictness distinction—hard vs. soft—and pair each type with a verification method suited to its semantics.
- Hard constraints → translated to formal specifications (e.g., temporal/logical properties) and checked by model checking (deterministic guarantees).
- Soft constraints → evaluated by LLM-as-judge (graded, flexible assessments of preference satisfaction).
User workflow: users include natural-language constraints with their planning prompt, mark each as hard or soft, receive verification results (met/violated/degree), and iteratively refine or relax constraints.
Technical contribution: an automated pipeline that translates natural-language hard constraints into verifiable formal artifacts using LLMs (reducing the need for domain-specific templates).
Empirical findings: participants treated hard and soft constraints as distinct and useful; combined use raised perceived plan quality and satisfaction versus baseline/rigid alternatives without degrading usability.
Practical benefits: enables reliable enforcement where needed (safety/legal/strict rules) and flexible personalization elsewhere, aligning verification fidelity with user intent.
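
The hard/soft split described above can be sketched as a small data model plus a dispatcher. This is a minimal illustration, not the authors' implementation: the names `Strictness`, `Constraint`, `Result`, `verify`, and `plan_accepted`, the `soft_threshold` parameter, and the representation of a plan as a list of step names are all assumptions. The two verifier callables stand in for the model checker and the LLM judge.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

Plan = List[str]  # assumption: a plan is an ordered list of step names


class Strictness(Enum):
    HARD = "hard"  # must never be violated
    SOFT = "soft"  # preference that may be relaxed


@dataclass
class Constraint:
    text: str
    strictness: Strictness


@dataclass
class Result:
    constraint: Constraint
    satisfied: bool
    score: float  # 1.0 or 0.0 for hard constraints, graded for soft ones


def verify(
    plan: Plan,
    constraints: List[Constraint],
    hard_check: Callable[[Plan, Constraint], bool],
    soft_judge: Callable[[Plan, Constraint], float],
    soft_threshold: float = 0.5,  # hypothetical cutoff, not from the paper
) -> List[Result]:
    """Route each constraint to the verifier that matches its type."""
    results = []
    for c in constraints:
        if c.strictness is Strictness.HARD:
            ok = hard_check(plan, c)  # deterministic pass/fail
            results.append(Result(c, ok, 1.0 if ok else 0.0))
        else:
            score = soft_judge(plan, c)  # graded degree of satisfaction
            results.append(Result(c, score >= soft_threshold, score))
    return results


def plan_accepted(results: List[Result]) -> bool:
    """A plan is acceptable only if every hard constraint holds."""
    return all(
        r.satisfied for r in results if r.constraint.strictness is Strictness.HARD
    )
```

In this sketch an unmet soft constraint is reported but does not reject the plan, which mirrors the distinction between rules and preferences; how a real system weighs soft results is a design choice the summary does not specify.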

Data & Methods

System/prototype: U-Define prototype that (1) accepts free-form natural language constraints, (2) asks users to tag hard vs. soft, (3) invokes an LLM translation step to produce formal specifications for hard constraints, (4) runs model checking on generated plans for hard constraints, and (5) runs LLM-based judgment for soft constraints; presents verification outcomes and allows iterative edits.
Technical evaluation: assessed the automated translation/verification pipeline (quality and coverage of NL→formal specs, ability to detect hard-constraint violations) and the behavior of the two verification paths. (The paper reports a technical evaluation; the specific metrics and benchmarks are in the full text.)
User studies:
- General user study: recruited non-expert end users to perform representative planning tasks with U-Define; measured perceived usefulness, plan performance, satisfaction, and usability.
- Expert study: domain experts used the system on domain-relevant planning tasks to evaluate practical utility and trust in hard constraints.
Outcome measures: subjective ratings (usefulness, satisfaction), usability assessments, and task-level performance (i.e., whether produced plans satisfied users’ constraints as indicated by the verification pipeline). The studies showed improved perceptions and plan alignment when users could define constraint types.
Limitations noted by authors: reliance on LLM translation accuracy for formalization, potential biases/limitations of LLM-as-judge evaluations, and remaining challenges in mapping nuanced NL constraints to formal properties.
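
To make the hard-constraint path concrete, the sketch below checks two common temporal properties over a single linear plan. It is a deliberately simplified stand-in: a real model checker explores the state space of a formal model, and the summary does not say which checker or specification language U-Define uses. The function names `never` and `precedes` and the list-of-step-names plan format are assumptions for illustration.

```python
from typing import List, Optional


def never(trace: List[str], action: str) -> Optional[int]:
    """Check 'action never occurs'.

    Returns the index of the first violating step, or None if the
    property holds on this trace.
    """
    for i, step in enumerate(trace):
        if step == action:
            return i
    return None


def precedes(trace: List[str], first: str, then: str) -> Optional[int]:
    """Check 'then does not occur until first has occurred'.

    Returns the index of the first violating step, or None if the
    property holds on this trace.
    """
    seen_first = False
    for i, step in enumerate(trace):
        if step == first:
            seen_first = True
        elif step == then and not seen_first:
            return i
    return None


# Usage: a plan that serves breakfast before the medication step.
plan = ["wake", "eat_breakfast", "take_medication"]
violation = precedes(plan, first="take_medication", then="eat_breakfast")
```

Returning the violating step index rather than a bare boolean gives the user something to act on when refining or relaxing a constraint, in the spirit of the counterexamples that model checkers produce.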

Implications for AI Economics

Value of hybrid verification services: Systems that combine deterministic verification for strict rules with generative flexibility for preferences create new product niches (enterprise planning tools, compliance-aware AI assistants). Firms may be willing to pay premiums for verifiable guarantees on critical constraints (safety, legal, SLAs).
Labor and task allocation: By shifting routine verification of soft preferences to LLM-as-judge and hard-rule enforcement to automated model checking, U-Define-like workflows can reduce human review costs for many planning tasks while preserving human oversight for corner cases—changing where and how humans are paid to intervene.
Risk management and liability: Deterministic checking of hard constraints lowers the probability of costly violations, reducing firms’ operational risk and potentially lowering insurance/compliance costs. This may accelerate adoption of LLM-based planning in regulated domains.
Market for specialized tooling and expertise: Demand will grow for tools that (a) robustly translate NL constraints into formal specs, (b) integrate model checking into user-facing UIs, and (c) audit LLM-as-judge performance. This raises value for firms offering combined verification+LLM platforms and for consultants who craft constraint libraries and mappings.
Pricing and product differentiation: Vendors can tier services—basic LLM planning for low-cost usage, plus paid guarantees (hard-constraint verification, audit logs, certified translations) for higher-priced, higher-assurance plans.
Productivity and adoption trade-offs: Greater perceived usefulness and satisfaction can increase adoption of LLM planning tools, improving productivity. However, translation/verification costs and residual error risks (from mis-translated constraints or biased LLM judgments) create frictions that affect ROI calculations for different adopters and tasks.
Research & regulatory implications: Policymakers and auditors may begin to prefer or require verifiable hard-constraint enforcement in certain industries; economic incentives will push vendors to standardize formal-specification interfaces and verification certifications.

Potential follow-ups for economic analysis: quantify time/cost savings from reduced manual verification, estimate willingness-to-pay for hard-constraint guarantees across sectors, and model the market impact of verification-as-a-service bundled with LLM planning.


Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper provides experimental user-study evidence and a technical evaluation showing consistent improvements, which supports causal interpretation at the task/interface level; however, sample sizes, participant recruitment details, real-world deployment, and long-term effects are not provided (or are limited), constraining confidence and external validity.
Methods Rigor: medium. The study uses a mixed approach (technical benchmarks + user studies with both general and expert participants) and compares clear verification methods, which demonstrates methodological breadth; but potential weaknesses include unclear randomization/preregistration, likely small or convenience samples, short tasks in lab settings, possible learning or order effects, and dependence on specific LLMs and task types.
Sample: Technical evaluation on benchmark planning/constraint tasks comparing formal model-checking and LLM-judge verification; user studies with both general participants and expert participants performing end-user planning tasks under different interface/workflow conditions (numbers, recruitment channels, and exact demographics not specified in the summary).
Themes: human_ai_collab, productivity
Identification: Controlled user studies comparing the U-Define workflow to alternative/previous workflows (likely between- or within-subjects), combined with a technical evaluation that compares formal model checking vs LLM-as-judge verification; causal claims rest on experimental contrasts in the user studies (objective task performance and subjective ratings) alongside complementary technical benchmarks.
Generalizability: Lab-style user studies may not reflect real-world, long-duration use; the participant pool (general and experts) may be small or non-representative; results may depend on the specific LLM(s) and models used; task domains used for evaluation may be narrow and not cover all planning contexts; effect on organizational outcomes or productivity at scale is not measured; cultural and language differences in natural-language constraints are not addressed.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control.	Other	negative	high	reliability and control over LLM outputs	0.48
Prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users.	Other	negative	high	usability of constraint specification (rigidity and understandability of numeric flexibility weights)	0.48
We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility.	Other	positive	high	ability to specify constraints (natural-language input and categorization into hard/soft)	0.8
U-Define verifies hard constraints using formal model checking and verifies soft constraints using an LLM-as-judge evaluation.	Other	positive	high	verification of constraint types (hard via model checking, soft via LLM evaluation)	0.8
We conducted a technical evaluation and user studies with general and expert participants.	Other	null_result	high	conduct of evaluations (technical and user studies)	0.8
User-defined constraint types improve perceived usefulness.	Adoption Rate	positive	high	perceived usefulness (user-reported)	0.48
User-defined constraint types improve performance.	Output Quality	positive	high	performance (task success / quality of generated plans)	0.48
User-defined constraint types improve user satisfaction.	Worker Satisfaction	positive	high	user satisfaction (self-reported)	0.48
User-defined constraint types maintain usability.	Organizational Efficiency	null_result	high	usability	0.48
These findings provide insights for designing flexible yet reliable constraint-based workflows.	Governance And Regulation	positive	high	design guidance for constraint-based workflows	0.08