A new 67.5-hour benchmark finds current multimodal models struggle to infer users' GUI behavior and intent — scoring about 44.6% on behavior-state detection and 55.0% on help prediction — yet supplying structured user context can raise help-prediction accuracy by up to 50.2 percentage points.

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim · March 26, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

GUIDE is a 67.5-hour multimodal benchmark for GUI intent understanding that shows state-of-the-art models perform poorly at detecting behavior and predicting help (≈44.6% and 55.0% accuracy), but benefit substantially when given structured user context.

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state and help prediction. However, providing user context significantly improved the performance, raising help prediction by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.

Summary

Main Finding

GUIDE introduces a new benchmark and dataset (67.5 hours of screen recordings from 120 novice demonstrations across 10 applications) to evaluate multimodal models on user-centered, open-ended GUI assistance. Current state-of-the-art multimodal LLMs struggle to infer user behavior, intent, and assistance needs from visual GUI traces alone (behavior-state detection ≈ 44.6% accuracy; help prediction ≈ 55.0% accuracy). However, providing structured, human-grounded user context (e.g., behavior state and intent, temporal history) substantially improves assistance decisions, raising help-prediction accuracy by up to 50.2 percentage points. This indicates that structured user understanding is critical for effective GUI assistance.

Key Points

Purpose: Move beyond full automation to collaborative GUI agents that understand what users are doing and why (preserve agency while assisting).
Dataset (GUIDE):
- 67.5 hours of screen recordings from 120 novice demonstrations.
- 10 widely used applications (Photoshop, Figma, PowerPoint, Premiere Pro, Excel, etc.), 40 open-ended tasks.
- Think-aloud narrations recorded for annotation purposes; evaluation is vision-only (simulating real-world use, where speech may be unavailable).
- Collected from 54 novice participants (each task completed by three different users).
Tasks (three-stage evaluation pipeline):
- Behavior State Detection: classify a video segment into one of 9 behavior states (taxonomy grouped into 4 phases: Planning, Execution, Problem-Solving, Evaluation).
- Intent Prediction: infer the user’s short-term, immediate goal.
- Help Prediction: (a) binary Help-Need Detection and (b) Help-Content Prediction (type of assistance).
Annotation pipeline:
- Transcriptions via WhisperX; initial annotations and candidate labels generated with Gemini-2.5-Pro; human verification/refinement.
- Quality: behavior-state annotation inter-annotator agreement ≈ 96.1%.
- Final sizes: behavior dataset balanced with 1.8K segments (200 per class), intent dataset ≈ 1.3K validated instances, help dataset ≈ 1.0K validated instances (≈66% labeled as help-needed).
Baseline evaluation:
- Eight zero-shot multimodal models evaluated: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-4o-mini, GPT-4o, Claude-4.5-Sonnet, Qwen3-VL-8B, InternVideo2.5-8B, InternVL3-8B.
- Visual-only input: 32 frames sampled per test segment; no narration provided at test time.
- Results: models performed poorly on behavior-state detection and help prediction (average accuracy ≈ 44.6% and ≈ 55.0%, respectively), with wide inter-model variability. Providing structured context (previous behavior, inferred intent) improved help prediction for all models, by up to 50.2 percentage points.
Dataset and benchmark are publicly available: https://guide-bench.github.io
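
The figures in the list above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this summary (some of them rounded, as marked in the comments); the derived values, such as the average demonstration length and the chance level, are back-of-the-envelope estimates and are not reported by the paper.

```python
# Figures quoted in the summary above; nothing here is read from the dataset files.
HOURS_TOTAL = 67.5
NUM_TASKS = 40
USERS_PER_TASK = 3
NUM_BEHAVIOR_STATES = 9
SEGMENTS_PER_STATE = 200
HELP_NEEDED_SHARE = 0.66  # "≈66%" in the summary, a rounded value


def dataset_summary() -> dict:
    """Derive simple consistency checks and naive baselines from the quoted figures."""
    demonstrations = NUM_TASKS * USERS_PER_TASK
    return {
        # 40 tasks, each completed by three users
        "demonstrations": demonstrations,
        # 9 behavior states, 200 segments each
        "behavior_segments": NUM_BEHAVIOR_STATES * SEGMENTS_PER_STATE,
        # average recording length per demonstration (derived, not reported)
        "minutes_per_demo": HOURS_TOTAL * 60 / demonstrations,
        # accuracy of uniform random guessing on the balanced 9-class task
        "chance_behavior": 1 / NUM_BEHAVIOR_STATES,
        # accuracy of always predicting the majority class on binary help-need
        "majority_help_need": max(HELP_NEEDED_SHARE, 1 - HELP_NEEDED_SHARE),
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value}")
```

Because the behavior dataset is balanced over 9 classes, random guessing would score about 11.1%, so the reported 44.6% average is well above chance while still being low in absolute terms. The summary does not say which help sub-task the 55.0% figure aggregates, so it should not be compared directly with the 66% majority-class figure computed here.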

Data & Methods

Data collection:
- Participants: 54 novices recruited (screened for low-to-moderate expertise).
- Each participant recorded screen, mouse/keyboard events, and think-aloud narration while completing open-ended tasks; tasks were designed to encourage exploration, failure, and iteration.
- Applications: 10 apps across photo editing, graphic design, presentations, video editing, data analysis.
Annotation and dataset creation:
- Transcription: WhisperX used to transcribe think-aloud audio.
- Human-AI collaborative labeling: Gemini-2.5-Pro generated initial labels (behavior states, intents, help signals and distractors), which authors and human annotators reviewed.
- Taxonomy: 9 behavior states derived jointly by the authors and an LLM, validated on the dataset, and mapped to cognitive/interaction theory (e.g., Norman’s stages, Bloom’s taxonomy).
- Data balancing & quality control: sampled uniformly to balance behavior classes (200 each), reduced/cleaned instances for intent/help tasks; retained proportions: intent ≈ 88.7% of initial, help ≈ 78.9% of initial; some segments trimmed to remove explicit external help signals.
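
The class-balancing step described above (uniform sampling down to 200 segments per behavior state) could look roughly like the following. This is a minimal sketch, not the authors' pipeline: the function name `balance_by_label`, the `"label"` key, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_label(segments, per_class, seed=0):
    """Uniformly sample `per_class` segments for every label.

    `segments` is a list of dicts carrying a "label" key (hypothetical schema).
    Raises ValueError if any label has fewer than `per_class` segments.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group segments by their behavior-state label.
    by_label = defaultdict(list)
    for segment in segments:
        by_label[segment["label"]].append(segment)

    balanced = []
    for label in sorted(by_label):  # sorted for a deterministic output order
        pool = by_label[label]
        if len(pool) < per_class:
            raise ValueError(
                f"label {label!r} has only {len(pool)} segments, need {per_class}"
            )
        # Sample without replacement, each segment equally likely.
        balanced.extend(rng.sample(pool, per_class))
    return balanced
```

Called with `per_class=200` on a pool covering 9 labels, this returns 1,800 segments, matching the 1.8K figure quoted for the behavior dataset.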
Experimental protocol:
- Zero-shot evaluation of eight MLLMs via APIs or checkpoints (no fine-tuning).
- Visual input only: models see sampled frames from segments (no audio).
- Evaluated on accuracy for behavior-state, multiple-choice intent prediction, binary help-need and multiple-choice help-content prediction.
- Ablations tested: adding frames from the preceding context, or providing the model with behavior/intent annotations, to measure sensitivity to structured user context.
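
The protocol above can be illustrated with a small sketch. It makes several assumptions that the summary does not confirm: that the 32 frames are spaced uniformly over the segment, that structured context is passed as plain text in the prompt, and that the ablation effect is reported as a difference in accuracy. The function names and the prompt wording are hypothetical.

```python
from __future__ import annotations


def sample_frame_indices(num_frames: int, num_samples: int = 32) -> list[int]:
    """Pick the centre frame of `num_samples` equal-width bins (assumed uniform spacing)."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_frames <= num_samples:
        return list(range(num_frames))  # short segment: keep every frame
    step = num_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


def build_help_prompt(
    options: list[str],
    behavior_state: str | None = None,
    intent: str | None = None,
) -> str:
    """Assemble a multiple-choice prompt; context lines are added only when given."""
    lines = ["The frames show a user working in a GUI application."]
    if behavior_state is not None:
        lines.append(f"Behavior state: {behavior_state}")
    if intent is not None:
        lines.append(f"Immediate intent: {intent}")
    lines.append("Which kind of help, if any, fits best?")
    for i, option in enumerate(options):
        lines.append(f"({chr(ord('A') + i)}) {option}")
    return "\n".join(lines)


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)


def gain_in_pp(acc_without: float, acc_with: float) -> float:
    """Accuracy difference in percentage points (not relative percent)."""
    return (acc_with - acc_without) * 100
```

As a toy illustration with made-up labels, `gain_in_pp(accuracy(["A", "C", "B", "C"], ["A", "B", "A", "C"]), accuracy(["A", "B", "B", "C"], ["A", "B", "A", "C"]))` compares 2 of 4 correct against 3 of 4 correct and yields a 25.0 percentage-point gain. The 50.2pp figure reported for GUIDE is a difference of this kind, measured on the real help-prediction data.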

Implications for AI Economics

Value proposition and productivity:
- High potential economic value: context-aware GUI assistants that correctly infer intent and need can reduce task friction, shorten learning curves for novices, and improve creative/analytical productivity in software suites used widely in knowledge-work industries.
- Heterogeneous gains: novices (the dataset focus) likely experience larger relative gains from assistive support than experts; quantifying time saved and quality improvements will determine ROI for firms bundling such assistants.
Cost of errors and trust:
- Current model brittleness implies nontrivial risks: erroneous or mistimed suggestions can degrade productivity or interrupt creative workflows, reducing user trust and adoption — a direct economic cost for providers.
- Investment required in safer, context-aware models or human-in-the-loop designs to mitigate negative externalities; these investments affect development cost and pricing strategies.
Product design and business models:
- The results suggest firms should favor layered assistance (infer behavior/intent first, then offer suggestions) rather than opaque automation. This supports subscription or premium features for proactive, explainable assistants that maintain user control.
- Structured user-context representations (behavior states, short-term intent) are commercially valuable assets; companies that can reliably infer and store such context can offer higher-value personalization and targeted upsells.
Labor and skill impacts:
- Augmentation > replacement: for exploratory, creative work, assistants that preserve agency are more likely to augment human labor (increase productivity) than to replace workers, shifting labor demand toward higher-level creative and evaluative tasks.
- Training and reskilling demand may shift: as assistants handle routine GUI subtasks, demand for users skilled in supervising and leveraging these tools (prompting, reviewing suggestions) may rise.
Measurement and evaluation considerations:
- Economic evaluation of such assistants should measure not just time savings but also effects on creativity, satisfaction, error rates, and rework — all of which have welfare implications beyond simple productivity metrics.
- Because providing structured context boosts effectiveness dramatically, ROI analyses should include costs of collecting, annotating, and maintaining user-context signals (e.g., lightweight telemetry, opt-in provenance) and the privacy/regulatory constraints tied to them.
Market and regulatory considerations:
- Privacy and consent: collecting think-aloud or fine-grained interaction data raises privacy costs and compliance requirements; economic models must account for data governance overhead.
- Competitive advantage: firms that acquire rich datasets of novice workflows and build robust intent-inference will gain a differentiation edge; however, this creates potential lock-in and market-power concerns if access to proprietary datasets is key to assistant quality.

Overall, GUIDE highlights both the promise and current limitations of deploying multimodal, user-aware GUI assistants. From an economic perspective, the key levers for commercial success are reliable, structured inference of user state/intent, careful UX that preserves agency, and governance of interaction data — all of which affect costs, adoption, and the nature of labor augmentation in knowledge-work markets.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This is a dataset/benchmark and model-evaluation paper rather than an empirical causal study; it does not attempt causal identification of economic effects.
Methods Rigor: medium. The paper assembles a sizeable multimodal dataset (67.5 hours, 120 novice user demonstrations, think-aloud narrations) across 10 applications and defines three tasks with evaluations on eight state-of-the-art models, which is methodologically solid for a benchmark. Its limits include potential annotation subjectivity, novice-only participants, limited ecological validity of recorded tasks, modest sample size for some variation analyses, and unclear diversity across users and task scenarios.
Sample: 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations across 10 different desktop software applications; annotated for three tasks (Behavior State Detection, Intent Prediction, Help Prediction); eight contemporary multimodal models evaluated.
Themes: human_ai_collab, productivity
Generalizability:
- Participants are novices only; results may not generalize to expert users.
- Ten software packages cover many but not all GUI types and industries.
- Laboratory-recorded/demo tasks with think-aloud may change natural behavior (reactivity).
- Sample size and demographic diversity are not reported in detail, limiting population-level generalization.
- Evaluations are limited to the specific set of multimodal models and prompts tested; future models may perform differently.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software.	Research Productivity	positive	high	dataset size and composition (hours, number of demonstrations, software covered)	n=120 67.5 hours; 120 demonstrations; 10 software 0.3
GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help.	Decision Quality	positive	high	task definitions evaluating model capabilities (behavior detection, intent prediction, help prediction)	0.3
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 44.6% accuracy on behavior state detection.	Decision Quality	negative	high	behavior state detection accuracy	44.6% accuracy 0.18
Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 55.0% accuracy on help prediction.	Decision Quality	negative	high	help prediction accuracy	55.0% accuracy 0.18
Providing user context significantly improved the performance, raising help prediction by up to 50.2pp.	Decision Quality	positive	high	improvement in help prediction accuracy when user context is provided	up to 50.2pp 0.18
Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop).	Organizational Efficiency	positive	high	potential for GUI agents to assist users	0.09
While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency.	Worker Satisfaction	mixed	medium	alignment of prior research focus with user values (automation vs. intention-preserving collaboration)	0.05
Our dataset is available at https://guide-bench.github.io.	Research Productivity	positive	high	dataset availability / accessibility	https://guide-bench.github.io 0.3