A new 67.5-hour benchmark finds current multimodal models struggle to infer users' GUI behavior and intent — scoring about 44.6% on behavior-state detection and 55.0% on help prediction — yet supplying structured user context can raise help-prediction accuracy by up to 50.2 percentage points.
Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. To move beyond automation and toward collaboration, GUI agents must understand what users are doing and why. We introduce GUIDE (GUI User Intent Detection Evaluation), a benchmark that evaluates AI models on their ability to perceive user behavior, infer intent, and provide assistance in open-ended GUI tasks. GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications. GUIDE defines three tasks, (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction, that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. Evaluations across eight state-of-the-art multimodal models reveal that all models struggled, achieving only 44.6% and 55.0% accuracy on behavior state detection and help prediction, respectively. However, providing user context significantly improved performance, raising help prediction accuracy by up to 50.2pp, highlighting the critical role of structured user understanding in effective assistance. Our dataset is available at https://guide-bench.github.io.
Summary
Main Finding
GUIDE introduces a new benchmark and dataset (67.5 hours of screen recordings from 120 novice demonstrations across 10 applications) to evaluate multimodal models on user-centered, open-ended GUI assistance. Current state-of-the-art multimodal LLMs struggle to infer user behavior, intent, and assistance needs from visual GUI traces alone (behavior-state detection ≈ 44.6% accuracy; help prediction ≈ 55.0% accuracy). However, providing structured, human-grounded user context (e.g., inferred behavior state and intent, temporal history) substantially improves assistance decisions, raising help-prediction accuracy by up to 50.2 percentage points. This indicates that structured user understanding is critical for effective GUI assistance.
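
To put the two headline accuracies in context, it can help to compute the trivial baselines implied by the dataset composition reported in this summary: nine balanced behavior classes and roughly 66% of help instances labeled help-needed. The sketch below is illustrative only. The function names are hypothetical, and this summary does not say whether the 55.0% help-prediction figure covers help-need detection, help-content prediction, or both, so the majority baseline applies to the binary sub-task alone.

```python
def uniform_chance(num_classes: int) -> float:
    """Expected accuracy of random guessing over equally likely classes."""
    if num_classes < 1:
        raise ValueError("num_classes must be at least 1")
    return 1.0 / num_classes


def majority_baseline(positive_share: float) -> float:
    """Accuracy of always predicting the more common of two labels."""
    if not 0.0 <= positive_share <= 1.0:
        raise ValueError("positive_share must lie in [0, 1]")
    return max(positive_share, 1.0 - positive_share)


# Assumption: 9 balanced behavior states, as reported for the behavior dataset.
behavior_chance = uniform_chance(9)
# Assumption: about 66% of help instances are labeled help-needed.
help_need_majority = majority_baseline(0.66)

print(f"behavior-state chance level: {behavior_chance:.1%}")
print(f"help-need majority baseline: {help_need_majority:.1%}")
```

Under these assumptions, random guessing over nine states scores about 11.1%, so 44.6% is well above chance yet far from reliable. Always answering "help needed" would score about 66% on the binary sub-task, provided the evaluation set has the same label split.
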
Key Points
- Purpose: Move beyond full automation to collaborative GUI agents that understand what users are doing and why (preserve agency while assisting).
- Dataset (GUIDE):
  - 67.5 hours of screen recordings from 120 novice demonstrations.
  - 10 widely used applications (Photoshop, Figma, PowerPoint, Premiere Pro, Excel, etc.), 40 open-ended tasks.
  - Think-aloud narrations recorded for annotation purposes; evaluation is vision-only (simulating real-world settings where speech may be unavailable).
  - Collected from 54 novice participants (each of the 40 tasks completed by three different users, giving 120 demonstrations).
- Tasks (three-stage evaluation pipeline):
  - Behavior State Detection: classify a video segment into one of 9 behavior states (taxonomy grouped into 4 phases: Planning, Execution, Problem-Solving, Evaluation).
  - Intent Prediction: infer the user’s short-term, immediate goal.
  - Help Prediction: (a) binary Help-Need Detection and (b) Help-Content Prediction (type of assistance).
- Annotation pipeline:
  - Transcriptions via WhisperX; initial annotations and candidate labels generated with Gemini-2.5-Pro; human verification/refinement.
  - Quality: behavior-state annotation inter-annotator agreement ≈ 96.1%.
  - Final sizes: behavior dataset balanced with 1.8K segments (200 per class), intent dataset ≈ 1.3K validated instances, help dataset ≈ 1.0K validated instances (≈66% labeled as help-needed).
- Baseline evaluation:
  - Eight zero-shot multimodal models evaluated: Gemini-2.5-Flash, Gemini-2.5-Pro, GPT-4o-mini, GPT-4o, Claude-4.5-Sonnet, Qwen3-VL-8B, InternVideo2.5-8B, InternVL3-8B.
  - Visual-only input: 32 frames sampled per test segment; no narration provided at test time.
  - Results: models performed poorly on behavior-state and help prediction tasks (behavior detection average ~44.6%, help prediction average ~55.0%), with wide inter-model variability. Providing structured context (previous behavior, inferred intent) greatly improved help prediction performance for all models.
- Dataset and benchmark are publicly available: https://guide-bench.github.io
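
The visual-only input described above (32 frames per test segment) can be sketched as a frame-index sampler. This summary does not say how the 32 frames are chosen, so evenly spaced sampling is an assumption here, and `uniform_frame_indices` is a hypothetical helper rather than part of the benchmark's code.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 32) -> list[int]:
    """Return evenly spaced frame indices, one from the middle of each equal-width bin.

    Assumption: uniform spacing; the benchmark's actual sampling rule is not given here.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        # Short segment: use every frame rather than repeating any.
        return list(range(total_frames))
    step = total_frames / num_samples
    # The midpoint of bin i is step * (i + 0.5); truncating keeps it inside the segment.
    return [int(step * (i + 0.5)) for i in range(num_samples)]


# Example: a segment of 64 decoded frames yields every second frame.
indices = uniform_frame_indices(64)
print(len(indices), indices[:4])
```

Taking bin midpoints rather than bin starts avoids always including the first frame and keeps the samples centered within the segment.
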
Data & Methods
- Data collection:
  - Participants: 54 novices recruited (screened for low-to-moderate expertise).
  - Each participant recorded screen, mouse/keyboard events, and think-aloud narration while completing open-ended tasks; tasks designed to encourage exploration, failure, iteration.
  - Applications: 10 apps across photo editing, graphic design, presentations, video editing, data analysis.
- Annotation and dataset creation:
  - Transcription: WhisperX used to transcribe think-aloud audio.
  - Human-AI collaborative labeling: Gemini-2.5-Pro generated initial labels (behavior states, intents, help signals and distractors), which authors and human annotators reviewed.
  - Taxonomy: 9 behavior states derived jointly by the authors and an LLM, validated on the dataset and mapped to cognitive/interaction theory (e.g., Norman’s stages, Bloom’s taxonomy).
  - Data balancing & quality control: sampled uniformly to balance behavior classes (200 each), reduced/cleaned instances for intent/help tasks; retained proportions: intent ≈ 88.7% of initial, help ≈ 78.9% of initial; some segments trimmed to remove explicit external help signals.
- Experimental protocol:
  - Zero-shot evaluation of eight MLLMs via APIs or checkpoints (no fine-tuning).
  - Visual input only: models see sampled frames from segments (no audio).
  - Metric: accuracy on behavior-state detection, multiple-choice intent prediction, binary help-need detection, and multiple-choice help-content prediction.
  - Ablations tested: adding previous context frames, or providing the model with behavior/intent annotations to measure sensitivity to structured user context.
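
The class-balancing step and the accuracy metric described above can be sketched as follows. This is a minimal illustration under stated assumptions: segments are represented as `(segment_id, label)` pairs, and `balance_by_class` and `accuracy` are hypothetical helpers, not the authors' code.

```python
import random
from collections import defaultdict


def balance_by_class(items, per_class, seed=0):
    """Draw `per_class` items uniformly at random from each label group.

    `items` is a list of (segment_id, label) pairs (an assumed representation).
    """
    by_label = defaultdict(list)
    for segment_id, label in items:
        by_label[label].append((segment_id, label))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} items")
        balanced.extend(rng.sample(group, per_class))
    return balanced


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Toy usage with two classes; the benchmark reports 200 per class over 9 classes.
toy = [(i, "planning") for i in range(5)] + [(i, "executing") for i in range(10)]
subset = balance_by_class(toy, per_class=3)
print(len(subset))
```

With nine classes and `per_class=200`, the same routine would return the 1.8K segments reported for the behavior dataset.
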
Implications for AI Economics
- Value proposition and productivity:
  - High potential economic value: context-aware GUI assistants that correctly infer intent and need can reduce task friction, shorten learning curves for novices, and improve creative/analytical productivity in software suites used widely in knowledge-work industries.
  - Heterogeneous gains: novices (the dataset focus) likely experience larger relative gains from assistive support than experts; quantifying time saved and quality improvements will determine ROI for firms bundling such assistants.
- Cost of errors and trust:
  - Current model brittleness implies nontrivial risks: erroneous or mistimed suggestions can degrade productivity or interrupt creative workflows, reducing user trust and adoption, which is a direct economic cost for providers.
  - Investment required in safer, context-aware models or human-in-the-loop designs to mitigate negative externalities; these investments affect development cost and pricing strategies.
- Product design and business models:
  - The results suggest firms should favor layered assistance (infer behavior/intent first, then offer suggestions) rather than opaque automation. This supports subscription or premium features for proactive, explainable assistants that maintain user control.
  - Structured user-context representations (behavior states, short-term intent) are commercially valuable assets; companies that can reliably infer and store such context can offer higher-value personalization and targeted upsells.
- Labor and skill impacts:
  - Augmentation > replacement: for exploratory, creative work, assistants that preserve agency are more likely to augment human labor (increase productivity) than to replace workers, shifting labor demand toward higher-level creative and evaluative tasks.
  - Training and reskilling demand may shift: as assistants handle routine GUI subtasks, demand for users skilled in supervising and leveraging these tools (prompting, reviewing suggestions) may rise.
- Measurement and evaluation considerations:
  - Economic evaluation of such assistants should measure not just time savings but also effects on creativity, satisfaction, error rates, and rework, all of which have welfare implications beyond simple productivity metrics.
  - Because providing structured context boosts effectiveness dramatically, ROI analyses should include costs of collecting, annotating, and maintaining user-context signals (e.g., lightweight telemetry, opt-in provenance) and the privacy/regulatory constraints tied to them.
- Market and regulatory considerations:
  - Privacy and consent: collecting think-aloud or fine-grained interaction data raises privacy costs and compliance requirements; economic models must account for data governance overhead.
  - Competitive advantage: firms that acquire rich datasets of novice workflows and build robust intent-inference will gain a differentiation edge; however, this creates potential lock-in and market-power concerns if access to proprietary datasets is key to assistant quality.
Overall, GUIDE highlights both the promise and current limitations of deploying multimodal, user-aware GUI assistants. From an economic perspective, the key levers for commercial success are reliable, structured inference of user state/intent, careful UX that preserves agency, and governance of interaction data — all of which affect costs, adoption, and the nature of labor augmentation in knowledge-work markets.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software. (Research Productivity) | positive | high | dataset size and composition (hours, number of demonstrations, software covered) | n=120; 67.5 hours; 120 demonstrations; 10 software; 0.3 |
| GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help. (Decision Quality) | positive | high | task definitions evaluating model capabilities (behavior detection, intent prediction, help prediction) | 0.3 |
| Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 44.6% accuracy on behavior state detection. (Decision Quality) | negative | high | behavior state detection accuracy | 44.6% accuracy; 0.18 |
| Evaluations across eight state-of-the-art multimodal models reveal that models achieved only 55.0% accuracy on help prediction. (Decision Quality) | negative | high | help prediction accuracy | 55.0% accuracy; 0.18 |
| Providing user context significantly improved the performance, raising help prediction by up to 50.2pp. (Decision Quality) | positive | high | improvement in help prediction accuracy when user context is provided | up to 50.2pp; 0.18 |
| Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). (Organizational Efficiency) | positive | high | potential for GUI agents to assist users | 0.09 |
| Prior research has primarily focused on automating user actions through clicks and keystrokes, this paradigm overlooks human intention, where users value the ability to explore, iterate, and refine their ideas while maintaining agency. (Worker Satisfaction) | mixed | medium | alignment of prior research focus with user values (automation vs. intention-preserving collaboration) | 0.05 |
| Our dataset is available at https://guide-bench.github.io. (Research Productivity) | positive | high | dataset availability / accessibility | https://guide-bench.github.io; 0.3 |