AI agents struggle to automate everyday online tasks: in a 153-task live-website benchmark, leading models complete only a minority of tasks (best at about one-third), highlighting major gaps before agents can reliably replace routine web-based work.
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
Summary
Main Finding
ClawBench introduces a realistic, safety‑preserving benchmark of 153 everyday, write‑heavy web tasks on 144 live websites and shows that state‑of‑the‑art agent models perform far worse on real web tasks than on prior sandbox benchmarks. Top models that achieve ~65–75% on existing offline/sandbox suites complete at most one‑third of ClawBench tasks, and several complete fewer than 10% (best model: Claude Sonnet 4.6 = 33.3%; GPT‑5.4 = 6.5%). The benchmark also provides deep, traceable failure diagnostics through a five‑layer recording and an agentic evaluator.
Key Points
- Scope and focus
  - 153 realistic, everyday, write‑heavy tasks (purchases, reservations, applications, form fills, etc.) across 144 production websites and 15 life categories.
  - Tasks were chosen to be completable by a human in under ~30 minutes and to generate observable HTTP payloads for objective verification.
- Real‑web evaluation with safe interception
  - Agents operate on production sites (cookie popups, dynamic JS, anti‑bot friction preserved).
  - A Chrome extension + CDP server intercepts and blocks only the final irreversible HTTP submission (manually annotated per task), preventing real side effects while preserving ecological validity.
- Five‑layer recording for traceability
  - Synchronized session video, per‑step screenshots, full HTTP traffic (including intercepted payloads), agent messages (reasoning traces and tool calls), and low‑level browser actions (clicks, keystrokes).
  - Human annotators produce ground‑truth runs under the same pipeline for step‑level alignment.
- Agentic Evaluator
  - An LLM (Claude Code) sub‑agent compares agent trajectories to human references using a structured rubric, producing binary pass/fail verdicts plus step‑level justifications and evidence.
- Empirical results (7 frontier models)
  - Large gap vs. sandbox benchmarks: models that score ~65–75% on WebArena/OSWorld achieve much lower success on ClawBench.
  - Best observed overall success: Claude Sonnet 4.6 = 33.3%. Other models range from ~24% down to near 0% (several models <10%).
  - Performance varies substantially by task category (some booking/finance tasks are easier for top models; many work/dev/academic/form‑heavy tasks remain very challenging).
- Open release
  - Data collection and evaluation pipeline are open‑sourced to enable community maintenance and expansion.
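As a rough illustration of how binary pass/fail verdicts turn into the percentages quoted above, the sketch below aggregates verdicts overall and per category. The `Verdict` fields and both function names are hypothetical and are not taken from the ClawBench release; the demo data is synthetic, and 51 is simply the count that yields 33.3% on 153 tasks, not a figure reported in the source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One evaluator outcome for one task (hypothetical field names)."""
    task_id: str
    category: str
    passed: bool


def pass_rate(verdicts):
    """Percent of tasks with a passing verdict, rounded to one decimal."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return round(100.0 * sum(v.passed for v in verdicts) / len(verdicts), 1)


def pass_rate_by_category(verdicts):
    """Group verdicts by category and score each group separately."""
    groups = defaultdict(list)
    for v in verdicts:
        groups[v.category].append(v)
    return {category: pass_rate(vs) for category, vs in groups.items()}


if __name__ == "__main__":
    # Synthetic data only: 153 tasks, of which the first 51 pass.
    demo = [Verdict(f"t{i}", "demo", i < 51) for i in range(153)]
    print(pass_rate(demo))  # 33.3
```

Reporting the per-category breakdown alongside the overall rate matters here because the summary notes that performance varies substantially across categories.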
Data & Methods
- Task curation
  - Each task consists of a natural‑language instruction, a start URL, and a terminal submission target (HTTP endpoint/payload schema).
  - Multi‑stage filtering excludes paid, geo‑restricted, or non‑reproducible tasks; final set: 153 tasks on 144 sites.
- Manual interception signal annotation
  - For each task, human experts inspected network traffic during the human run and recorded the exact request (URL pattern, method, required fields) that constitutes the irreversible submission. This yields high‑precision blocking of only the terminal action.
- Five‑layer synchronous recording
  - Session video (Xvfb + ffmpeg).
  - Per‑action screenshots (after each click/type/scroll).
  - Full HTTP logs via Chrome DevTools Protocol (including intercepted terminal payloads).
  - Agent messages (reasoning traces, tool calls) in structured JSON.
  - Low‑level browser events (click coordinates, keystrokes, tab switches).
- Human ground truth
  - Every task has a human reference trajectory recorded under the same five layers.
- Evaluation protocol
  - Agent trajectories are compared to human references by an Agentic Evaluator (Claude Code) that aligns steps, checks required fields/actions, and applies task‑specific verification criteria. It outputs a binary pass/fail verdict with structured justifications and citations to evidence.
- Models evaluated
  - Seven frontier models (Claude Sonnet 4.6, GPT‑5.4, Gemini variants, GLM‑5, Kimi K2.5). Scores are reported as the percentage of tasks passed.
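The task record and interception signal described above might be represented roughly as follows. This is a minimal sketch, assuming glob‑style URL patterns and dictionary payloads; the class names, field names, and `shop.example` URLs are hypothetical, and the source does not show how the actual Chrome extension + CDP server implements the match.

```python
import fnmatch
from dataclasses import dataclass


@dataclass(frozen=True)
class InterceptionSignal:
    """Annotated description of a task's irreversible submission request."""
    url_pattern: str            # glob pattern (assumption), e.g. ".../api/orders*"
    method: str                 # HTTP method, e.g. "POST"
    required_fields: frozenset  # payload keys that must all be present


@dataclass(frozen=True)
class Task:
    """Instruction + start URL + terminal submission target."""
    instruction: str
    start_url: str
    signal: InterceptionSignal


def is_terminal_submission(signal, url, method, payload):
    """True only if the request matches the signal on method, URL, and fields."""
    return (
        method.upper() == signal.method.upper()
        and fnmatch.fnmatchcase(url, signal.url_pattern)
        and signal.required_fields.issubset(payload.keys())
    )


if __name__ == "__main__":
    task = Task(
        instruction="Order one item and submit the checkout form.",
        start_url="https://shop.example/",
        signal=InterceptionSignal(
            url_pattern="https://shop.example/api/orders*",
            method="POST",
            required_fields=frozenset({"item_id", "address"}),
        ),
    )
    request = ("https://shop.example/api/orders", "POST",
               {"item_id": 7, "address": "1 Main St"})
    # A matching request would be recorded and blocked; everything else passes through.
    print(is_terminal_submission(task.signal, *request))  # True
```

Requiring all three criteria to match is what keeps the block narrow: ordinary page loads, searches, and cart updates on the same site fail at least one check and reach the live server unchanged.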
Implications for AI Economics
- Near‑term automation potential is more limited than sandbox benchmarks imply
  - High success rates on offline/sandbox tests likely overestimate the economically relevant capacity of agents to autonomously complete real user transactions. Economic forecasts that use sandbox performance as a proxy for deployable automation should be revised downward or qualified by real‑web robustness metrics.
- Productivity gains vs. deployment friction
  - Even if agents can speed parts of workflows, low end‑to‑end success rates imply substantial human supervision, correction, or task rework. This reduces net productivity gains and limits labor displacement for tasks requiring reliable end‑to‑end completion.
- Value of robustness and reliability
  - Economic value from automation depends not just on average success but on error costs, transaction frequency, and trust. For many high‑value transactions (financial transfers, job applications), false positives/negatives and partial failures impose outsized costs, increasing the need for human oversight and reducing willingness to substitute labor.
- Sectoral heterogeneity and labor reallocation
  - ClawBench shows large variance across task categories. Sectors with more standardized, API‑driven workflows (or simpler booking flows) will see earlier productivity gains; complex, form‑heavy, or creative tasks (job applications, academic workflows, dev workflows involving dynamic UIs) will lag. This suggests partial rather than uniform labor displacement and potential reallocation toward monitoring, correction, and higher‑complexity tasks.
- Cost‑benefit & risk accounting for adopters
  - Firms should incorporate real‑web success rates and traceability costs into adoption evaluations. A simple expectation model: expected net benefit ≈ (success_rate × task_value) − (error_rate × error_cost + supervision_cost). ClawBench provides empirical success_rate bounds for such calculations.
- Implications for investment and R&D priorities
  - Economic returns are likely higher from investments that improve real‑web robustness (dynamic DOM handling, anti‑bot navigation, multi‑step form completion), error detection, and human‑in‑the‑loop tooling than from optimizing performance on sandboxed benchmarks.
- Policy and regulation
  - Traceable evaluation (five‑layer logs, step‑level justifications) aligns with regulatory desires for auditability and accountability when agents act on behalf of users. Regulators and firms may require similar logging and human‑in‑the‑loop controls for safety‑critical deployments.
- Market design and new services
  - Given current performance, business models that combine agents with human oversight (e.g., agent pre‑fill + human confirm) are likely to be more viable than fully autonomous services. There is commercial opportunity in human‑agent orchestration platforms that minimize supervision cost while leveraging partial automation.
- Forecasting and macro impacts
  - Macro forecasts of automation‑driven labor substitution should use task‑level, real‑web benchmarks (like ClawBench) to calibrate timelines and penetration rates. Overreliance on sandbox metrics risks overforecasting job losses and underestimating the continued demand for human work in supervision and exception handling.
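The expectation model in the cost‑benefit bullet above can be sketched in a few lines. The source does not define `error_rate`; the sketch assumes it defaults to `1 - success_rate` (every non‑success counts as an error), and the break‑even helper follows from that assumption. All monetary values in the demo are made‑up placeholders, not figures from the paper.

```python
def expected_net_benefit(success_rate, task_value, error_cost,
                         supervision_cost, error_rate=None):
    """(success_rate * task_value) - (error_rate * error_cost + supervision_cost)."""
    if error_rate is None:
        # Assumption: any attempt that does not succeed is treated as an error.
        error_rate = 1.0 - success_rate
    for name, rate in (("success_rate", success_rate), ("error_rate", error_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return success_rate * task_value - (error_rate * error_cost + supervision_cost)


def break_even_success_rate(task_value, error_cost, supervision_cost):
    """Success rate at which net benefit is zero, assuming error_rate = 1 - success_rate.

    Solves s*V - (1 - s)*E - C = 0 for s, giving s = (E + C) / (V + E).
    """
    if task_value + error_cost <= 0:
        raise ValueError("task_value + error_cost must be positive")
    return (error_cost + supervision_cost) / (task_value + error_cost)


if __name__ == "__main__":
    # Placeholder economics: value 10 per completed task, cost 5 per error,
    # 1 per task in supervision. The success rate is the best ClawBench score.
    print(expected_net_benefit(0.333, task_value=10, error_cost=5, supervision_cost=1))
    print(break_even_success_rate(task_value=10, error_cost=5, supervision_cost=1))  # 0.4
```

With these placeholder numbers the net benefit at a 33.3% success rate is negative, because the break‑even rate of 40% lies above it; different cost assumptions would move that threshold.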
Summary takeaway: ClawBench provides a more realistic, auditable measure of agents' ability to "get things done" on the live web. Its results imply that economically meaningful automation of everyday online tasks remains limited today and that research, deployment, and policy should prioritize real‑web robustness, auditability, and human‑agent orchestration to capture feasible productivity gains while managing risk.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work. | Other | positive | high | benchmark_scope (number of tasks) | n=153; 0.3 |
| ClawBench spans 144 live platforms across 15 categories. | Other | positive | high | benchmark_scope (platforms and categories) | n=144; 0.3 |
| The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly. | Other | positive | high | task_complexity / capability_requirements | n=153; 0.18 |
| Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. | Other | positive | high | evaluation_realism / fidelity to real-world interactions | n=144; 0.3 |
| A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. | Other | positive | high | evaluation_safety (prevention of real-world side effects) | 0.18 |
| The authors evaluated 7 frontier models on ClawBench and found that both proprietary and open-source models can complete only a small portion of these tasks. | Automation Exposure | negative | high | task_completion_rate / automation_exposure (how many tasks models can complete) | n=7; 0.18 |
| Claude Sonnet 4.6 achieves only 33.3% (completion rate) on ClawBench. | Automation Exposure | negative | high | task_completion_rate (percentage of tasks completed) | n=153; 33.3%; 0.18 |
| Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants. | Other | positive | medium | long-term_agent_reliability / general-purpose_assistant_capability | 0.02 |