AI agents struggle to automate everyday online tasks: in a 153-task live-website benchmark, leading models complete only a minority of tasks (best at about one-third), highlighting major gaps before agents can reliably replace routine web-based work.

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen · April 09, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

ClawBench evaluates 153 everyday web tasks on live sites and finds that current state-of-the-art AI agents complete only a minority of tasks (best model ~33%), revealing substantial capability gaps for automating routine online work.

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

Summary

Main Finding

ClawBench introduces a realistic, safety‑preserving benchmark of 153 everyday, write‑heavy web tasks on 144 live websites and shows that state‑of‑the‑art agent models perform far worse on real web tasks than on prior sandbox benchmarks. Top models that achieve ~65–75% on existing offline/sandbox suites drop to between single digits and roughly one third on ClawBench (best model: Claude Sonnet 4.6 = 33.3%; GPT‑5.4 = 6.5%). The benchmark also supports detailed, traceable failure diagnostics through five‑layer recording and an agentic evaluator.

Key Points

Scope and focus
- 153 realistic, everyday, write‑heavy tasks (purchases, reservations, applications, form fills, etc.) across 144 production websites and 15 life categories.
- Tasks were chosen to be completable by a human in under ~30 minutes and to generate observable HTTP payloads for objective verification.
Real‑web evaluation with safe interception
- Agents operate on production sites (cookie popups, dynamic JS, anti‑bot friction preserved).
- A Chrome extension + CDP server intercepts and blocks only the final irreversible HTTP submission (manually annotated per task), preventing real side effects while preserving ecological validity.
Five‑layer recording for traceability
- Synchronized session video, per‑step screenshots, full HTTP traffic (including intercepted payloads), agent messages/chain‑of‑thought/tool calls, and low‑level browser actions (clicks, keystrokes).
- Human annotators produce ground‑truth runs under the same pipeline for step‑level alignment.
Agentic Evaluator
- An LLM (Claude Code) sub‑agent compares agent trajectories to human references using a structured rubric, producing binary pass/fail verdicts plus step‑level justifications and evidence.
Empirical results (7 frontier models)
- Strong gap vs sandbox benchmarks: models that saturate WebArena/OSWorld (~65–75%) achieve much lower success on ClawBench.
- Best observed overall success: Claude Sonnet 4.6 = 33.3%. Other models range from ~24% down to near 0% (several models <10%).
- Performance varies substantially by task category (some booking/finance tasks are easier for top models; many work/dev/academic/form‑heavy tasks remain very challenging).
Open release
- Data collection and evaluation pipeline are open‑sourced to enable community maintenance and expansion.

Data & Methods

Task curation
- Natural language instruction + start URL + terminal submission target (HTTP endpoint/payload schema).
- Multi‑stage filtering to exclude paid, geo‑restricted, or nonreproducible tasks; final: 153 tasks on 144 sites.
Manual interception signal annotation
- For each task, human experts inspected network traffic during the human run and recorded the exact request (URL pattern, method, required fields) that constitutes the irreversible submission. This yields high‑precision blocking of only the terminal action.
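The annotation described above (URL pattern, method, required fields) can be pictured as a small matching rule. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `InterceptionRule` class, the `should_block` function, and the example endpoint and field names are all hypothetical and serve only to illustrate how a request might be checked against an annotated terminal submission.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InterceptionRule:
    """Hypothetical annotation of one task's terminal submission request."""
    url_pattern: str   # regular expression matched against the request URL
    method: str        # HTTP method of the irreversible request, e.g. "POST"
    required_fields: frozenset = field(default_factory=frozenset)


def should_block(rule, url, method, payload):
    """Return True only if the request matches the annotated final submission."""
    if method.upper() != rule.method.upper():
        return False
    if re.search(rule.url_pattern, url) is None:
        return False
    # Every annotated field must be present in the outgoing payload.
    return rule.required_fields.issubset(payload.keys())


# Illustrative rule for an invented checkout endpoint (not from the paper).
rule = InterceptionRule(
    url_pattern=r"/api/orders/submit$",
    method="POST",
    required_fields=frozenset({"cart_id", "shipping_address"}),
)

final_request = {"cart_id": "c1", "shipping_address": "1 Main St"}
print(should_block(rule, "https://shop.example/api/orders/submit", "POST", final_request))
print(should_block(rule, "https://shop.example/api/cart/add", "POST", {"cart_id": "c1"}))
```

In this sketch, intermediate requests such as adding an item to a cart pass through untouched, and only the request matching all three annotated properties is blocked, which mirrors the stated goal of stopping the terminal action alone.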
Five‑layer synchronous recording
- Session video (Xvfb + ffmpeg).
- Per‑action screenshots (after each click/type/scroll).
- Full HTTP logs via Chrome DevTools Protocol (including intercepted terminal payloads).
- Agent messages (reasoning traces, tool calls) in structured JSON.
- Low‑level browser events (click coordinates, keystrokes, tab switches).
Human ground truth
- Every task has a human reference trajectory recorded under the same five layers.
Evaluation protocol
- Agent trajectories are compared to human references by an Agentic Evaluator (Claude Code) that aligns steps, checks required fields/actions, and applies task‑specific verification criteria. It outputs a binary pass/fail verdict with structured justifications and citations to evidence.
Models evaluated
- Seven frontier models (Claude Sonnet 4.6, GPT‑5.4, Gemini variants, GLM‑5, Kimi K2.5). Scores reported as percent of tasks passed.
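Because scores are reported as the percent of tasks passed, the headline numbers reduce to simple arithmetic over binary verdicts. The sketch below assumes one verdict per task; the `success_rate` helper is hypothetical, and the pass counts (51 and 10 out of 153) are back-computed from the reported 33.3% and 6.5% rather than taken from the paper.

```python
def success_rate(verdicts):
    """Percent of tasks whose evaluator verdict is a pass (True)."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)


TOTAL_TASKS = 153

# Assumed pass counts, chosen so the rates round to the reported figures.
assumed_passes = {"Claude Sonnet 4.6": 51, "GPT-5.4": 10}

for model, passes in assumed_passes.items():
    verdicts = [True] * passes + [False] * (TOTAL_TASKS - passes)
    print(f"{model}: {success_rate(verdicts):.1f}%")
```

Under these assumed counts the output is 33.3% and 6.5%, matching the figures quoted in the summary.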

Implications for AI Economics

Near‑term automation potential is more limited than sandbox benchmarks imply
- High success rates on offline/sandbox tests likely overestimate the economically relevant capacity of agents to autonomously complete real user transactions. Economic forecasts that use sandbox performance as a proxy for deployable automation should be revised downward or qualified by real‑web robustness metrics.
Productivity gains vs. deployment friction
- Even if agents can speed parts of workflows, low end‑to‑end success rates imply substantial human supervision, correction, or task rework. This reduces net productivity gains and limits labor displacement for tasks requiring reliable end‑to‑end completion.
Value of robustness and reliability
- Economic value from automation depends not just on average success but on error costs, transaction frequency, and trust. For many high‑value transactions (financial transfers, job applications), false positives/negatives and partial failures impose outsized costs, increasing the need for human oversight and reducing willingness to substitute labor.
Sectoral heterogeneity and labor reallocation
- ClawBench shows large variance across task categories. Sectors with more standardized, API‑driven workflows (or simpler booking flows) will see earlier productivity gains; complex, form‑heavy, or creative tasks (job applications, academic workflows, dev workflows involving dynamic UIs) will lag. This suggests partial rather than uniform labor displacement and potential reallocation toward monitoring, correction, and higher‑complexity tasks.
Cost‑benefit & risk accounting for adopters
- Firms should incorporate real‑web success rates and traceability costs into adoption evaluations. A simple expectation model: expected net benefit ≈ (success_rate × task_value) − (error_rate × error_cost + supervision_cost). ClawBench provides empirical success_rate bounds for such calculations.
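The expectation model above can be written out directly. This is an illustrative sketch only: the function name, the default of treating every non-success as an error (`error_rate = 1 - success_rate`), and all monetary values are assumptions introduced here, not figures from the paper. Only the 33.3% success rate comes from the reported results.

```python
def expected_net_benefit(success_rate, task_value, error_cost,
                         supervision_cost, error_rate=None):
    """Expected net benefit per task attempt.

    Implements: (success_rate * task_value)
                - (error_rate * error_cost + supervision_cost)
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    if error_rate is None:
        # Assumption: every attempt that does not succeed counts as an error.
        error_rate = 1.0 - success_rate
    return success_rate * task_value - (error_rate * error_cost + supervision_cost)


# Hypothetical per-task figures in arbitrary currency units.
benefit = expected_net_benefit(
    success_rate=0.333,    # best reported ClawBench result
    task_value=10.0,       # assumed value of a completed task
    error_cost=2.0,        # assumed cost of cleaning up a failed attempt
    supervision_cost=1.0,  # assumed cost of human review per attempt
)
print(f"{benefit:.2f}")  # 1.00
```

With these invented inputs the net benefit is only slightly positive, and small increases in the assumed error or supervision cost make it negative, which illustrates why low end-to-end success rates weigh so heavily on adoption decisions.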
Implications for investment and R&D priorities
- Economic returns are likely higher from investments that improve real‑web robustness (dynamic DOM handling, anti‑bot navigation, multi‑step form completion), error detection, and human‑in‑the‑loop tooling than from optimizing performance on sandboxed benchmarks.
Policy and regulation
- Traceable evaluation (five‑layer logs, step‑level justifications) aligns with regulatory desires for auditability and accountability when agents act on behalf of users. Regulators and firms may require similar logging and human‑in‑the‑loop controls for safety‑critical deployments.
Market design and new services
- Given current performance, business models that combine agents with human oversight (e.g., agent pre‑fill + human confirm) are likely to be more viable than fully autonomous services. There is commercial opportunity in human‑agent orchestration platforms that minimize supervision cost while leveraging partial automation.
Forecasting and macro impacts
- Macro forecasts of automation‑driven labor substitution should use task‑level, real‑web benchmarks (like ClawBench) to calibrate timelines and penetration rates. Overreliance on sandbox metrics risks overforecasting job losses and underestimating the continued demand for human work in supervision and exception handling.

Summary takeaway: ClawBench provides a more realistic, auditable measure of agents' ability to "get things done" on the live web. Its results imply that economically meaningful automation of everyday online tasks remains limited today and that research, deployment, and policy should prioritize real‑web robustness, auditability, and human‑agent orchestration to capture feasible productivity gains while managing risk.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides systematic, reproducible benchmark evidence across 153 real-world tasks on live websites and evaluates multiple frontier models, so it credibly measures current agent capabilities; however, it does not make causal claims about economic outcomes, evaluates a limited set of models and tasks, may be sensitive to transient website states, and lacks a clear human baseline or longitudinal robustness checks.

Methods Rigor: medium. The framework uses a thoughtful design (live-site evaluation, interception layer to avoid side effects, wide task coverage across 144 platforms and 15 categories) and runs multiple models, but rigor is limited by potential selection biases in task/platform choice, possible nondeterminism from dynamic web content, unclear scoring/verification procedures and inter-rater checks, and limited model/sample diversity and sensitivity analyses.

Sample: A curated suite (ClawBench) of 153 routine online tasks spanning 144 live platforms across 15 categories (e.g., purchases, appointments, job applications), evaluated on seven frontier models (both proprietary and open-source); evaluations run against production websites with a lightweight interception layer blocking only final submission requests to prevent side effects; reported model-level success rates (e.g., Claude Sonnet 4.6: 33.3%).

Themes: productivity, human_ai_collab, adoption

Generalizability
- Selection bias: tasks and platforms are curated and may not represent the full diversity of user routines or regional/sector-specific workflows.
- Temporal sensitivity: results depend on live websites which change over time, reducing reproducibility and long-term validity.
- Model-sample limitation: only seven models evaluated, so findings may not generalize to other architectures, newer releases, or specialized agents.
- Evaluation proxy: blocking only the final submission may alter agent behavior versus full end-to-end execution, limiting realism.
- Language/region bias: likely skew toward platforms and tasks in particular languages or countries (not necessarily global).
- Does not measure economic outcomes directly: capability gaps imply productivity limits but do not quantify economic impacts.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work.	Other	positive	high	benchmark_scope (number of tasks)	n=153 0.3
ClawBench spans 144 live platforms across 15 categories.	Other	positive	high	benchmark_scope (platforms and categories)	n=144 0.3
The tasks in ClawBench require demanding capabilities beyond existing benchmarks, such as extracting relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and completing write-heavy operations like filling many detailed forms correctly.	Other	positive	high	task_complexity / capability_requirements	n=153 0.18
Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction.	Other	positive	high	evaluation_realism / fidelity to real-world interactions	n=144 0.3
A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.	Other	positive	high	evaluation_safety (prevention of real-world side effects)	0.18
The authors evaluated 7 frontier models on ClawBench and found that both proprietary and open-source models can complete only a small portion of these tasks.	Automation Exposure	negative	high	task_completion_rate / automation_exposure (how many tasks models can complete)	n=7 0.18
Claude Sonnet 4.6 achieves only 33.3% (completion rate) on ClawBench.	Automation Exposure	negative	high	task_completion_rate (percentage of tasks completed)	n=153 33.3% 0.18
Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.	Other	positive	medium	long-term_agent_reliability / general-purpose_assistant_capability	0.02