A live benchmark shows state-of-the-art LLM agents complete at most two-thirds of realistic workflow tasks: the top model passes 66.7% of 105 controlled tasks. Failures cluster in HR, management and multi-system business workflows, indicating end-to-end workflow automation remains far from solved.

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan · April 30, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 8/10

Claw-Eval-Live is a refreshable, reproducible benchmark of 105 controlled workflow tasks that finds the best frontier agent passes only 66.7% of tasks and that failures are concentrated in HR, management, and multi-system business workflows.

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.

Summary

Main Finding

Claw-Eval-Live is a time-stamped, refreshable benchmark for end-to-end workflow agents that (1) sources task mixtures from public workflow demand signals (ClawHub Top‑500) and (2) grades agents by verifiable execution evidence (traces, audit logs, post-run artifacts) rather than final-text plausibility alone. In the current public snapshot (105 executable tasks spanning service-backed business workflows and local workspace repair), 13 frontier models were evaluated and the best model passed only 66.7% of tasks. Reliable workflow automation is therefore far from solved, and failures cluster by task family and execution surface.

Key Points

Design principle: “grounded twice”
- Ground 1: Task mix is calibrated to fresh external demand signals (ClawHub Top‑500) so releases track evolving workflows.
- Ground 2: Task success is anchored in observable execution evidence (tool traces, service audit logs, workspace state), with structured LLM judging only for semantic aspects not covered by deterministic checks.
Release snapshot
- Current public snapshot: 105 tasks, 22 fine-grained families, 18 controlled services, 13 public models evaluated.
- Execution surfaces: 87 service-backed workflows (CRM, finance, email, calendar, helpdesk, KBs, multi-system coordination) and 18 workspace-repair tasks (terminal, file edits, tests).
Construction pipeline (signal → snapshot)
- Stage 1: Collect time-stamped ClawHub Top‑500 skill signals.
- Stage 2: Cluster signals into workflow patterns preserving execution-relevant distinctions.
- Stage 3: Convert patterns into family weights (distributional prior).
- Stage 4: Seed expansion and implementation into executable candidate tasks; pilot screening for runnability and discrimination.
- Stage 5: Discrimination-aware public selection: from 157 screened candidates, choose a public subset of size N using a MILP that balances release size, family coverage, and preservation of pilot-model ordering while excluding zero-discrimination tasks.
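
Stages 2 and 3 can be pictured as counting how many demand signals fall into each workflow pattern and normalizing those counts into a prior. The following is a minimal sketch under stated assumptions: the family labels, the signal counts, and the largest-remainder allocation are all hypothetical, since the summary does not give the actual weighting formula.

```python
from collections import Counter


def family_weights(signal_families):
    """Normalize per-family signal counts into weights that sum to 1."""
    counts = Counter(signal_families)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


def allocate_tasks(weights, release_size):
    """Turn weights into integer task quotas using largest remainders.

    Every family starts with one task (an assumed coverage floor), and the
    rest of the release is split in proportion to the weights.
    Assumes release_size >= number of families.
    """
    quotas = {family: 1 for family in weights}
    remaining = release_size - len(quotas)
    ideal = {family: w * remaining for family, w in weights.items()}
    for family in quotas:
        quotas[family] += int(ideal[family])
    # Hand out any seats lost to rounding, largest fractional part first.
    leftover = release_size - sum(quotas.values())
    by_remainder = sorted(
        weights, key=lambda f: ideal[f] - int(ideal[f]), reverse=True
    )
    for family in by_remainder[:leftover]:
        quotas[family] += 1
    return quotas


# Hypothetical signals: each entry is the family assigned to one skill signal.
signals = ["crm"] * 5 + ["hr"] * 3 + ["workspace_repair"] * 2
weights = family_weights(signals)
quotas = allocate_tasks(weights, release_size=13)
print(quotas)
```

In this toy setup the weights act only as a distributional prior; the final public subset would still be chosen by the selection step in Stage 5.
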
Grading methodology
- Primary evidence sources: data retrieval (tool discipline), data accuracy (entities/numbers vs ground-truth), and action verification (state-changing writes, workspace fixes).
- Hybrid graders: deterministic checks first; structured LLM judges only for constrained semantic/organizational criteria.
- Typical run settings: default budget = 24 turns, 300 seconds (longer when necessary for repair tasks); no model-specific prompt tuning.
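
The default run budget can be pictured as a thin loop around the agent. This is a minimal sketch, assuming a hypothetical `step` callable that performs one agent turn and returns `True` when the agent declares it is finished; the benchmark's actual harness is not described at this level of detail.

```python
import time


def run_with_budget(step, max_turns=24, max_seconds=300.0, clock=time.monotonic):
    """Call step(turn) until it returns True or a budget is exhausted.

    Returns (status, turns_used), where status is 'done', 'turn_limit',
    or 'timeout'. The wall-clock check happens before each turn, so a
    turn that is already running is not interrupted.
    """
    start = clock()
    for turn in range(1, max_turns + 1):
        if clock() - start >= max_seconds:
            return "timeout", turn - 1
        if step(turn):
            return "done", turn
    return "turn_limit", max_turns


# Hypothetical agent that finishes on its fifth turn.
status, turns = run_with_budget(lambda turn: turn == 5)
print(status, turns)
```

Passing `clock` as a parameter is a design choice for testability: a fake clock can simulate a slow run without waiting 300 seconds.
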
Empirical findings
- Best model pass rate = 66.7%; no model reached 70%.
- Service-backed, multi-system business workflows (HR, management, cross‑system coordination) remain harder than local workspace repair.
- Leaderboard rank (pass rate) alone is insufficient: models with similar pass rates can diverge in completion patterns; most discrimination occurs in a middle band of tasks.

Data & Methods

Signal source: ClawHub Top‑500 snapshot (external, time-stamped).
Candidate pool: 157 screened runnable candidate tasks after pilot screening.
Public release selection: mixed-integer linear program (MILP)
- Binary selection variables x_t ∈ {0,1} for each task t.
- Objective maximizes preservation of pilot-model ordering p_t^{(i,j)} across top-K pilot models.
- Constraints: fixed release size N, at least one task per fine-grained family C_f, exclusion of zero-discrimination tasks.
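
The selection problem can be illustrated on a toy instance. The sketch below swaps the MILP solver for exhaustive search over subsets, which gives the same optimum on a handful of tasks but would not scale to 157 candidates. It also assumes that each task carries a single integer score counting the pilot-model pairs whose ordering it preserves; the task IDs, families, and scores are hypothetical.

```python
from itertools import combinations


def select_release(tasks, release_size):
    """Pick the subset that maximizes summed ordering-preservation scores.

    tasks: dict task_id -> (family, score), where score is the assumed
    number of pilot-model pairs whose ordering the task preserves.
    Constraints: exactly release_size tasks, every family covered, and
    zero-discrimination tasks (score == 0) excluded.
    Returns (subset, value), or (None, -1) if no subset is feasible.
    """
    eligible = {t: fs for t, fs in tasks.items() if fs[1] > 0}
    families = {family for family, _ in tasks.values()}
    best_subset, best_value = None, -1
    for subset in combinations(sorted(eligible), release_size):
        covered = {eligible[t][0] for t in subset}
        if covered != families:
            continue  # violates the one-task-per-family constraint
        value = sum(eligible[t][1] for t in subset)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value


# Hypothetical candidate pool.
tasks = {
    "t1": ("hr", 3),
    "t2": ("hr", 1),
    "t3": ("crm", 2),
    "t4": ("crm", 0),  # zero discrimination, never selected
    "t5": ("repair", 1),
    "t6": ("crm", 3),
}
chosen, value = select_release(tasks, release_size=4)
print(chosen, value)
```

Here `t5` is forced into the release because it is the only task covering its family, even though its score is low; the coverage constraint can override the pure ordering objective.
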
Execution environment: each task packaged with YAML task definition, fixtures, tool schemas (RESTful controlled services), and task-specific grader (grader.py). Controlled services record audit logs; workspaces are sandboxed and preserve post-run artifacts.
Grading
- Deterministic checks where possible (tool call logs, audit trails, exact-match of entities/numbers, post-run tests).
- Structured LLM judging only when required, constrained by explicit rubrics.
- Three recurring grader archetypes:
  - Evidence-plus-judge for analytical tasks (deterministic checks + semantic rubric judge).
  - Operation verification for service workflows (heavy emphasis on audit trail checks).
  - Workspace repair verification (command traces + post-run artifact/tests).
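
The evidence-plus-judge pattern can be sketched as a function that runs deterministic checks first and consults a judge only when they all pass. This is an illustration, not the benchmark's `grader.py`: the evidence layout, the `/tickets` path, the equal 0.5/0.5 weighting, and the rule that a failed check skips the judge are all assumptions.

```python
def grade(evidence, checks, judge=None, rubric=None):
    """Hybrid grading: deterministic checks first, judge only for semantics.

    evidence: dict of recorded run evidence (audit log, final text, ...).
    checks: list of (name, predicate) deterministic checks over the evidence.
    judge: optional callable(evidence, rubric) -> float in [0, 1], standing
           in for a structured LLM judge. Skipped if any check fails.
    Returns (score in [0, 1], per-check results).
    """
    results = {name: bool(pred(evidence)) for name, pred in checks}
    passed = sum(results.values())
    deterministic = passed / len(checks) if checks else 1.0
    if judge is None or passed < len(checks):
        return deterministic, results
    # Clamp the judge output so a misbehaving judge cannot leave [0, 1].
    semantic = max(0.0, min(1.0, judge(evidence, rubric)))
    return 0.5 * deterministic + 0.5 * semantic, results


# Hypothetical evidence from a helpdesk workflow run.
evidence = {
    "audit_log": [{"method": "POST", "path": "/tickets", "status": 201}],
    "final_text": "Created the ticket and notified the requester.",
}
checks = [
    (
        "wrote_ticket",
        lambda e: any(
            r["method"] == "POST" and r["status"] == 201 for r in e["audit_log"]
        ),
    ),
]
score, results = grade(
    evidence, checks, judge=lambda e, r: 0.8, rubric="summary names the ticket"
)
print(score, results)
```

Gating the judge on the deterministic checks means a fluent final message cannot compensate for a write that never reached the service.
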
Evaluation protocol: same prompts, tools, fixtures for all models; runs recorded (tool calls, tokens, wall time, artifacts); per-task score in [0,1].
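
Because each task yields a score in [0,1], a pass rate and a mean completion score can be computed from the same runs, and the two need not agree. The sketch below assumes a hypothetical pass threshold of 0.8 (the summary mentions a shared public pass rule but not its value) and uses made-up scores for two models.

```python
def summarize(scores, pass_threshold=0.8):
    """Return (pass_rate, mean_score) for per-task scores in [0, 1].

    pass_threshold is an assumed value, not the benchmark's published rule.
    """
    passes = sum(1 for s in scores if s >= pass_threshold)
    return passes / len(scores), sum(scores) / len(scores)


# Two hypothetical models with identical pass rates but different completion.
model_a = [1.0, 1.0, 0.7, 0.6]  # near misses on the failed tasks
model_b = [1.0, 1.0, 0.1, 0.0]  # failed tasks are barely started
rate_a, mean_a = summarize(model_a)
rate_b, mean_b = summarize(model_b)
print(rate_a, rate_b, mean_a, mean_b)

# Arithmetic on the headline figure: 70 is the only whole number of passes
# out of 105 that rounds to 66.7%.
headline = round(100 * 70 / 105, 1)
print(headline)
```

Both toy models pass half their tasks, yet the first gets much further on the tasks it fails, which is the kind of divergence a pass-rate leaderboard hides.
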

Implications for AI Economics

Benchmarking that tracks demand: Using public signal priors (ClawHub) aligns evaluation mixes with what users currently try to automate—important for estimating real-world economic impact of agent automation (which workflows are most valuable/urgent).
Productivity vs reliability trade-offs: Reported pass rates (best = 66.7%) imply substantial remaining friction. Economic adoption of agents will depend not only on average task completion but on verifiable correctness for high-value workflows (especially HR, approvals, cross‑system actions), affecting firms’ willingness to substitute human labor or reassign tasks.
Deployment & contracting: Action-grounded evidence (audit logs, post-run artifacts) enables enforceable SLAs, liability allocation, and measurable performance-based pricing—critical for enterprise procurement and commercial contracts for agent-as-a-service.
Sectoral heterogeneity in automation potential: The finding that HR, management, and multi-system coordination are harder suggests uneven automation substitutability across job tasks; simpler, localized repair/technical tasks may see earlier automation-driven productivity gains than complex socio-technical workflows.
Investment signals: Firms and model providers may prioritize improving cross-system state manipulation, reliable writes, and action verification (engineering effort and R&D), rather than only improving fluent text generation—this steers capital allocation in AI tooling and integration services.
Benchmark refresh cadence matters for economic measurement: A static task mix becomes stale relative to shifting enterprise needs; time-stamped, refreshable benchmarks permit more accurate tracking of model progress against evolving market demand—important for ex ante ROI forecasts and policy analyses.
Risk and compliance externalities: The benchmark’s emphasis on auditable execution highlights regulatory and reputational risks from silent failures (plausible outputs without correct actions). Economically, this increases the value of trustworthy logging, verification tooling, and insurance products.
Procurement and specialization: Since general leaderboard rank is insufficient, buyers should evaluate agents on a task-family basis (or require domain-specific benchmarks). This favors niche/specialized agent stacks and integration services over monolithic general-purpose models for enterprise deployments.

Summary: Claw-Eval-Live provides a principled, reproducible way to measure how well agents actually "do the work" demanded in the field, not just whether they write plausible reports. For AI economics, it sharpens measurement of where automation value and risk lie, which in turn affects firm adoption, contracting, investment priorities, and sectoral labor impacts.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This paper presents a benchmark and empirical performance measurements of workflow agents rather than testing causal relationships or estimating causal effects; it does not attempt identification of causal impacts on economic outcomes.
Methods Rigor: medium. Design is methodical: tasks are materialized as reproducible, time-stamped release snapshots; execution traces, audit logs, service state, and artifacts are recorded; deterministic checks are used where possible and structured LLM judging for semantic dimensions. However, limitations include a relatively small curated task set (105 tasks) drawn from ClawHub Top-500 skills that may introduce selection bias, reliance on structured LLM judges for some grading which can introduce subjectivity, and evaluation of only 13 frontier models which constrains coverage of agent diversity and deployment environments.
Sample: A release snapshot of 105 controlled tasks constructed from public workflow-demand signals (ClawHub Top-500 skills), covering controlled business services and local workspace repair scenarios; tasks use fixed fixtures, services, workspaces, and graders; evaluation runs 13 frontier LLM-based agents under a shared public pass rule while recording execution traces, audit logs, service state, and post-run artifacts for deterministic and LLM-mediated grading.
Themes: productivity, human_ai_collab
Generalizability
- Task set limited to 105 controlled tasks and may not represent full diversity of real-world workflows.
- Task selection based on ClawHub Top-500 skills introduces selection bias toward popular/curated workflows.
- Controlled fixtures and simulated services may not capture complexities, failures, latency, or integrations of production environments.
- Evaluation includes only 13 frontier models, limiting inference to other architectures, smaller models, or proprietary deployments.
- Structured LLM judging for semantic criteria may introduce grader variability and domain-specific misjudgments.
- Snapshot-based releases may age as public workflow demand and tools evolve.
- Likely language/region and domain biases depending on underlying ClawHub signals and task formulation.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces.	Task Allocation	positive	high	ability to complete end-to-end units of work	0.03
Many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed.	Governance And Regulation	negative	high	benchmark design adequacy for evolving workflow demand and execution verifiability	0.18
We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot.	Other	positive	high	benchmark design (refreshable signal layer vs. time-stamped snapshot)	0.3
Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders.	Other	positive	high	composition of benchmark releases (source signals and materialization strategy)	n=500 0.3
For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions.	Governance And Regulation	positive	high	grading/verifiability pipeline (traces, logs, deterministic checks, structured LLM judging)	0.3
The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule.	Other	positive	high	benchmark scope (number of tasks) and evaluation breadth (number of models)	n=105 0.3
Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%.	Task Allocation	negative	high	task pass rate (task completion success)	n=105 66.7% pass rate (leading model); no model reaches 70% 0.18
Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated.	Error Rate	mixed	high	failure distribution by task family / execution surface	n=105 0.18
Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks.	Adoption Rate	negative	high	correspondence between leaderboard rank, pass rate, and overall completion; task-level discrimination distribution	n=13 0.18
Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.	Governance And Regulation	positive	high	evaluation grounding (use of fresh external demand signals and verifiable agent actions)	0.03