AI agents can assemble basic financial spreadsheets competently, but struggle to deliver professional-quality models as workflows grow complex; even top systems often err on multi-step calculations and fall short of enterprise standards.

WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong · May 21, 2026

arXiv · descriptive · medium evidence · 8/10 relevance

Frontier LLM agents (led by the Claude family) can produce professional-looking spreadsheets for simple financial tasks but frequently fail to meet professional finance standards as task complexity and chained calculations increase.

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

Summary

Main Finding

WorkstreamBench introduces a benchmark and evaluation pipeline for LLM agents on end-to-end, multi-sheet spreadsheet tasks in finance. Current frontier agents (GUI and API) can produce plausible-looking workbooks, but even the best models (Claude Web in the paper) frequently fail to meet professional finance standards; performance degrades sharply as task difficulty and chaining of calculations increase. An LLM-based judge (validated against human experts) can reliably assess nuanced quality dimensions that exact-value checks miss.

Key Points

Problem gap: existing spreadsheet benchmarks focus on atomic edits or QA; they do not evaluate end-to-end construction of multi-sheet financial models that must be readable, auditable, and modifiable.
WorkstreamBench: a new benchmark of diverse end-to-end finance spreadsheet tasks drawn from Financial Modeling World Cup (FMWC), ModelOff, and Wall Street Prep (WSP). Tasks cover 3-statement models, DCF, scenario analysis, debt schedules, etc., and are annotated for difficulty (levels 1–5).
Scale: WorkstreamBench tasks are far larger than those in prior benchmarks: relative to SpreadsheetBench, they contain about 33× more cells on average and about 93× more function calls at the median.
Quality taxonomy: evaluation across three high-level dimensions:
- Accuracy (weight ~50%): correctness, completeness, sign consistency, final calculations.
- Formula (weight ~35%): logic readability/size, edge-case handling, avoidance of hardcoding, range hygiene, proper absolute references.
- Format (weight ~15%): workbook/sheet structure, readability, number notation, style consistency, presentation.
Evaluation method:
- LLM-as-judge pipeline: gives pass/fail per subcriterion and textual diagnostics in structured JSON. The judge is validated via synthetic perturbations and by comparison to 408 expert annotations.
- Collection pipeline: includes an automated Playwright-based recorder to interact with GUI-only spreadsheet agents (e.g., ChatGPT for Excel, Claude for Excel) and an internal agentic harness for API-accessible LLMs.
Judge performance: on human-annotated agent attempts the judge achieves ~0.92 accuracy, ~0.88 balanced accuracy, and ~0.85 F1 — sufficiently reliable to automate grading of nuanced criteria.
Representative failure modes the judge detects (and value checks miss): off-by-one aggregation range errors, hardcoding of computed columns/headers instead of using formulas, monolithic unreadable formulas rather than decomposed helper columns.
Model comparison: Claude Web outperforms other evaluated agents by a clear margin across Accuracy/Formula/Format. Nonetheless, even leading agents often fail important professional criteria.
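
As a rough illustration of how the three weighted dimensions could roll up into one number, the sketch below takes per-subcriterion pass/fail verdicts and computes a weighted composite. The weights are the approximate values quoted above; scoring each dimension by its pass rate is an assumption made for illustration, and the names `dimension_pass_rate` and `composite_score` are hypothetical, not the paper's.

```python
from typing import Dict, List

# Approximate dimension weights quoted in the summary (~50% / ~35% / ~15%).
WEIGHTS: Dict[str, float] = {"accuracy": 0.50, "formula": 0.35, "format": 0.15}


def dimension_pass_rate(verdicts: List[bool]) -> float:
    """Fraction of subcriteria that passed within one dimension (assumed aggregation)."""
    if not verdicts:
        raise ValueError("each dimension needs at least one subcriterion verdict")
    return sum(verdicts) / len(verdicts)


def composite_score(
    verdicts: Dict[str, List[bool]],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weighted average of per-dimension pass rates, normalized by total weight."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[dim] * dimension_pass_rate(verdicts[dim]) for dim in weights
    )
    return weighted / total_weight


# Hypothetical judge output for one agent attempt.
example = {
    "accuracy": [True, True, False, True],        # 3 of 4 subcriteria pass
    "formula": [True, False, False, True, True],  # 3 of 5 subcriteria pass
    "format": [True, True, True, True, True],     # all subcriteria pass
}
score = composite_score(example)
```

Normalizing by the total weight keeps the score in the range 0 to 1 even if the weights are adjusted and no longer sum to one.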

Data & Methods

Task sources: curated problem set from FMWC, ModelOff, WSP to reflect real workplace modeling expectations.
Task annotations: manual labeling of task type and difficulty (1–5).
Agent pool: a mix of proprietary GUI agents (e.g., ChatGPT for Excel, Claude for Excel/Web) and API LLMs run under a consistent agentic harness; attempts collected via Playwright for GUI-based tools and via API for others.
Evaluation rubric: granular sub-dimensions mapping to professional spreadsheet standards (examples and edge cases in Appendix). Subcriteria produce pass/fail decisions plus natural-language diagnostics.
Judge validation:
- Synthetic perturbations: inject targeted errors into gold solutions to test judge sensitivity to each error class.
- Real agent outputs: compare judge labels to expert human annotations on 408 attempts; the resulting accuracy, balanced accuracy, and F1 are the figures reported under Key Points.
Scoring: weighted combination of sub-dimensions to produce composite scores (weights chosen to reflect finance practice priorities; adjustable).
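
The synthetic-perturbation step in the judge validation can be pictured with a small sketch that injects one targeted error class, an off-by-one aggregation range, into a gold solution. Everything here is an assumption for illustration: the workbook is modeled as a plain dictionary from cell address to formula text, and `inject_off_by_one` and `perturb_workbook` are hypothetical helpers rather than the paper's tooling.

```python
import re
from typing import Dict

# Matches a simple range inside SUM, e.g. SUM(B2:B13).
_SUM_RANGE = re.compile(r"SUM\(([A-Z]+)(\d+):([A-Z]+)(\d+)\)")


def inject_off_by_one(formula: str) -> str:
    """Shrink the end row of the first SUM range by one row."""

    def shrink(match: re.Match) -> str:
        col_a, row_a = match.group(1), int(match.group(2))
        col_b, row_b = match.group(3), int(match.group(4))
        if row_b <= row_a:
            # A single-row range cannot be shrunk; leave it unchanged.
            return match.group(0)
        return f"SUM({col_a}{row_a}:{col_b}{row_b - 1})"

    return _SUM_RANGE.sub(shrink, formula, count=1)


def perturb_workbook(cells: Dict[str, str], target: str) -> Dict[str, str]:
    """Return a copy of the workbook with the error injected into one cell."""
    perturbed = dict(cells)
    perturbed[target] = inject_off_by_one(cells[target])
    return perturbed


# Hypothetical gold solution with two column totals.
gold = {"B14": "=SUM(B2:B13)", "C14": "=SUM(C2:C13)"}
broken = perturb_workbook(gold, "B14")
```

A judge that is sensitive to this error class should fail the perturbed copy on the relevant subcriterion while still passing the untouched gold solution.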

Implications for AI Economics

Productivity vs. risk: LLM agents have clear potential to speed routine spreadsheet creation in finance, but current models are not yet reliable for unsupervised production of professional-grade models. Automating such workflows without verification risks silently introducing logic errors (off-by-one, hardcoding) that can materially change economic decisions.
Need for auditability standards: professional spreadsheet delivery emphasizes readability, modular formulas, dynamic references — properties that facilitate rapid stakeholder review and safe reuse. Benchmarks and product design should prioritize these non-numeric qualities, not just output values.
Role of automated judges / verification tooling: an LLM-as-judge approach (validated here) can scale nuanced auditing that exact-match checks miss. Such automated judges could become a practical control in finance automation pipelines (pre-commit checks, continuous auditing, red-flagging hardcoded results).
Research directions: improving agent reliability on chained calculations, dynamic references, and handling of edge cases; better tool integration (true formula-level manipulation rather than pasting values); techniques for producing readable, decomposed computations (e.g., enforced helper columns); adversarial testing and formal verification for spreadsheet logic.
Benchmark utility: WorkstreamBench provides a realistic evaluation target that better aligns research tasks to economically important workflows (valuation, forecasting, scenario analysis). Progress on this benchmark would more directly translate into safer productivity gains in finance.
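
To make the idea of red-flagging hardcoded results concrete, here is a minimal rule-based sketch of such a control. It assumes a sheet is available as a mapping from cell address to raw cell text and that the reviewer supplies the addresses expected to contain formulas; `is_formula` and `flag_hardcoded` are hypothetical names, and a check like this would complement an LLM judge rather than reproduce the paper's pipeline.

```python
from typing import Dict, Iterable, List


def is_formula(cell_text: str) -> bool:
    """Treat any cell whose text starts with '=' as a formula."""
    return cell_text.strip().startswith("=")


def flag_hardcoded(
    cells: Dict[str, str],
    expected_formula_cells: Iterable[str],
) -> List[str]:
    """Return addresses that should hold formulas but hold literals or are empty."""
    flagged = []
    for address in expected_formula_cells:
        text = cells.get(address, "")
        if not is_formula(text):
            flagged.append(address)
    return flagged


# Hypothetical revenue column: D3 was pasted as a value instead of computed.
sheet = {
    "D2": "=B2*C2",
    "D3": "1250",
    "D4": "=B4*C4",
    "D5": "=SUM(D2:D4)",
}
flags = flag_hardcoded(sheet, ["D2", "D3", "D4", "D5"])
```

A value-only comparison would accept `D3` whenever the pasted number happens to match, which is why a structural check of this kind catches errors that exact-match grading misses.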


Assessment

Paper Type: descriptive
Evidence Strength: medium. Provides direct empirical evaluation of current LLM agents on realistic end-to-end spreadsheet tasks using a purpose-built taxonomy, so it gives useful, task-level evidence about capability; however, evidence is limited by likely small/selected task set, qualitative judgments, proprietary model versions, and no causal or large-scale statistical testing, so findings are indicative rather than definitive.
Methods Rigor: medium. The paper develops a clear multi-dimensional evaluation taxonomy (Accuracy, Formula, Format) and conducts qualitative expert review, which are strengths; but the methodology appears to rely on subjective judgments, may lack reported inter-rater reliability, quantitative scale granularity, large-sample statistical analysis, and transparency about task selection and model versions, reducing reproducibility and robustness.
Sample: A benchmark of end-to-end spreadsheet tasks focused on financial workflows (financial modeling, forecasting, scenario analysis), evaluated across multiple state-of-the-art LLM agent families (including the Claude family); tasks vary in difficulty and require chained calculations and deliverable formatting; outcomes assessed using a rubric of Accuracy, Formula correctness, and Format/readability, with qualitative expert review.
Themes: productivity, human_ai_collab
Generalizability:
- Limited to finance-oriented spreadsheet tasks; other domains (e.g., engineering, scientific data) may differ.
- Evaluated agent families may be proprietary and version-dependent, so results may not hold for other or future model versions.
- Task set may be small or selectively constructed and might not represent the full range of real-world enterprise workflows.
- Scoring involves subjective judgments (readability, ease of modification) which may vary across reviewers and organizations.
- Benchmark assumes particular spreadsheet software/format conventions which could affect portability to different enterprise environments.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions.	Other	positive	high	expectations of agent capabilities (trend)	0.03
Frontier AI labs have developed agents that can construct entire spreadsheets from scratch.	Output Quality	positive	high	agent capability to construct spreadsheets	0.09
Existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits.	Adoption Rate	negative	high	coverage of benchmark tasks (end-to-end spreadsheet construction vs QA/single-formula)	0.18
We provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis.	Research Productivity	positive	high	existence of evaluation on end-to-end spreadsheet tasks	0.18
We develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards.	Other	positive	high	evaluation criteria/taxonomy	0.3
The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review.	Output Quality	positive	high	output professionalism/quality	0.18
Even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations.	Output Quality	negative	high	performance relative to professional standards as task complexity increases	0.18
Current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.	Output Quality	negative	high	reliability of producing professional-quality spreadsheets for real-world complexity	0.18