AI agents can assemble basic financial spreadsheets competently, but struggle to deliver professional-quality models as workflows grow complex; even top systems often err on multi-step calculations and fall short of enterprise standards.
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.
Summary
Main Finding
WorkstreamBench introduces a benchmark and evaluation pipeline for LLM agents on end-to-end, multi-sheet spreadsheet tasks in finance. Current frontier agents (GUI and API) can produce plausible-looking workbooks, but even the best models (Claude Web in the paper) frequently fail to meet professional finance standards; performance degrades sharply as task difficulty and chaining of calculations increase. An LLM-based judge (validated against human experts) can reliably assess nuanced quality dimensions that exact-value checks miss.
Key Points
- Problem gap: existing spreadsheet benchmarks focus on atomic edits or QA; they do not evaluate end-to-end construction of multi-sheet financial models that must be readable, auditable, and modifiable.
- WorkstreamBench: a new benchmark of diverse end-to-end finance spreadsheet tasks drawn from Financial Modeling World Cup (FMWC), ModelOff, and Wall Street Prep (WSP). Tasks cover 3-statement models, DCF, scenario analysis, debt schedules, etc., and are annotated for difficulty (levels 1–5).
- Scale: WorkstreamBench tasks are far larger than those in prior benchmarks: roughly 33× more cells on average and roughly 93× more function calls at the median than SpreadsheetBench.
- Quality taxonomy: evaluation across three high-level dimensions:
  - Accuracy (weight ~50%): correctness, completeness, sign consistency, final calculations.
  - Formula (weight ~35%): logic readability/size, edge-case handling, avoidance of hardcoding, range hygiene, proper absolute references.
  - Format (weight ~15%): workbook/sheet structure, readability, number notation, style consistency, presentation.
- Evaluation method:
  - LLM-as-judge pipeline: gives pass/fail per subcriterion and textual diagnostics in structured JSON. The judge is validated via synthetic perturbations and by comparison to 408 expert annotations.
  - Collection pipeline: includes an automated Playwright-based recorder to interact with GUI-only spreadsheet agents (e.g., ChatGPT for Excel, Claude for Excel) and an internal agentic harness for API-accessible LLMs.
- Judge performance: on human-annotated agent attempts, the judge achieves ~0.92 accuracy, ~0.88 balanced accuracy, and ~0.85 F1, which is reliable enough to automate grading of nuanced criteria.
- Representative failure modes the judge detects (and value checks miss): off-by-one aggregation range errors, hardcoding of computed columns/headers instead of using formulas, monolithic unreadable formulas rather than decomposed helper columns.
- Model comparison: Claude Web outperforms other evaluated agents by a clear margin across Accuracy/Formula/Format. Nonetheless, even leading agents often fail important professional criteria.
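The three judge-agreement figures above (accuracy, balanced accuracy, F1) can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function name `agreement_metrics`, the toy labels, and the choice of "pass" as the positive class for F1 are all assumptions made for illustration.

```python
def agreement_metrics(judge, expert):
    """Compare judge verdicts with expert verdicts (True = pass).

    Assumption: "pass" is treated as the positive class; the source
    does not state which class its F1 score uses.
    """
    if len(judge) != len(expert) or not judge:
        raise ValueError("label lists must be non-empty and equal in length")

    pairs = list(zip(judge, expert))
    tp = sum(1 for j, e in pairs if j and e)          # both say pass
    tn = sum(1 for j, e in pairs if not j and not e)  # both say fail
    fp = sum(1 for j, e in pairs if j and not e)      # judge too lenient
    fn = sum(1 for j, e in pairs if not j and e)      # judge too strict

    recall = tp / (tp + fn) if tp + fn else 0.0       # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    return {
        "accuracy": (tp + tn) / len(pairs),
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": f1,
    }


# Hypothetical labels for six subcriterion decisions.
judge_labels = [True, True, False, False, True, False]
expert_labels = [True, False, False, False, True, True]
metrics = agreement_metrics(judge_labels, expert_labels)
```

Balanced accuracy averages the pass and fail recall rates, so it stays informative when one verdict dominates the annotations, which may be why it is reported alongside plain accuracy.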
Data & Methods
- Task sources: curated problem set from FMWC, ModelOff, WSP to reflect real workplace modeling expectations.
- Task annotations: manual labeling of task type and difficulty (1–5).
- Agent pool: a mix of proprietary GUI agents (e.g., ChatGPT for Excel, Claude for Excel/Web) and API LLMs run under a consistent agentic harness; attempts collected via Playwright for GUI-based tools and via API for others.
- Evaluation rubric: granular sub-dimensions mapping to professional spreadsheet standards (examples and edge cases in Appendix). Subcriteria produce pass/fail decisions plus natural-language diagnostics.
- Judge validation:
  - Synthetic perturbations: inject targeted errors into gold solutions to test judge sensitivity to each error class.
  - Real agent outputs: compare judge labels to expert human annotations on 408 attempts, yielding the accuracy, balanced accuracy, and F1 figures reported above.
- Scoring: weighted combination of sub-dimensions to produce composite scores (weights chosen to reflect finance practice priorities; adjustable).
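The weighted scoring step can be sketched as follows. The dimension weights come from the approximate figures in the summary (50/35/15); everything else is an assumption: the function name `composite_score`, the subcriterion keys, and in particular the rule that a dimension's score is the fraction of its subcriteria that pass, which the source does not specify.

```python
# Approximate dimension weights as stated in the summary.
WEIGHTS = {"accuracy": 0.50, "formula": 0.35, "format": 0.15}


def composite_score(verdicts, weights=WEIGHTS):
    """Combine pass/fail subcriterion verdicts into one score in [0, 1].

    Assumption: each dimension is scored as its pass rate, then the
    dimensions are combined with a normalized weighted average.
    """
    total_weight = sum(weights.values())
    score = 0.0
    for dimension, weight in weights.items():
        results = verdicts[dimension]
        if not results:
            raise ValueError(f"no subcriteria recorded for {dimension!r}")
        pass_rate = sum(results.values()) / len(results)
        score += weight * pass_rate
    return score / total_weight


# Hypothetical judge verdicts for one agent attempt.
example_verdicts = {
    "accuracy": {
        "correctness": True,
        "completeness": True,
        "sign_consistency": False,
        "final_calculations": True,
    },
    "formula": {
        "readability": True,
        "edge_cases": False,
        "no_hardcoding": False,
        "range_hygiene": True,
        "absolute_references": True,
    },
    "format": {
        "structure": True,
        "readability": True,
        "number_notation": True,
        "style_consistency": False,
        "presentation": True,
    },
}
example_score = composite_score(example_verdicts)
```

Dividing by the total weight keeps the result in [0, 1] even if the weights are adjusted and no longer sum to one, which fits the summary's note that the weights are adjustable.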
Implications for AI Economics
- Productivity vs. risk: LLM agents have clear potential to speed routine spreadsheet creation in finance, but current models are not yet reliable for unsupervised production of professional-grade models. Automating such workflows without verification risks silently introducing logic errors (off-by-one, hardcoding) that can materially change economic decisions.
- Need for auditability standards: professional spreadsheet delivery emphasizes readability, modular formulas, dynamic references — properties that facilitate rapid stakeholder review and safe reuse. Benchmarks and product design should prioritize these non-numeric qualities, not just output values.
- Role of automated judges / verification tooling: an LLM-as-judge approach (validated here) can scale nuanced auditing that exact-match checks miss. Such automated judges could become a practical control in finance automation pipelines (pre-commit checks, continuous auditing, red-flagging hardcoded results).
- Research directions: improving agent reliability on chained calculations, dynamic references, and handling of edge cases; better tool integration (true formula-level manipulation rather than pasting values); techniques for producing readable, decomposed computations (e.g., enforced helper columns); adversarial testing and formal verification for spreadsheet logic.
- Benchmark utility: WorkstreamBench provides a realistic evaluation target that better aligns research tasks to economically important workflows (valuation, forecasting, scenario analysis). Progress on this benchmark would more directly translate into safer productivity gains in finance.
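As a rough illustration of the "red-flagging hardcoded results" idea, the sketch below runs two deterministic checks over a toy cell map. It is not the paper's judge, which is LLM-based; the helper names, the dictionary representation of a sheet, and the sample data are all hypothetical, and the range check handles only simple `SUM(A1:A9)`-style references.

```python
import re

# Matches a simple single-column or rectangular range inside SUM(...).
SUM_RANGE = re.compile(r"SUM\(\s*([A-Z]+)(\d+):([A-Z]+)(\d+)\s*\)", re.IGNORECASE)


def flag_hardcoded(cells, computed_cells):
    """Return addresses expected to hold formulas that hold literals.

    A cell that is missing from the map is flagged as well, since it
    does not contain a formula either.
    """
    return sorted(
        address
        for address in computed_cells
        if not str(cells.get(address, "")).startswith("=")
    )


def sum_range_mismatch(formula, first_row, last_row):
    """Return True if the first SUM range does not span first_row..last_row."""
    match = SUM_RANGE.search(formula)
    if match is None:
        return False  # nothing to check
    start_row, end_row = int(match.group(2)), int(match.group(4))
    return (start_row, end_row) != (first_row, last_row)


# Hypothetical sheet: revenue in B2:B4, growth in column C, total in B5.
sheet = {
    "B2": "100",
    "B3": "120",
    "B4": "135",
    "B5": "=SUM(B2:B3)",   # off-by-one: omits B4
    "C3": "=B3/B2-1",
    "C4": "0.125",         # pasted value instead of =B4/B3-1
}

hardcoded = flag_hardcoded(sheet, ["B5", "C3", "C4"])
total_is_short = sum_range_mismatch(sheet["B5"], 2, 4)
```

Checks like these catch only the mechanical cases; readability or decomposition of formulas would still need a judge of the kind the summary describes.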
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. | Other | positive | high | expectations of agent capabilities (trend) | 0.03 |
| Frontier AI labs have developed agents that can construct entire spreadsheets from scratch. | Output Quality | positive | high | agent capability to construct spreadsheets | 0.09 |
| Existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. | Adoption Rate | negative | high | coverage of benchmark tasks (end-to-end spreadsheet construction vs QA/single-formula) | 0.18 |
| We provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. | Research Productivity | positive | high | existence of evaluation on end-to-end spreadsheet tasks | 0.18 |
| We develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. | Other | positive | high | evaluation criteria/taxonomy | 0.3 |
| The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review. | Output Quality | positive | high | output professionalism/quality | 0.18 |
| Even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. | Output Quality | negative | high | performance relative to professional standards as task complexity increases | 0.18 |
| Current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand. | Output Quality | negative | high | reliability of producing professional-quality spreadsheets for real-world complexity | 0.18 |