A high-fidelity benchmark of junior investment-banker workflows shows frontier AI still far from ready for delegation: the best model fails roughly half of task criteria and produces no client-ready outputs in banker assessments; failures cluster on cross-document consistency and tool-based data retrieval.

BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia, Yi Liu, Scott Millslagle, Nasim Borazjanizadeh, Ulyana Tkachenko, Samuel Eshun Danquah, Collin Schweiker, Vijay Karumathil, Asrith Devalaraju, Varsha Sandadi, Haemi Nam, Punit Arani, Ray Epps, Abdullah Arif, Sahil Bhaiwala, Curtis Northcutt, Skyler Wang, Anish Athalye, Jonas Mueller, Francisco Guzmán · April 13, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: low · Relevance: 7/10

BankerToolBench finds that frontier LLMs perform substantially below human standards on end-to-end junior investment-banker workflows—GPT-5.4 fails nearly half the rubric criteria and bankers judged none of its outputs client-ready.

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

Summary

Main Finding

BankerToolBench (BTB) shows that current frontier LLMs and agentic systems are not yet reliable enough to autonomously perform end-to-end junior investment banking workflows. Even the best model tested (GPT‑5.4) failed nearly half of veteran‑banker rubric criteria and produced 0% client‑ready outputs by banker judgement. BTB thus demonstrates a large gap between high scores on conventional benchmarks and the requirements for economically valuable delegation in a high‑stakes professional domain.

Key Points

Purpose: BTB is an open‑source benchmark that measures agents on full investment‑banking workflows (from request to multi‑file client deliverables) rather than isolated language tasks.
Professional grounding: Tasks, prompts and rubrics were developed with direct input from hundreds of current/former bankers (502 collaborators) to ensure ecological validity.
Scope and realism:
- 100 tasks focused on junior‑banker work (M&A, LevFin, ECM, DCM), with human completion time averaging 5 hours and up to 21 hours per task.
- Deliverables include Excel financial models (multi‑tab), PowerPoint pitch decks, and Word/PDF memos—files must be accurate, auditable, polished, and internally consistent.
Granular evaluation:
- Each task has a detailed rubric of binary Pass/Fail items, each weighted by importance (1, 3, 5, or 10); rubrics average ~150 criteria per task, and 100+ is typical.
- An LLM‑based agentic verifier automatically scores outputs against rubrics to enable reproducible evaluation.
Environment & tools:
- Packaged as a reinforcement‑learning environment in the Harbor framework; agents get a data room (preloaded files), preinstalled file‑manipulation software/libraries, and three MCP tools: market data API, SEC EDGAR API, and company profile API.
- Tasks are grounded to a historical date (no internet access), exposing real data quirks (gaps, nonstandard reporting).
- Agents made up to 539 LLM calls per task, and ~97% of steps involved tool calls or code execution.
Empirical result snapshot:
- Evaluated 9 frontier models/agent harnesses.
- Best performer (GPT‑5.4) still failed nearly half the rubric criteria and bankers judged 0% of its outputs as client‑ready.
Common failure modes identified:
- Cross‑artifact inconsistency (e.g., mismatched numbers/labels between Excel and slides).
- Financial model integrity problems (hardcoded values, broken links, incorrect formulas).
- Poor instruction‑following and missing required elements.
- Numerical errors, hallucinated or improperly retrieved data, and brittle tool integration.
How BTB differs from prior benchmarks:
- Focused depth on one profession vs. occupational breadth.
- Requires multi‑file, end‑to‑end outputs and fine‑grained, utility‑oriented grading rather than short text or QA.
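
The cross‑artifact inconsistency failure mode listed above can be illustrated with a small check. This is a minimal sketch, not the benchmark's verifier: it assumes the headline figures have already been extracted from each deliverable into plain dictionaries, and the function name `find_mismatches`, the labels, and the numbers are all hypothetical.

```python
from math import isclose


def find_mismatches(model_values, deck_values, rel_tol=1e-3):
    """Return (label, model_value, deck_value) for every figure that is
    missing from one artifact or differs beyond the relative tolerance."""
    mismatches = []
    for label in sorted(set(model_values) | set(deck_values)):
        m = model_values.get(label)
        d = deck_values.get(label)
        if m is None or d is None or not isclose(m, d, rel_tol=rel_tol):
            mismatches.append((label, m, d))
    return mismatches


# Hypothetical figures, assumed to be extracted from the Excel model and the slides.
excel_model = {
    "Enterprise value ($M)": 1250.0,
    "EV/EBITDA (x)": 8.4,
    "Net debt ($M)": 310.0,
}
pitch_deck = {
    "Enterprise value ($M)": 1250.0,
    "EV/EBITDA (x)": 8.9,
}

for label, m, d in find_mismatches(excel_model, pitch_deck):
    print(f"{label}: model={m} deck={d}")
```

With these inputs the check flags the EV/EBITDA multiple (8.4 in the model, 8.9 on the slide) and the net debt figure that never reached the deck. The relative tolerance is a design choice that absorbs display rounding; a real check would also need to reconcile units and label wording across files.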

Data & Methods

Task set: 100 realistic junior‑banker tasks derived from a stratified industry survey and expert input; distribution includes:
- Product groups: M&A 62%, Leveraged Finance 19%, ECM 10%, DCM 6%, mixed 3%.
- Workflow categories: Financial modeling & scenarios 37%, Valuation & pricing 30%, Client & marketing materials 27%, others 6%.
- Examples: DCFs, LBO/credit models, trading comparables, pitchbooks, teasers, sources & uses, sensitivity analyses.
Environment:
- The Harbor reinforcement‑learning environment (RLE) orchestrates runs in isolated sandboxes with preinstalled tools and libraries for Excel/PPT/PDF manipulation (LibreOffice, openpyxl, python-pptx, pandas, numpy, etc.).
- Data rooms: PDFs, spreadsheets, slides, images bundled per task; MCP tools emulate market data and SEC filings constrained to historical cutoffs.
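
One way to picture the historical cutoff is a point‑in‑time filter over dated records. The sketch below is an illustration under stated assumptions, not the benchmark's MCP tool implementation: the record shape, the `visible_as_of` helper, and the sample filings are hypothetical.

```python
from datetime import date


def visible_as_of(records, as_of):
    """Keep only records published on or before the task's grounding date,
    newest first, so nothing after the cutoff can leak into the analysis."""
    visible = [r for r in records if r["published"] <= as_of]
    return sorted(visible, key=lambda r: r["published"], reverse=True)


# Hypothetical filing index for one company.
filings = [
    {"form": "10-K", "published": date(2023, 2, 24)},
    {"form": "10-Q", "published": date(2023, 5, 5)},
    {"form": "8-K", "published": date(2023, 8, 1)},
]

task_date = date(2023, 6, 30)  # assumed grounding date for the task
for filing in visible_as_of(filings, task_date):
    print(filing["form"], filing["published"].isoformat())
```

Here the 8‑K published after the grounding date is hidden, while the 10‑Q and 10‑K remain available.
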
Rubrics and verifier:
- Expert‑authored rubrics: binary checks with weighted importance to reflect stakeholder utility and client readiness; average ~150 criteria per task.
- Agentic verifier: an LLM‑powered system that programmatically evaluates deliverables against rubric items to produce reproducible scoring.
Evaluation:
- Multiple agent harnesses and 9 frontier models were run on the benchmark; agent behavior involved heavy tool access and code execution.
- Outputs were graded by the automated verifier and also subject to banker judgement for client readiness.
- Human reference outputs (expected deliverables) were created by bankers and are provided for reference; they are not visible to agents during runs.
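
The weighted binary rubric described above can be sketched as a simple aggregation. The summary states that items are Pass/Fail with importance weights of 1, 3, 5, or 10, but it does not give the aggregation formula, so the weight‑normalized score below is an assumption; the `Criterion` class, function names, and example criteria are hypothetical.

```python
from dataclasses import dataclass

ALLOWED_WEIGHTS = {1, 3, 5, 10}  # importance levels named in the summary


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: int
    passed: bool


def weighted_score(criteria):
    """Share of total rubric weight earned by passing criteria (assumed formula)."""
    if not criteria:
        raise ValueError("rubric has no criteria")
    for c in criteria:
        if c.weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"unexpected weight {c.weight} for {c.description!r}")
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


def pass_rate(criteria):
    """Unweighted share of criteria that passed."""
    return sum(c.passed for c in criteria) / len(criteria)


# Hypothetical rubric excerpt for a DCF task.
rubric = [
    Criterion("Balance sheet balances in every projected year", 10, True),
    Criterion("Slide figures match the Excel model", 5, False),
    Criterion("WACC inputs cite a source", 3, True),
    Criterion("Consistent number formatting", 1, True),
]

print(f"weighted score: {weighted_score(rubric):.3f}")
print(f"pass rate: {pass_rate(rubric):.3f}")
```

In this example three of four criteria pass (a 0.75 pass rate), while the weighted score is 14/19, because the failed item carries a weight of 5. Reporting both numbers shows whether failures fall on high‑importance items.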

Implications for AI Economics

Delegation readiness is not just capability; it requires high reliability, cross‑artifact consistency, and auditable numerical correctness. Current models are far from the threshold where firms can safely delegate end‑to‑end, high‑stakes analytical work.
Economic potential vs. risk:
- Investment banking is high‑value (industry fees are cited at more than $140B in 2025). Reliable automation of junior workflows could yield substantial productivity and cost effects, but partial or error‑prone automation risks large financial and reputational harm.
Adoption path:
- Near term: safer deployments are human‑in‑the‑loop tools for assistance (drafting, data retrieval, first‑pass calculations) combined with mandatory expert verification and stronger toolchains for auditing results.
- Medium term: progress requires domain‑specific fine‑tuning, improved tool integration, verifiers that can certify cross‑file integrity, and models trained on end‑to‑end workflows with reward signals aligned to professional rubrics.
Research directions with economic relevance:
- Develop and benchmark systems on domain‑specific, end‑to‑end tasks tied to economic value (like BTB) rather than proxy tasks.
- Improve automatic verifiers and formal checks of numerical/model integrity to reduce human verification costs and enable safer delegation.
- Study how model improvements translate into real productivity gains, reallocation of labor (task shifting vs. job displacement), and changes in price and quality of financial services.
- Quantify adoption thresholds: what reliability/verification guarantees are necessary for firms to offload particular types of work?
Policy and firm considerations:
- Regulators and firms should require auditable workflows and thorough testing on representative, high‑stakes tasks before permitting autonomous delegation.
- Investment in upskilling and in-house tooling (verifiers, guarded tool APIs, historical grounding) will influence how automation affects labor demand and value capture.
Value of benchmarking for economic assessment:
- BTB exemplifies how capability‑based, profession‑specific benchmarks can provide more informative proxies for economic impact than general NLP/LLM metrics. Measuring progress on such benchmarks helps map AI capability improvements to plausible economic outcomes and risks.

Assessment

Paper Type: descriptive
Evidence Strength: low. The paper provides a benchmarked performance evaluation of LLMs on simulated, high-fidelity professional tasks rather than causal estimates of AI's impact on real-world productivity, wages, or firm outcomes; results are descriptive of model capabilities on the benchmark but do not establish economic effects.
Methods Rigor: medium. The benchmark was developed with substantial domain input (502 investment bankers), uses multi-file, end-to-end tasks and a large rubric (100+ criteria) and tests multiple frontier models, which supports internal validity and ecological fidelity; however, potential biases remain from simulated tasks, rubric design, model/agent configuration heterogeneity, limited model sample, and the lack of field validation of downstream economic impacts.
Sample: Benchmark tasks and rubrics were developed in collaboration with 502 investment bankers from leading firms; tasks emulate junior investment-banker workflows requiring navigation of data rooms and industry tools (market-data platform, SEC filings), producing multi-file deliverables (Excel financial models, PowerPoint decks, PDF/Word reports); automated scoring against 100+ banker-defined criteria; nine frontier models evaluated (including GPT-5.4); banker ratings used to assess client-readiness.
Themes: productivity, human_ai_collab
Generalizability:
- Focused on junior investment-banking workflows at leading firms; results may not generalize to other professions, industries, or senior-level tasks.
- Tasks are high-fidelity simulations rather than measurements of real client engagements or in-production delegation, so real-world performance and downstream economic effects are uncertain.
- Rubric criteria and assessments reflect veteran banker norms and may be firm- or region-specific.
- Evaluated models and agent setups are a snapshot in time; rapid model and tool-chain changes may alter findings.
- Automated scoring may miss qualitative aspects of deliverable utility judged in practice.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows.	Other	negative	high	fidelity of AI benchmarks to professional workflows	0.03
We collaborated with 502 investment bankers from leading firms to develop an ecologically valid benchmark grounded in representative work environments.	Other	null_result	high	number of investment bankers collaborating on benchmark development	n=502 0.3
BankerToolBench (BTB) is an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers.	Other	null_result	high	existence and scope of the BTB benchmark	0.3
BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables including Excel financial models, PowerPoint pitch decks, and PDF/Word reports.	Other	null_result	high	types of tasks and deliverables required by BTB	0.3
Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI.	Task Completion Time	null_result	high	time to complete a BTB task (hours)	up to 21 hours 0.18
BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.	Other	positive	high	number of rubric criteria for automated evaluation	100+ rubric criteria 0.3
We tested 9 frontier models on BTB.	Other	null_result	high	number of models evaluated	n=9 0.3
Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria.	Output Quality	negative	high	rubric criteria pass/fail rate for GPT-5.4	fails nearly half of the rubric criteria 0.18
Bankers rate 0% of GPT-5.4's outputs as client-ready.	Output Quality	negative	high	proportion of model outputs rated as client-ready by bankers	0% 0.18
Failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.	Error Rate	negative	high	types of failure modes encountered (e.g., cross-artifact consistency issues)	0.18