A high-fidelity benchmark of junior investment-banker workflows shows frontier AI still far from ready for delegation: the best model fails roughly half of task criteria and produces no client-ready outputs in banker assessments; failures cluster on cross-document consistency and tool-based data retrieval.
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Summary
Main Finding
BankerToolBench (BTB) shows that current frontier LLMs and agentic systems are not yet reliable enough to autonomously perform end-to-end junior investment banking workflows. Even the best model tested (GPT‑5.4) failed nearly half of veteran‑banker rubric criteria and produced 0% client‑ready outputs by banker judgement. BTB thus demonstrates a large gap between high scores on conventional benchmarks and the requirements for economically valuable delegation in a high‑stakes professional domain.
Key Points
- Purpose: BTB is an open‑source benchmark that measures agents on full investment‑banking workflows (from request to multi‑file client deliverables) rather than isolated language tasks.
- Professional grounding: Tasks, prompts, and rubrics were developed with direct input from 502 current and former investment bankers to ensure ecological validity.
- Scope and realism:
  - 100 tasks focused on junior‑banker work (M&A, LevFin, ECM, DCM), with human completion time averaging 5 hours and up to 21 hours per task.
  - Deliverables include Excel financial models (multi‑tab), PowerPoint pitch decks, and Word/PDF memos; files must be accurate, auditable, polished, and internally consistent.
- Granular evaluation:
  - Each task has a detailed rubric of binary Pass/Fail items, each weighted by importance (1, 3, 5, or 10); rubrics average about 150 criteria per task, and 100+ is typical.
  - An LLM‑based agentic verifier automatically scores outputs against the rubrics to enable reproducible evaluation.
- Environment & tools:
  - Packaged as a reinforcement‑learning environment in the Harbor framework; agents get a data room (preloaded files), preinstalled file‑manipulation software/libraries, and three MCP tools: market data API, SEC EDGAR API, and company profile API.
  - Tasks are grounded to a historical date (no internet access), exposing real data quirks (gaps, nonstandard reporting).
  - Agents required up to 539 LLM calls per task; ~97% of steps involve tool calls or code execution.
- Empirical result snapshot:
  - Evaluated 9 frontier models/agent harnesses.
  - Best performer (GPT‑5.4) still failed nearly half the rubric criteria and bankers judged 0% of its outputs as client‑ready.
- Common failure modes identified:
  - Cross‑artifact inconsistency (e.g., mismatched numbers/labels between Excel and slides).
  - Financial model integrity problems (hardcoded values, broken links, incorrect formulas).
  - Poor instruction‑following and missing required elements.
  - Numerical errors, hallucinated or improperly retrieved data, and brittle tool integration.
- How BTB differs from prior benchmarks:
  - Focused depth on one profession vs. occupational breadth.
  - Requires multi‑file, end‑to‑end outputs and fine‑grained, utility‑oriented grading rather than short text or QA.
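The weighted, binary rubric scheme described under Granular evaluation can be illustrated with a short sketch. The summary names the weights (1, 3, 5, 10) and the Pass/Fail verdicts but does not state how BTB aggregates them, so the formula below (passed weight divided by total weight) is an assumption, and the `Criterion` class, function names, and rubric items are hypothetical.

```python
from dataclasses import dataclass

# Importance weights named in the summary; everything else here is illustrative.
ALLOWED_WEIGHTS = {1, 3, 5, 10}


@dataclass(frozen=True)
class Criterion:
    description: str  # hypothetical rubric item text
    weight: int       # importance: 1, 3, 5, or 10
    passed: bool      # binary Pass/Fail verdict


def weighted_pass_rate(criteria):
    """Share of total rubric weight earned by passing items (assumed aggregation)."""
    if not criteria:
        raise ValueError("rubric has no criteria")
    for c in criteria:
        if c.weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"unexpected weight {c.weight!r}")
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


def unweighted_pass_rate(criteria):
    """Plain fraction of items that passed, ignoring importance."""
    if not criteria:
        raise ValueError("rubric has no criteria")
    return sum(1 for c in criteria if c.passed) / len(criteria)


# Made-up rubric: one high-importance failure, three lower-importance passes.
example_rubric = [
    Criterion("Revenue on slides matches the Excel model", 10, False),
    Criterion("Sources & uses table balances", 5, True),
    Criterion("Footnotes cite the data source", 3, True),
    Criterion("Slide titles follow the template", 1, True),
]

if __name__ == "__main__":
    print(f"unweighted: {unweighted_pass_rate(example_rubric):.2f}")
    print(f"weighted:   {weighted_pass_rate(example_rubric):.2f}")
```

In this made-up rubric three of four items pass (75% unweighted), yet the weighted rate is 9/19, roughly 47%, because the single failed item carries the highest weight. That is the kind of gap an importance-weighted rubric is meant to expose.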
Data & Methods
- Task set: 100 realistic junior‑banker tasks derived from a stratified industry survey and expert input; distribution includes:
  - Product groups: M&A 62%, Leveraged Finance 19%, ECM 10%, DCM 6%, mixed 3%.
  - Workflow categories: Financial modeling & scenarios 37%, Valuation & pricing 30%, Client & marketing materials 27%, others 6%.
  - Examples: DCFs, LBO/credit models, trading comparables, pitchbooks, teasers, sources & uses, sensitivity analyses.
- Environment:
  - The Harbor reinforcement‑learning environment (RLE) orchestrates runs in isolated sandboxes with preinstalled tools and libraries for Excel/PPT/PDF manipulation (LibreOffice, openpyxl, python‑pptx, pandas, numpy, etc.).
  - Data rooms: PDFs, spreadsheets, slides, and images bundled per task; MCP tools emulate market data and SEC filings, constrained to historical cutoffs.
- Rubrics and verifier:
  - Expert‑authored rubrics: binary checks with weighted importance to reflect stakeholder utility and client readiness; average ~150 criteria per task.
  - Agentic verifier: an LLM‑powered system that programmatically evaluates deliverables against rubric items to produce reproducible scoring.
- Evaluation:
  - Multiple agent harnesses and 9 frontier models were run on the benchmark; agents relied heavily on tool calls and code execution (~97% of steps).
  - Outputs were graded by the automated verifier and also assessed by bankers for client readiness.
  - Bankers created human reference outputs (expected deliverables), which are provided for reference but are not visible to agents during runs.
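The historical grounding described under Environment can be pictured as a date filter placed in front of each data tool. The sketch below is only an illustration of that idea over an in-memory list: the `Filing` record, the `filings_as_of` function, the `ACME` ticker, and all dates are hypothetical, and the actual interfaces of BTB's MCP tools are not described in this summary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Filing:
    ticker: str      # hypothetical company identifier
    form_type: str   # e.g. "10-K" or "10-Q"
    filed_on: date   # date the filing became public


def filings_as_of(filings, ticker, as_of):
    """Return filings for `ticker` filed on or before `as_of`, newest first.

    Anything filed after the cutoff is hidden, so an agent working at the
    task's historical date cannot see later disclosures.
    """
    visible = [f for f in filings if f.ticker == ticker and f.filed_on <= as_of]
    return sorted(visible, key=lambda f: f.filed_on, reverse=True)


# Made-up filing history for a made-up company.
sample_filings = [
    Filing("ACME", "10-K", date(2023, 2, 15)),
    Filing("ACME", "10-Q", date(2023, 5, 10)),
    Filing("ACME", "10-Q", date(2023, 8, 9)),
    Filing("OTHR", "10-K", date(2023, 3, 1)),
]

if __name__ == "__main__":
    for f in filings_as_of(sample_filings, "ACME", date(2023, 6, 30)):
        print(f.form_type, f.filed_on.isoformat())
```

With a cutoff of 30 June 2023, the August 10-Q in this made-up history stays hidden, so the agent has to work from the May 10-Q and the February 10-K.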
Implications for AI Economics
- Delegation readiness is not just capability; it requires high reliability, cross‑artifact consistency, and auditable numerical correctness. Current models are far from the threshold where firms can safely delegate end‑to‑end, high‑stakes analytical work.
- Economic potential vs. risk:
  - Investment banking is high‑value (industry fees are cited at more than $140B in 2025). Reliable automation of junior workflows could yield substantial productivity and cost effects, but partial or error‑prone automation risks large financial and reputational harm.
- Adoption path:
  - Near term: safer deployments are human‑in‑the‑loop tools for assistance (drafting, data retrieval, first‑pass calculations) combined with mandatory expert verification and stronger toolchains for auditing results.
  - Medium term: progress requires domain‑specific fine‑tuning, improved tool integration, verifiers that can certify cross‑file integrity, and models trained on end‑to‑end workflows with reward signals aligned to professional rubrics.
- Research directions with economic relevance:
  - Develop and benchmark systems on domain‑specific, end‑to‑end tasks tied to economic value (like BTB) rather than proxy tasks.
  - Improve automatic verifiers and formal checks of numerical/model integrity to reduce human verification costs and enable safer delegation.
  - Study how model improvements translate into real productivity gains, reallocation of labor (task shifting vs. job displacement), and changes in price and quality of financial services.
  - Quantify adoption thresholds: what reliability/verification guarantees are necessary for firms to offload particular types of work?
- Policy and firm considerations:
  - Regulators and firms should require auditable workflows and thorough testing on representative, high‑stakes tasks before permitting autonomous delegation.
  - Investment in upskilling and in-house tooling (verifiers, guarded tool APIs, historical grounding) will influence how automation affects labor demand and value capture.
- Value of benchmarking for economic assessment:
  - BTB exemplifies how capability‑based, profession‑specific benchmarks can provide more informative proxies for economic impact than general NLP/LLM metrics. Measuring progress on such benchmarks helps map AI capability improvements to plausible economic outcomes and risks.
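The cross‑file integrity checks mentioned above could start from something as simple as comparing figures that appear under the same label in two deliverables. The sketch below assumes the values have already been extracted into plain dictionaries (extraction from Excel or PowerPoint is out of scope here); `find_mismatches`, the labels, and the numbers are hypothetical and do not describe BTB's verifier.

```python
import math


def find_mismatches(model_values, deck_values, rel_tol=1e-6):
    """List figures shown in the deck that disagree with, or are absent from, the model.

    Both arguments map a label (str) to a number. Labels that exist only in
    the model are ignored, since not every model cell appears on a slide.
    """
    issues = []
    for label in sorted(deck_values):
        if label not in model_values:
            issues.append(f"{label}: shown in deck but absent from model")
        elif not math.isclose(model_values[label], deck_values[label], rel_tol=rel_tol):
            issues.append(
                f"{label}: model={model_values[label]} deck={deck_values[label]}"
            )
    return issues


# Made-up figures, in $ millions.
model = {"Enterprise value": 1250.0, "EBITDA": 180.0, "Net debt": 300.0}
deck = {"Enterprise value": 1250.0, "EBITDA": 185.0, "Equity value": 950.0}

if __name__ == "__main__":
    for issue in find_mismatches(model, deck):
        print(issue)
```

A check like this only catches disagreements between values that share an exact label. Mismatched labels, rounding conventions, and unit differences would each need their own handling, which is part of why cross‑artifact consistency is a hard target for automated verification.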
Assessment
Claims (10)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. | Other | negative | high | fidelity of AI benchmarks to professional workflows | 0.03 |
| We collaborated with 502 investment bankers from leading firms to develop an ecologically valid benchmark grounded in representative work environments. | Other | null_result | high | number of investment bankers collaborating on benchmark development | n=502; 0.3 |
| BankerToolBench (BTB) is an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. | Other | null_result | high | existence and scope of the BTB benchmark | 0.3 |
| BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. | Other | null_result | high | types of tasks and deliverables required by BTB | 0.3 |
| Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. | Task Completion Time | null_result | high | time to complete a BTB task (hours) | up to 21 hours; 0.18 |
| BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. | Other | positive | high | number of rubric criteria for automated evaluation | 100+ rubric criteria; 0.3 |
| We tested 9 frontier models on BTB. | Other | null_result | high | number of models evaluated | n=9; 0.3 |
| Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria. | Output Quality | negative | high | rubric criteria pass/fail rate for GPT-5.4 | fails nearly half of the rubric criteria; 0.18 |
| Bankers rate 0% of GPT-5.4's outputs as client-ready. | Output Quality | negative | high | proportion of model outputs rated as client-ready by bankers | 0%; 0.18 |
| Failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows. | Error Rate | negative | high | types of failure modes encountered (e.g., cross-artifact consistency issues) | 0.18 |