A new end-to-end benchmark for enterprise document-AI finds hybrid retrieval narrowly bests BM25 (nDCG@5 0.92 vs 0.91) and both outperform dense embedding (0.83), but parsing and retrieval quality weakly predict final answer correctness and systems often omit relevant facts despite high stated factual accuracy (85.5% vs completeness 0.40).
Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation r=0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
Summary
Main Finding
EnterpriseDocBench introduces a unified, four-axis evaluation framework for end-to-end enterprise document AI (parsing, indexing, retrieval, generation) and a permissively licensed pilot corpus. In the pilot runs (1,169-doc test set, 500–800 query subsets), hybrid retrieval barely outperforms BM25 (nDCG@5 0.92 vs. 0.91), and both substantially outperform dense embeddings (0.83). Crucially, cross-stage quality correlations are very weak (parsing→retrieval r = 0.14; parsing→generation r = 0.17; retrieval→generation r = 0.02), and while factual accuracy on stated claims is high (85.5%), answer completeness is low (mean 0.40). The authors emphasize that these results are suggestive and come with important design caveats (fixed parsing, shared generator, automated proxies).
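To make "very weak" concrete, each Pearson r can be squared to give the share of variance in the downstream metric that a linear relationship with the upstream metric would account for. The sketch below only does that arithmetic on the three reported values; the dictionary keys and the function name are illustrative labels, not names from the benchmark.

```python
# Cross-stage Pearson correlations as reported for the pilot.
correlations = {
    "parsing->retrieval": 0.14,
    "parsing->generation": 0.17,
    "retrieval->generation": 0.02,
}


def variance_explained(r: float) -> float:
    """Share of variance accounted for by a linear fit with correlation r."""
    return r ** 2


shares = {pair: variance_explained(r) for pair, r in correlations.items()}

for pair, share in shares.items():
    print(f"{pair}: r^2 = {share:.4f} ({share:.2%} of variance)")
```

All three pairs come out below 3% of variance, with retrieval→generation at roughly 0.04%.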
Key Points
- Framework: Four evaluation axes with formal metrics and validated proxies:
- Parsing fidelity (Pfidelity = 0.40·TIS + 0.30·TEA + 0.15·FCQ + 0.15·LF).
- Indexing efficiency (throughput, build latency, storage bytes/page, $/page/year cost model).
- Retrieval relevance (Precision@k, nDCG@k, MRR; query modalities specified).
- Generation groundedness (G = 0.30·FA + 0.25·SAP + 0.15·SAR + 0.20·(1−HR) + 0.10·AC).
- Pilot corpus and scale:
- 1,459 documents in pilot; 1,169-document test split used for main analyses.
- Domains: General (47%), Healthcare (22%), Tech (17%), Finance (8%), Legal (3%), Manufacturing (very small, n=4).
- 800 examples used for generation assessment; 500 queries used for pipeline comparison.
- Implemented pipelines (shared GPT-5 generator; parsing pre-extracted):
- BM25 baseline (indexing cost multiplier 1.0×).
- Dense embeddings (E5-large; 1.5× cost).
- Hybrid fusion (0.5/0.5 interpolation; 1.6× cost).
- Retrieval results (500 queries):
- nDCG@5: Hybrid 0.92, BM25 0.91, Dense 0.83.
- P@3 range: 0.31–0.34.
- Generation results (800 examples, GPT-5):
- Factual accuracy FA = 0.85 (95% CI 0.84–0.87).
- Hallucination rate HR = 0.15.
- Source attribution precision/recall SAP/SAR ≈ 0.61 (varies by subset).
- Answer completeness AC = 0.40 (mean).
- Aggregate G = 0.71.
- Surprising empirical patterns:
- Very weak cross-stage correlations (parsing and retrieval explain <3% of downstream variance).
- Hallucination vs. context length shows a U-shaped pattern: short and very long contexts hallucinate more than medium-length ones (directional, not decisive, because sample sizes at the extremes are small).
- Systems tend to be factually correct on claims they make but omit relevant content (high FA vs. low AC).
- Known limitations:
- Parsing was held fixed (pre-extracted text), the generator was shared across pipelines, and many metrics are automated proxies for human judgment.
- Domain imbalance and query-distribution mismatch with real enterprise workflows.
- Pilot not fully representative of proprietary enterprise documents.
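The groundedness formula in the list above can be checked against the reported component scores. The following is a minimal sketch, assuming SAP and SAR are both 0.61 (the summary gives them only as approximately 0.61 and notes that they vary by subset); the function name is hypothetical and not taken from the benchmark code.

```python
def groundedness(fa: float, sap: float, sar: float, hr: float, ac: float) -> float:
    """G = 0.30*FA + 0.25*SAP + 0.15*SAR + 0.20*(1 - HR) + 0.10*AC."""
    return 0.30 * fa + 0.25 * sap + 0.15 * sar + 0.20 * (1.0 - hr) + 0.10 * ac


# Reported pilot values; SAP = SAR = 0.61 is an assumption (see lead-in).
g = groundedness(fa=0.85, sap=0.61, sar=0.61, hr=0.15, ac=0.40)
print(f"G = {g:.3f}")

# Hypothetical what-if: completeness doubles while everything else stays fixed.
g_what_if = groundedness(fa=0.85, sap=0.61, sar=0.61, hr=0.15, ac=0.80)
print(f"G with AC = 0.80: {g_what_if:.3f} (change of {g_what_if - g:+.3f})")
```

Under these assumptions the components give G ≈ 0.709, in line with the reported aggregate of 0.71. Because AC carries a weight of only 0.10, doubling completeness would move G by just 0.04, which is one reason the completeness gap is easy to miss in the aggregate score.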
Data & Methods
- Dataset construction:
- Public, permissively licensed documents (SEC EDGAR, CUAD, PubMed/PMC, arXiv, patents, etc.).
- Semi-automated QA generation (LLM proposals + two human annotators; disagreements dropped).
- Stratified sampling to reduce domain dominance; IAA target κ ≥ 0.85 for QA generation checks.
- Splits: train/val/test = 10/10/80 (no models trained on data).
- Metric validation:
- Parsing sub-metrics validated against public benchmarks (TIS vs. TEDS r = 0.89; TEA on TableBank 0.91 native, 0.68 scanned; LF vs. DocBank r = 0.82).
- Generation entailment pipeline validated (cross-encoder accuracy ~0.85; human entailment agreement ~0.79).
- Hallucination annotation: two raters on 400 (answer, context) pairs, Cohen’s κ = 0.73.
- Evaluation setup:
- Three pipelines were executed with the same GPT-5 generator; parsing fidelity was computed against pre-extracted text (Pfidelity = 0.82, constant across pipelines).
- Retrieval-focused Quality composite used for pipeline ranking: Quality = 0.25·Parsing + 0.50·Retrieval + 0.25·Generation (reflects retrieval being the only varied axis).
- Cost model uses April 2026 pricing; reported relative indexing costs: BM25 1.0×, Dense 1.5×, Hybrid 1.6×.
- Reproducibility:
- Framework, metrics, baselines, and collection scripts are to be open-sourced on acceptance; the results come from a pilot release, and the authors describe plans to scale it up.
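The retrieval-weighted Quality composite described above can be sketched as follows. Two assumptions go beyond what the summary states: nDCG@5 stands in for each pipeline's retrieval score, and the aggregate G = 0.71 is reused as the generation score for every pipeline. The paper's own per-pipeline inputs may differ, and the names below are illustrative.

```python
def quality(parsing: float, retrieval: float, generation: float) -> float:
    """Quality = 0.25*Parsing + 0.50*Retrieval + 0.25*Generation."""
    return 0.25 * parsing + 0.50 * retrieval + 0.25 * generation


PARSING = 0.82     # Pfidelity, constant across pipelines
GENERATION = 0.71  # assumption: aggregate G applied to every pipeline

# Assumption: nDCG@5 is used as the retrieval-axis score.
ndcg_at_5 = {"BM25": 0.91, "Dense": 0.83, "Hybrid": 0.92}

scores = {name: quality(PARSING, r, GENERATION) for name, r in ndcg_at_5.items()}

for name, score in scores.items():
    print(f"{name}: Quality = {score:.4f}")
```

With these inputs BM25 and Hybrid differ by only 0.005, since a 0.01 nDCG gap is halved by the 0.50 retrieval weight. The BM25 value of 0.8375 is close to the 0.84 Quality figure quoted for BM25 in this summary.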
Implications for AI Economics
- Cost vs. quality trade-offs are concrete and modestly quantified:
- Simple keyword retrieval (BM25) remains highly cost-effective — near parity with the hybrid in end-to-end Quality (0.84) at 1.0× indexing cost.
- Dense embedding pipelines are costlier (1.5×) and underperform on retrieval relevance in this pilot; hybrids add cost (1.6×) but only modestly improve retrieval over BM25.
- Economic takeaway: investment in dense or hybrid retrieval must be justified by downstream gains beyond nDCG (e.g., re-ranking for multi-modal queries, latency, or specific domain gains).
- Weak inter-stage correlations change investment priorities:
- If parsing improvements explain <3% of downstream variance in similar settings, large capital expenditure on marginal parser upgrades may yield small end-user gains unless downstream components or generator behavior are changed simultaneously.
- The near-zero retrieval→generation correlation suggests marginal improvements in ranking metrics (nDCG) do not necessarily translate to better final answers; economic ROI should be evaluated on end-to-end business KPIs (task completion, human review time saved), not intermediate IR metrics alone.
- Completeness gap increases operational risk and human supervision costs:
- High factual accuracy but low completeness implies systems frequently omit relevant information, shifting workload to human reviewers to triage/complete answers. This raises recurring labor costs and compliance risk in regulated domains (legal, finance, healthcare).
- Procurement and SLA design should include completeness and coverage metrics (not just correctness or hallucination rates).
- Hallucination behavior and context length affect product design and costs:
- The U-shaped hallucination vs. context-length finding suggests there are sweet spots for context assembly; providing too little or too much context may increase hallucination and downstream verification costs.
- Engineering effort to chunk context intelligently or to build length-aware retrieval/generation logic could be cost-effective.
- Evaluation design affects perceived value:
- Because the study held parsing and generator fixed, buyers and engineering teams should require end-to-end benchmarks that vary the components they plan to change; otherwise investment decisions could be misinformed.
- Practical procurement and staffing recommendations:
- Insist on end-to-end task-based benchmarks in procurement (including completeness and human review time).
- Prioritize investments that demonstrably reduce human-in-the-loop costs (completeness, extractive coverage, reliable attribution) rather than optimizing single-axis metrics.
- Consider hybrid operational architectures (agentic routing, selective heavier pipelines on complex queries) to balance cost and service-level targets — authors propose such reference architectures but haven’t yet measured them end-to-end.
- Regulatory and liability considerations:
- Answers that are factually accurate but incomplete create distinct liability risks (e.g., omitted contract clauses). Economic models for deployment should include expected remediation/inspection costs and potential regulatory penalties.
- Research & engineering investments with economic upside:
- Integrate alternative parsers and generator variants in end-to-end testing — likely to change correlation patterns and hence ROI calculations.
- Invest in metrics and tooling that measure completeness and false omissions, not just hallucination or per-claim accuracy.
- Build selective routing/agentic systems to apply costlier resources only where expected marginal benefit is high (the paper provides three reference architectures for this).
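The cost-versus-quality point made in the list above can be put in numbers using only the reported indexing cost multipliers and nDCG@5 values. This is a rough sketch: the "gain per extra cost unit" ratio is an illustrative measure chosen here, not a metric from the paper, and the names are hypothetical.

```python
# Reported relative indexing cost and nDCG@5 for each pipeline.
pipelines = {
    "BM25": {"cost": 1.0, "ndcg5": 0.91},
    "Dense": {"cost": 1.5, "ndcg5": 0.83},
    "Hybrid": {"cost": 1.6, "ndcg5": 0.92},
}

baseline = pipelines["BM25"]


def gain_per_extra_cost(name: str) -> float:
    """nDCG@5 change relative to BM25, per 1.0x of additional indexing cost."""
    p = pipelines[name]
    extra_cost = p["cost"] - baseline["cost"]
    gain = p["ndcg5"] - baseline["ndcg5"]
    return gain / extra_cost


ratios = {name: gain_per_extra_cost(name) for name in ("Dense", "Hybrid")}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:+.3f} nDCG@5 per extra 1.0x indexing cost")
```

On these figures the hybrid buys under 0.02 nDCG@5 per additional 1.0× of indexing cost, and the dense pipeline's ratio is negative, which is the arithmetic behind treating BM25 as the cost-effective default in this pilot.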
Shortcomings to keep in mind when using these results for economic decisions:
- Results are from a pilot with pre-extracted text and a shared generator; effects may change when raw PDFs, OCR variance, and alternative generators are included.
- Domain and query-distribution in the pilot skew toward academic/technical queries; enterprise-specific ROI may differ.
- Many metrics are automated proxies; real-world cost benefits should be validated via human-in-the-loop time-motion studies and task completion rates.
Overall: EnterpriseDocBench shifts the evaluation focus from component-level benchmarks to end-to-end outcomes and surfaces several economically relevant trade-offs (cost multipliers, weak cascades, completeness shortfalls) that should influence procurement, engineering priorities, and ROI evaluations for enterprise document AI deployments.
Assessment
Claims (13)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Most enterprise document AI today is a pipeline: parse, index, retrieve, generate. [Adoption Rate] | positive | high | prevalence of pipeline architecture | 0.18 |
| We built EnterpriseDocBench to evaluate parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus. [Output Quality] | positive | high | system-level evaluation across parse/index/retrieve/generate stages | 0.18 |
| The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). [Research Productivity] | positive | high | dataset domain coverage | six domains (five in pilot); 0.18 |
| We ran three pipelines through it: BM25, dense embedding, and a hybrid, all using the same GPT-5 generator. [Research Productivity] | positive | high | evaluation of retrieval pipelines with a shared generator | 0.18 |
| Hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91). [Output Quality] | positive | high | retrieval relevance (nDCG@5) | nDCG@5 of 0.92 vs. 0.91; 0.18 |
| Both hybrid and BM25 beat dense embedding (dense embedding nDCG@5 = 0.83). [Output Quality] | positive | high | retrieval relevance (nDCG@5) | dense embedding nDCG@5 = 0.83 (hybrid 0.92, BM25 0.91 reported elsewhere); 0.18 |
| Hallucination rate does not grow monotonically with document length: short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). [Error Rate] | negative | high | hallucination rate (fraction of generated outputs judged hallucinated) | 28.1% and 23.8% vs. 9.2%; 0.18 |
| Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02. [Output Quality] | null_result | high | correlation between stage-level quality metrics | r=0.14, r=0.17, r=0.02; 0.18 |
| Factual accuracy on stated claims is 85.5%. [Output Quality] | positive | high | factual accuracy (fraction of stated claims judged factually correct) | 85.5%; 0.18 |
| Answer completeness averages 0.40. [Output Quality] | negative | high | answer completeness (average completeness score) | 0.40 average completeness; 0.18 |
| The system tends to be factually correct when it answers but often omits information (i.e., 'the system is right when it answers -- it just leaves things out'). [Output Quality] | mixed | high | factual accuracy vs. answer completeness | 0.18 |
| The paper describes three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. [Other] | positive | high | proposed system architectures (descriptive) | three architectures named; 0.09 |
| Framework, metrics, baselines, and collection scripts will be released open-source on acceptance. [Research Productivity] | positive | high | open-source release of materials | 0.09 |