A new end-to-end benchmark for enterprise document-AI finds hybrid retrieval narrowly bests BM25 (nDCG@5 0.92 vs 0.91) and both outperform dense embedding (0.83), but parsing and retrieval quality weakly predict final answer correctness and systems often omit relevant facts despite high stated factual accuracy (85.5% vs completeness 0.40).

Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI

Saurabh K. Singh, Sachin Raj · April 29, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

EnterpriseDocBench is an end-to-end benchmark of enterprise document-AI pipelines showing hybrid retrieval slightly outperforms BM25 and both beat dense embeddings, hallucination rates vary non-monotonically with document length, and upstream stage quality only weakly predicts final generation fidelity while systems tend to omit information even when factual accuracy is high.

Most enterprise document AI today is a pipeline. Parse, index, retrieve, generate. Each of those stages has been studied to death on its own -- what's still hard is evaluating the system as a whole. We built EnterpriseDocBench to take a swing at it: parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness, all on the same corpus. The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot). We ran three pipelines through it -- BM25, dense embedding, and a hybrid -- all with the same GPT-5 generator. The headline numbers: hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91), and both beat dense embedding (0.83). Hallucination doesn't grow monotonically with document length -- short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%). Cross-stage correlations are very weak: parsing->retrieval r=0.14, parsing->generation r=0.17, retrieval->generation r=0.02. If quality were cascading the way most of us assume, those numbers would be much higher; they aren't. Design caveats are real (parsing fixed, generator shared, automated proxy metrics) and we don't oversell the result. One result that genuinely surprised us: factual accuracy on stated claims is 85.5%, but answer completeness averages 0.40. The system is right when it answers -- it just leaves things out. That gap matters more for real deployments than the headline accuracy number does. We also describe three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end. Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.

Summary

Main Finding

EnterpriseDocBench introduces a unified, four-axis evaluation framework for end-to-end enterprise document AI (parsing, indexing, retrieval, generation) and a permissively licensed pilot corpus. In the pilot runs (1,169-document test set, 500–800 query subsets), hybrid retrieval barely outperforms BM25 (nDCG@5 0.92 vs. 0.91), and both substantially outperform dense embeddings (0.83). Crucially, cross-stage quality correlations are very weak (parsing→retrieval r = 0.14; parsing→generation r = 0.17; retrieval→generation r = 0.02), and while factual accuracy on stated claims is high (85.5%), answer completeness is low (mean 0.40). The authors emphasize that these are suggestive results with important design caveats (fixed parsing, shared generator, automated proxies).

Key Points

Framework: Four evaluation axes with formal metrics and validated proxies:
- Parsing fidelity (Pfidelity = 0.40·TIS + 0.30·TEA + 0.15·FCQ + 0.15·LF).
- Indexing efficiency (throughput, build latency, storage bytes/page, $/page/year cost model).
- Retrieval relevance (Precision@k, nDCG@k, MRR; query modalities specified).
- Generation groundedness (G = 0.30·FA + 0.25·SAP + 0.15·SAR + 0.20·(1−HR) + 0.10·AC).
Pilot corpus and scale:
- 1,459 documents in pilot; 1,169-document test split used for main analyses.
- Domains: General (47%), Healthcare (22%), Tech (17%), Finance (8%), Legal (3%), Manufacturing (very small, n=4).
- 800 examples used for generation assessment; 500 queries used for pipeline comparison.
Implemented pipelines (shared GPT-5 generator; parsing pre-extracted):
- BM25 baseline (indexing cost multiplier 1.0×).
- Dense embeddings (E5-large; 1.5× cost).
- Hybrid fusion (0.5/0.5 interpolation; 1.6× cost).
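A minimal sketch of what a 0.5/0.5 score interpolation could look like. The summary does not say how BM25 and dense scores are put on a common scale before mixing, so the min-max normalization here is an assumption, and the function names, document ids, and scores are hypothetical.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span > 0 else 0.0 for doc, s in scores.items()}


def hybrid_scores(bm25: Dict[str, float], dense: Dict[str, float],
                  alpha: float = 0.5) -> Dict[str, float]:
    """Interpolate normalized BM25 and dense scores; alpha weights BM25.

    A document missing from one retriever's list contributes 0.0 for that
    retriever (an assumption, not something the source specifies).
    """
    b, d = min_max(bm25), min_max(dense)
    return {doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in set(b) | set(d)}


# Made-up scores for illustration only.
bm25 = {"doc_a": 12.0, "doc_b": 7.0, "doc_c": 2.0}
dense = {"doc_a": 0.70, "doc_b": 0.80, "doc_d": 0.40}

fused = hybrid_scores(bm25, dense, alpha=0.5)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking[:2])
```

Other fusion schemes (for example, rank-based fusion) would give different orderings; the sketch only illustrates the interpolation named in the summary.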
Retrieval results (500 queries):
- nDCG@5: Hybrid 0.92, BM25 0.91, Dense 0.83.
- P@3 range: 0.31–0.34.
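For readers less familiar with the retrieval metrics reported above, here is a small sketch of Precision@k, nDCG@k, and reciprocal rank using their standard textbook definitions. It is not the benchmark's own implementation. One simplifying assumption: the ideal ranking for nDCG is built from the labels of the retrieved list itself, which is only exact when every relevant document appears in that list. The relevance labels are made up.

```python
import math
from typing import Sequence


def precision_at_k(rels: Sequence[float], k: int) -> float:
    """Fraction of the top-k results with relevance > 0."""
    return sum(1 for r in rels[:k] if r > 0) / k


def dcg_at_k(rels: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(rels: Sequence[float]) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical relevance labels for one query, in ranked order.
rels = [0, 1, 0, 0, 1]
print(precision_at_k(rels, 3), ndcg_at_k(rels, 5), reciprocal_rank(rels))
```

MRR is the mean of `reciprocal_rank` over all queries.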
Generation results (800 examples, GPT-5):
- Factual accuracy FA = 0.85 (95% CI 0.84–0.87).
- Hallucination rate HR = 0.15.
- Source attribution precision/recall SAP/SAR ≈ 0.61 (varies by subset).
- Answer completeness AC = 0.40 (mean).
- Aggregate G = 0.71.
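As a sanity check, the aggregate can be recomputed from the component scores with the groundedness formula G = 0.30·FA + 0.25·SAP + 0.15·SAR + 0.20·(1−HR) + 0.10·AC. The sketch below assumes SAP = SAR = 0.61, since the summary reports only an approximate shared value that varies by subset.

```python
def groundedness(fa: float, sap: float, sar: float, hr: float, ac: float) -> float:
    """Weighted groundedness composite, with weights as given in the summary."""
    return 0.30 * fa + 0.25 * sap + 0.15 * sar + 0.20 * (1 - hr) + 0.10 * ac


# Component scores from the generation results above.
# SAP and SAR are both set to 0.61, which is an assumption.
g = groundedness(fa=0.85, sap=0.61, sar=0.61, hr=0.15, ac=0.40)
print(f"G = {g:.3f}")  # prints "G = 0.709"
```

The result, 0.709, matches the reported 0.71 after rounding. It also shows why low completeness barely moves the headline number: AC carries only 10% of the weight, so AC = 0.40 costs 0.06 points of G relative to perfect completeness.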
Surprising empirical patterns:
- Very weak cross-stage correlations (parsing and retrieval explain <3% of downstream variance).
- Hallucination vs. context length shows a U-shaped pattern: short and very long contexts hallucinate more than medium-length ones (directional, not decisive, because n is small at the extremes).
- Systems tend to be factually correct on claims they make but omit relevant content (high FA vs. low AC).
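The "<3% of downstream variance" figure follows from squaring the reported correlations. The sketch below assumes they are Pearson coefficients, so that r² can be read as the share of variance explained by a simple linear relationship.

```python
# Cross-stage correlations as reported in the summary.
correlations = {
    "parsing->retrieval": 0.14,
    "parsing->generation": 0.17,
    "retrieval->generation": 0.02,
}

# r^2 is the fraction of variance explained (assuming Pearson r).
explained = {pair: r ** 2 for pair, r in correlations.items()}

for pair, r2 in explained.items():
    print(f"{pair}: r^2 = {r2:.4f} ({r2:.2%} of variance)")
```

The largest value is 0.17² = 0.0289, just under 3%; retrieval→generation explains only 0.04%.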
Known limitations:
- Parsing was held fixed (pre-extracted text), one generator was shared across all pipelines, and many metrics are automated proxies for human judgment.
- Domain imbalance and query-distribution mismatch with real enterprise workflows.
- Pilot not fully representative of proprietary enterprise documents.

Data & Methods

Dataset construction:
- Public, permissively licensed documents (SEC EDGAR, CUAD, PubMed/PMC, arXiv, patents, etc.).
- Semi-automated QA generation (LLM proposals + two human annotators; disagreements dropped).
- Stratified sampling to reduce domain dominance; inter-annotator agreement (IAA) target of κ ≥ 0.85 for QA generation checks.
- Splits: train/val/test = 10/10/80 (no models trained on data).
Metric validation:
- Parsing sub-metrics validated against public benchmarks (TIS vs. TEDS r = 0.89; TEA on TableBank 0.91 native, 0.68 scanned; LF vs. DocBank r = 0.82).
- Generation entailment pipeline validated (cross-encoder accuracy ~0.85; human entailment agreement ~0.79).
- Hallucination annotation: two raters on 400 (answer, context) pairs, Cohen’s κ = 0.73.
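Cohen's κ, used above for the hallucination annotation, corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard formula follows; the rater labels are made-up toy data, not the paper's annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label]
              for label in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy example: 1 = hallucinated, 0 = grounded (hypothetical labels).
rater_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
rater_b = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")
```

In the toy data the raters agree on 8 of 10 items (p_o = 0.80) while chance agreement is 0.52, so κ ≈ 0.58. Raw agreement overstates reliability whenever one label dominates.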
Evaluation setup:
- Three pipelines executed with same GPT-5 generator; parsing fidelity computed against pre-extracted text (Pfidelity = 0.82 constant across pipelines).
- Retrieval-focused Quality composite used for pipeline ranking: Quality = 0.25·Parsing + 0.50·Retrieval + 0.25·Generation (reflects retrieval being the only varied axis).
- Cost model uses April 2026 pricing; reported relative indexing costs: BM25 1.0×, Dense 1.5×, Hybrid 1.6×.
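To see how the Quality composite behaves when only retrieval varies, the sketch below plugs in numbers from this summary. Two assumptions are made that the source does not confirm: nDCG@5 is used as the Retrieval term, and the aggregate G = 0.71 is used as the Generation term for every pipeline.

```python
def quality(parsing: float, retrieval: float, generation: float) -> float:
    """Quality = 0.25*Parsing + 0.50*Retrieval + 0.25*Generation."""
    return 0.25 * parsing + 0.50 * retrieval + 0.25 * generation


P_FIDELITY = 0.82   # constant across pipelines, per the summary
GENERATION = 0.71   # assumption: same aggregate G for every pipeline
NDCG_AT_5 = {"BM25": 0.91, "Dense": 0.83, "Hybrid": 0.92}  # assumed Retrieval term
COST = {"BM25": 1.0, "Dense": 1.5, "Hybrid": 1.6}          # relative indexing cost

scores = {name: quality(P_FIDELITY, r, GENERATION) for name, r in NDCG_AT_5.items()}

for name, q in scores.items():
    print(f"{name}: Quality = {q:.4f} at {COST[name]}x indexing cost")
```

Under these assumptions BM25 comes out near 0.8375 and the hybrid near 0.8425, a gap of about 0.005 for 0.6× extra indexing cost. Because Parsing and Generation are held constant, the composite can only move by half of any nDCG difference.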
Reproducibility:
- Framework, metrics, baselines, and collection scripts are to be open-sourced on acceptance; results come from a pilot release, and the authors describe plans to scale it up.

Implications for AI Economics

Cost vs. quality trade-offs are concrete and modestly quantified:
- Simple keyword retrieval (BM25) remains highly cost-effective — near parity with the hybrid in end-to-end Quality (0.84) at 1.0× indexing cost.
- Dense embedding pipelines are costlier (1.5×) and underperform on retrieval relevance in this pilot; hybrids add cost (1.6×) but only modestly improve retrieval over BM25.
- Economic takeaway: investment in dense or hybrid retrieval must be justified by downstream gains beyond nDCG (e.g., re-ranking for multi-modal queries, latency, or specific domain gains).
Weak inter-stage correlations change investment priorities:
- If parsing improvements explain <3% of downstream variance in similar settings, large capital expenditure on marginal parser upgrades may yield small end-user gains unless downstream components or generator behavior are changed simultaneously.
- The near-zero retrieval→generation correlation suggests marginal improvements in ranking metrics (nDCG) do not necessarily translate to better final answers; economic ROI should be evaluated on end-to-end business KPIs (task completion, human review time saved), not intermediate IR metrics alone.
Completeness gap increases operational risk and human supervision costs:
- High factual accuracy but low completeness implies systems frequently omit relevant information, shifting workload to human reviewers to triage/complete answers. This raises recurring labor costs and compliance risk in regulated domains (legal, finance, healthcare).
- Procurement and SLA design should include completeness and coverage metrics (not just correctness or hallucination rates).
Hallucination behavior and context length affect product design and costs:
- The U-shaped hallucination vs. context-length finding suggests there are sweet spots for context assembly; providing too little or too much context may increase hallucination and downstream verification costs.
- Engineering effort to chunk context intelligently or to build length-aware retrieval/generation logic could be cost-effective.
Evaluation design affects perceived value:
- Because the study held the parser and the generator fixed, buyers and engineering teams should require end-to-end benchmarks that vary the components they plan to change; otherwise investment decisions could be misinformed.
Practical procurement and staffing recommendations:
- Insist on end-to-end task-based benchmarks in procurement (including completeness and human review time).
- Prioritize investments that demonstrably reduce human-in-the-loop costs (completeness, extractive coverage, reliable attribution) rather than optimizing single-axis metrics.
- Consider hybrid operational architectures (agentic routing, selective heavier pipelines on complex queries) to balance cost and service-level targets — authors propose such reference architectures but haven’t yet measured them end-to-end.
Regulatory and liability considerations:
- Answers that are factually accurate but incomplete create a distinct liability risk (e.g., omitted contract clauses). Economic models for deployment should include expected remediation/inspection costs and potential regulatory penalties.
Research & engineering investments with economic upside:
- Integrate alternative parsers and generator variants in end-to-end testing — likely to change correlation patterns and hence ROI calculations.
- Invest in metrics and tooling that measure completeness and false omissions, not just hallucination or per-claim accuracy.
- Build selective routing/agentic systems to apply costlier resources only where expected marginal benefit is high (of the paper's three reference architectures, agentic complexity-based routing targets this; none is integrated end-to-end yet).

Shortcomings to keep in mind when using these results for economic decisions:
- Results are from a pilot with pre-extracted text and a shared generator; effects may change when raw PDFs, OCR variance, and alternative generators are included.
- Domain and query-distribution in the pilot skew toward academic/technical queries; enterprise-specific ROI may differ.
- Many metrics are automated proxies; real-world cost benefits should be validated via human-in-the-loop time-motion studies and task completion rates.

Overall: EnterpriseDocBench shifts the evaluation focus from component-level benchmarks to end-to-end outcomes and surfaces several economically relevant trade-offs (cost multipliers, weak cascades, completeness shortfalls) that should influence procurement, engineering priorities, and ROI evaluations for enterprise document AI deployments.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents systematic empirical evaluation across multiple pipeline stages and retrieval methods on a purpose-built corpus, with quantitative metrics (nDCG, hallucination rates, factual accuracy, completeness) and clear baselines; however, it does not establish causal claims, relies on a pilot corpus of permissively licensed documents, uses automated proxy metrics and a single shared generator (GPT-5), and has design caveats that limit external validity.
Methods Rigor: medium. Methodologically sound benchmarking practices are used (multi-stage metrics, multiple retrieval baselines, reporting of correlations and error rates), but rigor is reduced by fixed parsing, a single generator across pipelines, reliance on automated rather than extensive human evaluation, limited description of dataset size/selection in the summary, and lack of end-to-end integration or robustness checks across alternative generators and parsing strategies.
Sample: A pilot corpus built from public, permissively licensed documents spanning six enterprise domains (five domains represented in the pilot). Evaluations run on three retrieval pipelines (BM25, dense embedding, hybrid) with a shared GPT-5 generator; metrics reported include parsing fidelity, indexing efficiency, retrieval relevance (nDCG@5), generation groundedness (hallucination, factual accuracy on stated claims, and answer completeness). Exact dataset sizes and full domain breakdown are not specified in the summary.
Themes: productivity, adoption
Generalizability:
- Pilot covers only five active domains; may not represent the full diversity of enterprise document types.
- Corpus uses permissively licensed public documents, which differ from proprietary internal enterprise data (format, style, sensitivity).
- Parsing stage was fixed in experiments, so results may change with different parsers or document preprocessing.
- Single generator (GPT-5) used across evaluations; findings may not generalize to other LLMs or settings.
- Reliance on automated proxy metrics (vs. extensive human evaluation) limits external validity for real-world user outcomes.
- Only three retrieval architectures tested; other retrieval or reranking methods may perform differently.
- Not integrated end-to-end with proposed reference architectures, so deployment behaviors (latency, cost, agentic routing) are untested.

Claims (13)

Claim	Category	Direction	Confidence	Outcome	Details
Most enterprise document AI today is a pipeline: parse, index, retrieve, generate.	Adoption Rate	positive	high	prevalence of pipeline architecture	0.18
We built EnterpriseDocBench to evaluate parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus.	Output Quality	positive	high	system-level evaluation across parse/index/retrieve/generate stages	0.18
The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot).	Research Productivity	positive	high	dataset domain coverage	six domains (five in pilot) 0.18
We ran three pipelines through it: BM25, dense embedding, and a hybrid, all using the same GPT-5 generator.	Research Productivity	positive	high	evaluation of retrieval pipelines with a shared generator	0.18
Hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91).	Output Quality	positive	high	retrieval relevance (nDCG@5)	nDCG@5 of 0.92 vs. 0.91 0.18
Both hybrid and BM25 beat dense embedding (dense embedding nDCG@5 = 0.83).	Output Quality	positive	high	retrieval relevance (nDCG@5)	dense embedding nDCG@5 = 0.83 (hybrid 0.92, BM25 0.91 reported elsewhere) 0.18
Hallucination rate does not grow monotonically with document length: short documents and very long ones both hallucinate more than medium ones (28.1% and 23.8% vs. 9.2%).	Error Rate	negative	high	hallucination rate (fraction of generated outputs judged hallucinated)	28.1% and 23.8% vs. 9.2% 0.18
Cross-stage correlations are very weak: parsing->retrieval r = 0.14, parsing->generation r = 0.17, retrieval->generation r = 0.02.	Output Quality	null_result	high	correlation between stage-level quality metrics	r=0.14, r=0.17, r=0.02 0.18
Factual accuracy on stated claims is 85.5%.	Output Quality	positive	high	factual accuracy (fraction of stated claims judged factually correct)	85.5% 0.18
Answer completeness averages 0.40.	Output Quality	negative	high	answer completeness (average completeness score)	0.40 average completeness 0.18
The system tends to be factually correct when it answers but often omits information (i.e., 'the system is right when it answers -- it just leaves things out').	Output Quality	mixed	high	factual accuracy vs. answer completeness	0.18
The paper describes three reference architectures (ColPali, ColQwen2, agentic complexity-based routing) which are not yet integrated end-to-end.	Other	positive	high	proposed system architectures (descriptive)	three architectures named 0.09
Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.	Research Productivity	positive	high	open-source release of materials	0.09