Improved LLMs and a systematic audit reveal ELT automation is far further along than ELT-Bench originally suggested: extraction and loading are largely solved and many supposed transformation failures reflected benchmark mistakes rather than model limitations.

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz · March 31, 2026

arXiv · descriptive · medium evidence · relevance 7/10

Re-evaluating ELT-Bench with stronger LLMs and a systematic Auditor-Corrector audit shows extraction/loading is largely solved and many reported transformation failures were caused by benchmark errors, so corrected evaluation substantially raises measured agent performance.

Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.

Summary

Main Finding

Upgrading only the underlying LLM and fixing benchmark quality substantially raise measured AI agent performance on ELT pipeline tasks. Extraction/loading moved from 37% to 96% success by switching Claude Sonnet 3.5 → 4.5. Transformation success rose from 1% → 22.66% with the model upgrade and further to 32.51% after correcting pervasive benchmark errors (ELT-Bench → ELT-Bench-Verified). A systematic Auditor–Corrector audit shows many transformation "failures" were attributable to the benchmark itself, not agent shortcomings.
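
As a quick arithmetic check on the transformation figures above, the sketch below computes the absolute and relative gain from benchmark correction. The name `relative_gain` is illustrative, and the implied data-model counts are an assumption: they are back-calculated from the reported percentages and the benchmark's 203 target data models, not quoted from the paper.

```python
# Reported SRDT values (percent) for the same agent and model.
SRDT_ORIGINAL = 22.66    # measured on ELT-Bench
SRDT_VERIFIED = 32.51    # measured on ELT-Bench-Verified
TOTAL_DATA_MODELS = 203  # target data models in the benchmark


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Relative improvement of after_pct over before_pct, in percent."""
    return (after_pct - before_pct) / before_pct * 100


absolute_gain = SRDT_VERIFIED - SRDT_ORIGINAL
relative = relative_gain(SRDT_ORIGINAL, SRDT_VERIFIED)

# Assumption: counts back-calculated from the percentages, not reported in the source.
implied_before = round(SRDT_ORIGINAL / 100 * TOTAL_DATA_MODELS)
implied_after = round(SRDT_VERIFIED / 100 * TOTAL_DATA_MODELS)

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative:.1f}%")
print(f"implied passing data models: {implied_before} -> {implied_after}")
```

The distinction matters when reading the headline numbers: a gain of just under ten percentage points corresponds to a relative improvement of roughly 43.5%.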

Key Points

Benchmark: ELT-Bench evaluates end-to-end ELT pipeline construction (100 tasks, 203 target data models). Metrics: SRDEL (extract/load success) and SRDT (transform success; column-level matching).
Model upgrade effect (same agent framework, SWE-Agent):
- SRDEL: 37% → 96% (Claude Sonnet 3.5 → 4.5).
- SRDT: 1% → 22.66% with model upgrade.
Auditor–Corrector audit:
- Scope: 81 tasks that produced transformation outputs but failed; 660 unmatched columns across 136 data models.
- LLM-driven root-cause analysis (Claude Opus 4.5) produced corrected SQL & evidence for each unmatched column; human verification followed.
- Human validation: three annotators; Fleiss’ κ = 0.85 for high-level attribution, κ = 0.755 for the full 14-category taxonomy.
- Findings: 82.7% of the 81 failed tasks contained at least one benchmark-attributable error; at the column level, 33.0% of mismatches were attributable to benchmark problems rather than agent errors.
- Error types included overly rigid evaluation scripts, ambiguous specifications, and incorrect ground-truth calculations.
Corrector outcome: ELT-Bench-Verified (refined evaluation logic, removal of unreliable ground-truth columns) raised SRDT from 22.66% → 32.51% for the same agent+model (a 43.5% relative improvement attributable to benchmark correction).
Practical costs: running SWE-Agent across 100 tasks took ~2 days 7 hours and ~$343 in API fees; the baseline (ReAct) run took ~18 hours and $293. Both figures highlight nontrivial evaluation costs.
Resource: ELT-Bench-Verified has been released to the community.
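
For readers unfamiliar with the agreement statistic cited above, here is a minimal sketch of how Fleiss' kappa is computed for a fixed number of annotators per item. The function name and the toy labels are hypothetical; they are not the paper's annotation data or code.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list with one label per annotator."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(cnt[cat] for cnt in counts) / (n_items * n_raters) for cat in categories]
    p_chance = sum(s * s for s in shares)

    if p_chance == 1.0:  # only one category was ever used
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical attribution labels from three annotators on five mismatched columns.
toy_labels = [
    ["benchmark", "benchmark", "benchmark"],
    ["agent", "agent", "agent"],
    ["agent", "agent", "benchmark"],
    ["benchmark", "benchmark", "benchmark"],
    ["agent", "agent", "agent"],
]
kappa = fleiss_kappa(toy_labels)
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.85 for the high-level attribution indicates that the three annotators rarely disagreed on whether an error belonged to the agent or the benchmark.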

Data & Methods

Benchmark data: ELT-Bench (100 ELT tasks; starter codebase, source schemas, target data_model.yaml, ground-truth CSVs).
Agent/framework: SWE-Agent used to construct Terraform/Airbyte config, run syncs into Snowflake, author dbt SQL models, and execute dbt runs.
Models evaluated: Claude Sonnet 3.5 (original), Claude Sonnet 4.5 (upgrade); Auditor used Claude Opus 4.5 for automated root-cause analysis.
Auditor pipeline:
- Phase 1: build per-task analysis environment with task spec, gt/.csv, predicted/.sql & .csv, source data.
- Phase 2: spawn one LLM agent per task; the agent inspects artifacts, hypothesizes a root cause, derives corrected SQL, and executes it against source data to verify exact matches. This phase produced 660 analysis reports covering all unmatched columns (630/660 had self-validated 100% row-level matches).
- Phase 3: manual inspection & categorization of each report into a 14-category taxonomy along two axes: attribution (agent vs benchmark) and mitigability. Independent annotator study to confirm reproducibility.
Corrector pipeline: translate categorizations into benchmark fixes (patch eval scripts, remove unreliable GT columns) without adding biases; produce ELT-Bench-Verified.
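
The summary above does not describe the evaluation scripts in detail, so the following is only a hypothetical illustration of how a rigid column comparison can reject a correct output and how a patched comparison might avoid that. All function names, the rounding precision, and the sample values are assumptions, not the benchmark's actual logic.

```python
from collections import Counter


def normalize(value, ndigits=6):
    """Map a cell to a comparable form: rounded float if numeric, else trimmed text."""
    if value is None:
        return None
    try:
        return round(float(value), ndigits)
    except (TypeError, ValueError):
        return str(value).strip()


def strict_match(expected, actual):
    """Rigid check: same row order and identical string representation."""
    return [str(v) for v in expected] == [str(v) for v in actual]


def tolerant_match(expected, actual):
    """Lenient check: ignores row order, numeric formatting, and tiny float error."""
    return Counter(map(normalize, expected)) == Counter(map(normalize, actual))


# Hypothetical column: ground truth read from a CSV, prediction returned by a query.
ground_truth = ["0.3", "2", "1.50"]
predicted = [1.5, 0.1 + 0.2, 2.0]

print("strict:  ", strict_match(ground_truth, predicted))
print("tolerant:", tolerant_match(ground_truth, predicted))
```

In this toy case the two columns hold the same values, yet the strict check fails on ordering and formatting alone. Whether a given relaxation is appropriate depends on the task specification, which is why the audit pairs automated analysis with human review.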

Implications for AI Economics

Measurement and valuation bias: Benchmarks drive perceptions of AI progress. Systematic benchmark errors can dramatically understate capabilities, leading to underinvestment in automation, conservative adoption decisions, and mispriced labor substitution risks in data engineering.
Rapid capability growth vs. stale benchmarks: The large performance jump from a single model upgrade (SRDEL 37→96%) shows that benchmark snapshots can quickly become obsolete. Economic forecasts and business strategies relying on benchmark-based capability estimates should account for rapid model improvements.
Incentives and research focus: Benchmark-driven incentives (paper evaluations, leaderboards, procurement specs) can misallocate effort if evaluations are noisy or biased. High-quality, audited benchmarks redirect research and product efforts toward genuine gaps rather than artifacts of evaluation.
Cost of rigorous evaluation: The Auditor–Corrector approach adds overhead (LLM compute + human verification). Economically, this implies higher measurement costs for reliable assessment, which stakeholders (funders, labs, standard-setters) must internalize to avoid systematic mismeasurement.
Policy & procurement: Procurement specifications, regulatory assessments, and workforce planning that use benchmarked capabilities should require audited benchmarks or uncertainty margins. Overly pessimistic benchmarks may delay useful automation; overly optimistic, un-audited ones can cause premature replacement decisions.
Market signaling & adoption: The finding that extraction/loading is "largely solved" (for covered sources/configs) suggests near-term commercially viable automation for parts of data engineering workflows. Firms and labor markets should prepare for heterogeneous automation: routine EL tasks are more automatable than complex transformations sensitive to ambiguous specs.
Standardization and governance: The study argues for standard auditing practice for multi-step, execution-based benchmarks. From an economic standpoint, industry-wide benchmark governance (auditing protocols, reproducible evaluation code, and ground-truth curation) reduces information asymmetries and improves capital allocation decisions.
Broader systemic concern: Similar annotation and evaluation errors appear in related benchmarks (e.g., text-to-SQL). This indicates systemic risk of mismeasurement across data engineering and ML tasks — affecting many downstream economic decisions.

Takeaway: Reliable measurement matters. For stakeholders making investment, hiring, or automation decisions, audited benchmarks like ELT-Bench-Verified provide a more accurate, and often more optimistic, picture of AI capabilities in data engineering — but accurate measurement requires deliberate audit effort and ongoing maintenance as models rapidly improve.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides systematic re-evaluation of a single benchmark using upgraded LLMs and a structured Auditor-Corrector audit with strong inter-annotator agreement (Fleiss' kappa = 0.85), showing that many prior failures were due to benchmark errors; however, evidence is limited to one benchmark and model family, lacks external validation on real-world production pipelines or measured economic outcomes, and may not fully rule out remaining benchmark or evaluation biases.
Methods Rigor: high. Authors combine automated LLM-driven root-cause analysis with human validation, report a high inter-annotator reliability, produce corrected evaluation logic and ground truth, and re-run evaluations to quantify the impact of corrections, which makes for a systematic and reproducible approach; remaining limitations include potential selection bias in audited tasks and dependency on annotator judgments for corrections.
Sample: ELT-Bench, a benchmark of end-to-end ELT (extract-load-transform) pipeline construction tasks spanning extraction, loading, and transformation subtasks; re-evaluated using upgraded large language models and audited via the proposed Auditor-Corrector workflow with human validators (reported Fleiss' kappa = 0.85).
Themes: productivity, adoption
Generalizability:
- Results apply to the ELT-Bench dataset and similar synthetic/benchmarked ELT tasks, not necessarily to all real-world data engineering pipelines.
- Findings depend on the specific LLM versions, prompts and toolchains used; other models or deployments may perform differently.
- Corrected benchmark logic may not generalize to other benchmarks or task formulations; similar audit effort may be required elsewhere.
- Paper does not measure downstream economic outcomes (productivity gains, time saved, labor displacement, wages) in field settings.
- Auditor-Corrector relies on human validation; scalability and cost of widespread auditing are uncertain.

Claims (10)

Claim	Theme	Direction	Confidence	Outcome	Details
Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation.	Automation Exposure	positive	high	labor intensity and suitability for AI automation (qualitative claim)	0.09
On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility.	Developer Productivity	negative	high	agent success rate on ELT-Bench (agent capability / practical utility)	0.18
Re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly.	Developer Productivity	positive	high	performance on extraction/loading and transformation stages of ELT pipeline construction	0.18
We develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality.	Other	positive	high	benchmark audit reliability as measured by inter-annotator agreement (Fleiss' kappa)	Fleiss' kappa = 0.85 0.18
Applying the Auditor-Corrector methodology to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs.	Other	negative	high	proportion of failed transformation tasks attributable to benchmark errors (qualitative claim)	0.18
Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth.	Other	positive	high	existence of a revised benchmark with corrected evaluation and ground truth	0.09
Re-evaluating on ELT-Bench-Verified yields significant improvement attributable entirely to benchmark correction.	Other	positive	high	change in agent performance after benchmark correction	0.18
Both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities.	Other	mixed	high	factors contributing to underestimation of agent capabilities (model improvement vs. benchmark quality)	0.18
Our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation.	Other	negative	medium	presence of systemic annotation/benchmark quality issues across data engineering evaluation benchmarks	0.05
Systematic quality auditing should be standard practice for complex agentic tasks.	Governance And Regulation	positive	high	recommendation for adoption of systematic quality auditing (policy/practice proposal)	0.03