The Commonplace

State-of-the-art LLMs silently damage documents in long delegated workflows. Even top models corrupt roughly a quarter of the content by the end; errors compound over time and worsen with larger documents or distractor files, and agentic tool use does not reliably prevent the decay.

LLMs Corrupt Your Documents When You Delegate
Philippe Laban, Tobias Schnabel, Jennifer Neville · April 17, 2026
Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10
In long simulated delegated workflows across 52 professional domains, current LLMs frequently and silently corrupt documents—frontier models corrupt around 25% of content on average by workflow end—and errors compound with longer interactions, larger documents, or distractors, while agentic tool use does not reliably mitigate degradation.

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust - the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.

Summary

Main Finding

Delegating multi-step document editing to current LLMs silently corrupts documents over time. In DELEGATE-52 (52 professional domains; 310 real work environments), a large-scale simulation with 19 LLMs shows that even frontier models (e.g., Gemini 3.1 Pro, Claude 4.6 Opus, GPT‑5.4) lose on average ~25% of document content after 20 delegated interactions; across all tested models average degradation is ~50%. Errors are typically sparse but severe and compound over long interaction chains.
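To get a feel for what these end-of-workflow figures imply per interaction, the sketch below assumes that degradation compounds at a constant rate. That is a simplification: the paper describes errors as sparse but severe rather than uniform. The function name `per_step_retention` is hypothetical and only used for this illustration.

```python
def per_step_retention(final_score: float, interactions: int) -> float:
    """Constant per-interaction retention r such that r ** interactions == final_score.

    Assumes uniform geometric decay, which is an illustrative simplification
    and not a claim about how the benchmark's models actually fail.
    """
    if not 0.0 < final_score <= 1.0:
        raise ValueError("final_score must be in (0, 1]")
    if interactions <= 0:
        raise ValueError("interactions must be positive")
    return final_score ** (1.0 / interactions)


# Figures from the summary: ~25% loss (frontier models) and ~50% loss
# (average over all tested models) after 20 delegated interactions.
frontier = per_step_retention(0.75, 20)
average = per_step_retention(0.50, 20)

print(f"frontier: {frontier:.4f} retained per interaction")
print(f"average:  {average:.4f} retained per interaction")
```

Under this assumption, a frontier model would need to lose only about 1.4% of the content per interaction to end up at 75% after 20 interactions, which is why single-step quality can look acceptable while the long-horizon result does not.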

Key Points

  • Benchmark: DELEGATE-52 — 52 domains across 5 categories (Science & Engineering, Code & Configuration, Creative & Media, Structured Records, Everyday), 310 work environments using real seed documents (2–5k tokens) plus distractor context (8–12k tokens).
  • Evaluation primitive: round‑trip backtranslation (forward edit σ, inverse edit σ⁻¹). A perfect model reconstructs the original document; the reconstruction score RS@k = sim(s, ŝ), the similarity between the original document s and its reconstruction ŝ after k interactions, measures degradation.
  • Domain-specific scoring: each domain has a parser and weighted similarity components (e.g., ingredients/steps for recipes) to detect semantic corruption that generic text-similarity or LLM judges miss.
  • Main experimental setup: 10 round-trips (20 single-turn interactions) per environment, with all environment files provided in context each turn.
  • Results:
    • All models degrade over interactions; frontier models average ≈25% content loss at 20 interactions; mean across models ≈50%.
    • Domain heterogeneity: programmatic domains (e.g., Python, databases) are relatively robust; Python is the only domain where most models (17/19) are “ready” (RS@20 ≥ 98%).
    • Short-horizon performance is not predictive of long-horizon reliability. Example: GPT‑5 and Kimi K2.5 are similar after 2 interactions but diverge substantially by 20.
  • Agent/tool experiments: a basic agentic harness with file read/write and code execution made performance worse on average (about 6% additional degradation), likely because tool mediation consumed more tokens and interactions (8–12 tool calls per task; 2–5× more input tokens).
  • Sensitivity: degradation increases with document size, interaction length, and presence of distractor files; these effects compound over time.
  • Robustness checks: authors validated that LLMs actually attempt the transformations (not taking trivial shortcuts) and that the domain-specific metrics are necessary to capture nuanced semantic corruption.
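The round-trip evaluation described in the key points can be sketched as follows. This is a toy illustration, not the benchmark's harness: `noisy_edit` is a hypothetical stand-in for an LLM call that applies the requested edit but occasionally drops a line, the upper-case/lower-case pair stands in for an invertible edit, and `similarity` is a generic line-level measure rather than a domain-specific one.

```python
import difflib
import random


def forward(lines):
    """Forward edit (sigma): upper-case every line."""
    return [ln.upper() for ln in lines]


def backward(lines):
    """Inverse edit (sigma^-1): lower-case every line (exact for lower-case input)."""
    return [ln.lower() for ln in lines]


def noisy_edit(lines, transform, rng, drop_prob):
    """Hypothetical model stand-in: apply the edit, then drop each line with drop_prob."""
    return [ln for ln in transform(lines) if rng.random() >= drop_prob]


def similarity(original, reconstructed):
    """Generic line-level similarity in [0, 1]; the paper uses domain-specific scoring."""
    matcher = difflib.SequenceMatcher(None, original, reconstructed, autojunk=False)
    return matcher.ratio()


def round_trip_scores(seed_doc, round_trips, drop_prob, seed=0):
    """Return {k: RS@k} for k = 2, 4, ..., 2 * round_trips single-turn interactions."""
    rng = random.Random(seed)
    current = list(seed_doc)
    scores = {}
    for i in range(1, round_trips + 1):
        current = noisy_edit(current, forward, rng, drop_prob)
        current = noisy_edit(current, backward, rng, drop_prob)
        scores[2 * i] = similarity(seed_doc, current)
    return scores


# Toy seed document with unique, lower-case lines.
seed_doc = [f"record {i}: value {i * i}" for i in range(40)]
for k, score in round_trip_scores(seed_doc, round_trips=10, drop_prob=0.02).items():
    print(f"RS@{k} = {score:.3f}")
```

With `drop_prob=0.0` every RS@k stays at 1.0. With any positive value the score can only fall as k grows, because a dropped line is never recovered in later interactions, which mirrors the compounding behavior the paper reports.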

Data & Methods

  • Dataset: 310 work environments built from real documents with permissive licenses; each environment contains a seed document, 5–10 paired forward/backward edit tasks (invertible edits), and distractor documents.
  • Round‑trip relay simulation:
    • Single-turn LLM sessions: forward edit → produce transformed document; then backward edit → reconstruct document. Repeat chained pairs to simulate long workflows.
    • RS@k computed after k single-turn interactions (k/2 round-trips).
  • Domain scoring: custom parsers convert raw files to structured representations; similarity functions compare components with calibrated weights to prioritize semantically important elements (e.g., numeric quantities, ordering when relevant).
  • Models evaluated: 19 LLMs spanning multiple families and scales (OpenAI GPT series including GPT‑5.x, Anthropic Claude 4.6 variants, Google Gemini 3.x, Mistral Large 3, xAI Grok 4, Moonshot Kimi K2.5, etc.). Agentic harness experiments used a simple file/tool API (not optimized).
  • Validation: quality assurance for parsing, evaluation sensitivity, edit difficulty; experiments showing generic similarity measures and LLM-as-judge metrics correlate poorly with domain-specific metrics.
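A minimal sketch of what domain-specific scoring could look like for the recipe example, assuming a toy text format in which ingredient lines start with "- " and step lines start with a number. The parser, the `Recipe` type, and the 0.6/0.4 weights are all hypothetical; the paper's actual parsers and calibrated weights are not reproduced here.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    ingredients: frozenset  # unordered set of ingredient strings
    steps: tuple            # ordered sequence of step strings


def parse_recipe(text: str) -> Recipe:
    """Toy parser: '- ' lines are ingredients, 'N. ' lines are steps."""
    ingredients, steps = set(), []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("- "):
            ingredients.add(line[2:].strip().lower())
        elif line[:1].isdigit() and ". " in line:
            steps.append(line.split(". ", 1)[1].strip().lower())
    return Recipe(frozenset(ingredients), tuple(steps))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def recipe_similarity(original: str, edited: str,
                      w_ingredients: float = 0.6, w_steps: float = 0.4) -> float:
    """Weighted similarity over parsed components (weights are assumed, not calibrated)."""
    ref, hyp = parse_recipe(original), parse_recipe(edited)
    ing = jaccard(ref.ingredients, hyp.ingredients)
    stp = difflib.SequenceMatcher(None, ref.steps, hyp.steps, autojunk=False).ratio()
    return w_ingredients * ing + w_steps * stp


ORIGINAL = """Pancakes
- 200 g flour
- 2 eggs
- 100 ml milk
1. Whisk the eggs and milk
2. Fold in the flour
3. Fry until golden
"""
# A single dropped digit changes the quantity tenfold.
CORRUPTED = ORIGINAL.replace("200 g flour", "20 g flour")

generic = difflib.SequenceMatcher(None, ORIGINAL, CORRUPTED, autojunk=False).ratio()
domain = recipe_similarity(ORIGINAL, CORRUPTED)
print(f"generic character similarity: {generic:.3f}")
print(f"domain-specific similarity:   {domain:.3f}")
```

The single-character corruption barely moves the generic character-level score, while the component-based score drops sharply because one of the three ingredients no longer matches. This is the kind of gap that motivates parsing documents into structured representations before comparing them.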

Implications for AI Economics

  • Adoption frictions and trust costs:
    • Delegation value depends on trust that AI will not silently corrupt assets. High error rates in long-horizon delegation raise transaction costs for adoption in many knowledge-work domains.
    • Organizations will likely incur increased monitoring, verification, and human-inspection costs, reducing net productivity gains from delegation.
  • Market demand shifts:
    • Increased demand for verification, auditing, and provenance tools (domain-specific validators, automated round‑trip checks, cryptographic integrity tools). These create new product markets and service lines (audit-as-a-service, AI-assisted diff/merge verification).
    • Growth in specialist vendors and consultancies providing domain parsers, robust toolchains, and SLAs for delegated editing in high-stakes domains (finance, scientific data, legal, engineering).
  • Pricing, contracting, and liability:
    • Buyers will price in reliability risk (discounts, holdbacks) and require contractual protections (warranties, SLAs, liability clauses). Insurance markets for AI-induced document corruption may expand.
    • Firms offering LLM-based delegation will face higher compliance and indemnity costs; pricing of premium, audited models/services may include costs for long-horizon testing and domain-specific validation.
  • Labor and comparative advantage:
    • In programmatic domains (e.g., coding), delegation appears closer to safe substitution; in many specialized or niche domains, human expertise remains necessary for oversight. This implies heterogeneous displacement/substitution effects across occupations.
    • Demand for roles that combine domain expertise and AI‑verification skills (AI auditors, prompt engineers with domain parsing know-how) will increase.
  • Investment and R&D priorities:
    • Economic incentives should favor investments in long-horizon reliability (benchmarks like DELEGATE-52), domain-aware parsers, and tooling that modifies files minimally (targeted edits rather than full regeneration).
    • Research and product development that reduces token overhead in tool-mediated workflows and that prevents compounding errors will be commercially valuable.
  • Regulation and standards:
    • Regulators and industry bodies may require long-horizon evaluation and domain-specific validation benchmarks for systems deployed in regulated sectors.
    • Standards for provenance, versioning, and machine-readable attestations of edits could become required controls.
  • Operational mitigation (affects cost structures):
    • Companies will likely adopt processes: mandatory version control snapshots, automatic round‑trip checks, human sign-off for critical edits, limited delegation horizons, and conservative use of agentic tooling.
    • These mitigations increase operational overhead and slow the pace of automated delegation rollout, particularly for high-value assets.
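As one illustration of the operational mitigations listed above, the sketch below snapshots a document before a delegated edit and flags the result for human sign-off when too much of it changed. The names `Snapshot`, `take_snapshot`, and `review_delegated_edit`, as well as the 10% threshold, are hypothetical; a real process would more likely build on an existing version control system and domain-specific checks.

```python
import difflib
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    sha256: str  # content hash, usable as an integrity check
    text: str    # full pre-edit content, kept for diffing and rollback


def take_snapshot(text: str) -> Snapshot:
    """Record the document state before handing it to a delegate."""
    return Snapshot(hashlib.sha256(text.encode("utf-8")).hexdigest(), text)


def review_delegated_edit(before: Snapshot, after_text: str,
                          max_change: float = 0.10) -> dict:
    """Compare an edited document with its snapshot; max_change is an assumed policy."""
    before_lines = before.text.splitlines()
    after_lines = after_text.splitlines()
    matcher = difflib.SequenceMatcher(None, before_lines, after_lines, autojunk=False)
    changed = 1.0 - matcher.ratio()
    after_hash = hashlib.sha256(after_text.encode("utf-8")).hexdigest()
    return {
        "unchanged": after_hash == before.sha256,
        "changed_fraction": changed,
        "needs_human_signoff": changed > max_change,
    }


lines = [f"entry {i}: amount {i * 10}" for i in range(20)]
snapshot = take_snapshot("\n".join(lines))

# A small targeted edit: one line changed out of twenty.
small_edit = lines.copy()
small_edit[7] = "entry 7: amount 75"

# A silent truncation: every fourth line dropped.
truncated = [ln for i, ln in enumerate(lines) if i % 4 != 0]

print(review_delegated_edit(snapshot, "\n".join(small_edit)))
print(review_delegated_edit(snapshot, "\n".join(truncated)))
```

A change-size threshold like this only catches large or unexpected modifications. It would not detect a small but semantically severe error, so it complements rather than replaces domain-specific validation and human review of critical edits.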

Summary takeaway: DELEGATE-52 shows that current LLMs are not yet reliable long‑horizon delegates across most professional domains. From an AI economics perspective, this implies substantial additional costs (monitoring, verification, liability), opportunities for new verification markets and services, uneven labor impacts across occupations, and a commercial premium for models and tools that demonstrably support long‑horizon, domain-specific correctness.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports a large-scale, systematic empirical benchmark across 19 LLMs and 52 professional domains showing consistent failure modes (document corruption), providing convincing descriptive evidence that current models struggle in long delegated workflows. However, it does not connect these failures to real-world economic outcomes (productivity, wages, firm performance), relies on simulated workflows rather than live user studies, and its corruption metrics and task framing may not capture all real-world use cases, so the evidence stops short of high-strength causal claims about economic impact.

Methods Rigor: medium. Strengths: broad model coverage (including frontier models), many professional domains (DELEGATE-52), systematic experiments varying document size, interaction length, distractors, and agentic tool use. Limitations: simulated interactions rather than field deployment or human-in-the-loop experiments, potential subjectivity or narrowness in how 'corruption' is defined and measured, possible sensitivity to prompt engineering and implementation details of agentic/tool setups, and limited transparency about labelling/validation protocols.

Sample: DELEGATE-52 benchmark: simulated long delegated editing workflows across 52 professional domains (e.g., coding, crystallography, music notation), evaluated on 19 large language models including frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4); experiments measure fraction of document content corrupted over multi-step interactions and test effects of document size, interaction length, distractor files, and agentic/tool-using pipelines.

Themes: productivity, human_ai_collab, adoption

Generalizability:
  • Simulated workflows may not match real-world human-AI delegation patterns and corrective behavior
  • 52 domains broad but not exhaustive of knowledge work; domain-specific results may not generalize
  • Results tied to specific model versions and system prompts; future model updates could change outcomes
  • Definition and measurement of 'document corruption' may not map directly to productivity or economic costs
  • Agent/tool implementations tested may differ from deployed agent systems in practice

Claims (7)

  • Claim: We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows; DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains (e.g., coding, crystallography, and music notation). [Other]
    • Direction: positive
    • Confidence: high
    • Outcome: benchmark scope / domain coverage
    • Details: n=52; 0.18
  • Claim: Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation. [Output Quality]
    • Direction: negative
    • Confidence: high
    • Outcome: document degradation / output quality
    • Details: n=19; 0.18
  • Claim: Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. [Error Rate]
    • Direction: negative
    • Confidence: high
    • Outcome: proportion of document content corrupted
    • Details: n=19; 25% of document content; 0.18
  • Claim: Other models fail more severely (i.e., worse than the frontier models mentioned). [Error Rate]
    • Direction: negative
    • Confidence: medium
    • Outcome: document corruption / output quality
    • Details: n=19; 0.11
  • Claim: Agentic tool use does not improve performance on DELEGATE-52. [Output Quality]
    • Direction: negative
    • Confidence: high
    • Outcome: task performance on DELEGATE-52 (document quality/corruption)
    • Details: 0.18
  • Claim: Degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. [Error Rate]
    • Direction: negative
    • Confidence: high
    • Outcome: severity of document degradation / error rate
    • Details: 0.18
  • Claim: Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction. [Error Rate]
    • Direction: negative
    • Confidence: high
    • Outcome: error severity and silent corruption over time
    • Details: n=19; 0.18

Notes