The Commonplace

Compressing developer logs to save token costs backfires for agentic LLMs: a lab experiment found 17% fewer input tokens but a 67% increase in total session cost as compressed formats shifted work into costly model reasoning; preserving semantically dense tokens or using tool-assisted compression avoids the penalty.

Beyond Human-Readable: Rethinking Software Engineering Conventions for the Agentic Development Era
Dmytro Ustynov · April 08, 2026
arxiv · quasi_experimental · medium evidence · 7/10 relevance
A controlled experiment shows that aggressively compressing log formats to save input tokens for agentic LLMs can paradoxically raise total session costs by 67% because it shifts interpretive work into the model's expensive reasoning phase.

For six decades, software engineering principles have been optimized for a single consumer: the human developer. The rise of agentic AI development, where LLM-based agents autonomously read, write, navigate, and debug codebases, introduces a new primary consumer with fundamentally different constraints. This paper presents a systematic analysis of human-centric conventions under agentic pressure and proposes a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value. We validate this principle through a controlled experiment on log format token economy across four conditions (human-readable, structured, compressed, and tool-assisted compressed), demonstrating a counterintuitive finding: aggressive compression increased total session cost by 67% despite reducing input tokens by 17%, because it shifted interpretive burden to the model's reasoning phase. We extend this principle to propose the rehabilitation of classical anti-patterns, introduce the program skeleton concept for agentic code navigation, and argue for a fundamental decoupling of semantic intent from human-readable representation.

Summary

Main Finding

Agentic software consumers (LLM-based agents) change the optimal design trade-offs: instead of minimizing raw token counts, projects should optimize semantic density — maximize the ratio of task-relevant information to total tokens. Aggressive token compression that removes meaningful content can increase total task cost (input + model reasoning + tool calls). A controlled log-format experiment shows a counterintuitive "compression paradox": a 17.1% reduction in input tokens produced a 67.2% increase in total session tokens because the model spent many more tokens in reasoning/decoding.
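
The two headline percentages can be reproduced from the token counts listed under Data & Methods. The sketch below is only that arithmetic: the helper name `pct_change` is ours, and the session figures are the rounded values (in thousands) given in this summary, so the results are approximate.

```python
def pct_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

# File-level input tokens (cl100k_base) as reported for formats A, B, C.
file_tokens = {"A": 8072, "B": 7106, "C": 6695}
# Total session tokens in thousands, as reported (rounded).
session_k = {"A": 18.9, "B": 24.0, "C": 31.6, "D": 28.3}

input_change = {k: pct_change(file_tokens["A"], v) for k, v in file_tokens.items()}
session_change = {k: pct_change(session_k["A"], v) for k, v in session_k.items()}

for fmt in ("B", "C", "D"):
    # No file-level count is listed for D, so its input change is left out.
    inp = f"{input_change[fmt]:+.1f}%" if fmt in input_change else "n/a"
    print(f"{fmt}: input {inp}, session {session_change[fmt]:+.1f}%")
# B: input -12.0%, session +27.0%
# C: input -17.1%, session +67.2%
# D: input n/a, session +49.7%
```

Format C is the "compression paradox" row: the smallest file produces the largest session.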

Key Points

  • Semantic density principle: keep high-information tokens (descriptive names, type annotations, docstrings, diagnostic messages) and eliminate zero-information tokens (ceremonial boilerplate, redundant scaffolding). Compressing high-information tokens is often counterproductive.
  • Taxonomy of conventions under agentic pressure:
    • File splitting: agents pay per file/tool-call; consolidate by deployment/test boundaries rather than human working-memory limits.
    • Naming: richer, descriptive names increase value for agents (they act like compressed documentation).
    • Abstraction/ceremony: deep hierarchies and heavy framework ceremony create many zero-information tokens and extra tool calls.
    • Anti-patterns: some classical anti-patterns (e.g., larger files, God objects) may have a changed cost/benefit trade-off for agents, but risks (attention degradation, distributional mismatch) remain.
    • SOLID and related principles: some weaken (SRP, DRY), while others (testability, DIP with revised mechanisms) retain value.
    • Logging & commits: verbose, semantically rich logs and commit messages reduce agent effort.
  • Program skeleton (suggested filename CODEMAP.md): a lossy, committed artifact containing module topology, entry points, call chains, function signatures and one-line docstrings — intended to give agents persistent structural knowledge across sessions and reduce repeated rebuild costs.
  • Compression paradox: input token savings can shift burden to expensive reasoning/context tokens and extra tool calls; tool-assisted decompression can help but introduces call overhead and complexity.
  • Limitations: single model (claude-sonnet-4-6), single dataset (200 log events), linear retrieval task (logs) — architectural claims remain hypotheses needing further controlled experiments. Training-distribution and attention-degradation risks are important caveats.
  • Future work proposed: measure crossover points (when compressed+tool is better), skeleton evaluation (no skeleton vs. human vs. agent-generated), file consolidation experiments, formal ceremony-vs-logic measurement, and LLM-native language design.
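
To make the program skeleton idea concrete, here is a minimal sketch of how part of such an artifact could be generated for Python code using the standard `ast` module. It covers only function signatures and one-line docstrings; module topology, entry points and call chains are not attempted. The function names and the Markdown layout of the output are assumptions for illustration, not a format defined by the paper.

```python
import ast

def first_doc_line(node) -> str:
    """Return the first line of a node's docstring, or an empty string."""
    doc = ast.get_docstring(node)
    return doc.strip().splitlines()[0] if doc else ""

def signature(node) -> str:
    """Render a function definition as a one-line signature."""
    ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
    prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    return f"{prefix} {node.name}({ast.unparse(node.args)}){ret}"

def skeleton_for_source(source: str, module_name: str) -> list[str]:
    """Build skeleton lines for one module (hypothetical layout)."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = [f"## {module_name}"]
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            lines.append(f"- `{signature(node)}`: {first_doc_line(node)}")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"- `class {node.name}`: {first_doc_line(node)}")
            for item in node.body:
                if isinstance(item, funcs):
                    lines.append(f"  - `{signature(item)}`: {first_doc_line(item)}")
    return lines

# Hypothetical module used only to exercise the sketch.
SAMPLE = '''
def charge(order_id: str, amount_cents: int) -> bool:
    """Charge the customer for an order.

    Longer explanation that the skeleton deliberately drops.
    """
    return True
'''

print("\n".join(skeleton_for_source(SAMPLE, "payments.py")))
# ## payments.py
# - `def charge(order_id: str, amount_cents: int) -> bool`: Charge the customer for an order.
```

The output is lossy by design: bodies and long docstrings are dropped, while names, type annotations and the one-line summary (the high-information tokens) are kept. `ast.unparse` requires Python 3.9 or later.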

Data & Methods

  • Experiment domain: 200 synthetic/realistic log events modeling a 30-minute e-commerce window (HTTP requests, DB queries, auth, business logic, errors).
  • Four formats tested:
    • A — Human-Readable (natural language timestamps, full names)
    • B — Structured (pipe-delimited, Unix timestamps, key-value)
    • C — Compressed (abbreviated codes, compact schema)
    • D — Compressed + Decoder Tool (C plus a Python decoder script used via targeted tool calls)
  • Tokenization: cl100k_base (tiktoken).
  • Setup: the same five diagnostic questions were posed to claude-sonnet-4-6 in isolated, fresh sessions (15.1k baseline tokens each), without extended thinking.
  • Key measured outcomes (selected):
    • File-level tokens: A=8,072; B=7,106 (−12% vs A); C=6,695 (−17.1% vs A).
    • Session tokens (messages total): A=18.9k; B=24.0k; C=31.6k; D=28.3k.
    • Wall-clock time: A=1m36s; B=5m24s; C=7m00s; D=4m05s.
    • Tool calls: A/B/C = 1; D ≈ 5–7 (decoder invocations).
    • Correctness (out of 5): all formats 5/5; high-confidence judgments fell for compressed formats.
  • Findings:
    • Compression reduced input tokens but increased total session tokens (higher reasoning + output tokens).
    • Tool-assisted decompression reduced reasoning compared to raw compressed, but tool-call overhead and operational fragility (execution errors, extra detours) reduced net gains at the tested scale.
    • For moderate-scale logs, human-readable/structured formats yielded lower total cost than aggressive compression.
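
To give a feel for the four conditions, the sketch below renders one invented log event in formats resembling A, B and C, and adds a small decoder in the spirit of condition D. The event, the codebooks and every field name are hypothetical; the paper's actual schemas and decoder script are not reproduced in this summary.

```python
from datetime import datetime, timezone

# Hypothetical codebooks for the compressed format (assumption, not the paper's).
LEVELS = {"E": "ERROR", "W": "WARN", "I": "INFO"}
SERVICES = {"pay": "payment-service", "auth": "auth-service"}
MESSAGES = {"T01": "gateway timeout", "A02": "invalid session token"}

# One invented event, stored in its most compact form.
event = {"ts": 1775638800, "level": "E", "svc": "pay",
         "msg": "T01", "order": 48213, "ms": 5003}

def human_readable(e: dict) -> str:
    """Format A style: natural-language timestamp and full names."""
    when = datetime.fromtimestamp(e["ts"], tz=timezone.utc)
    stamp = when.strftime("%B %d, %Y at %H:%M:%S UTC")
    return (f"{stamp} {LEVELS[e['level']]} {SERVICES[e['svc']]}: "
            f"{MESSAGES[e['msg']]} for order {e['order']} after {e['ms']} ms")

def structured(e: dict) -> str:
    """Format B style: pipe-delimited, Unix timestamp, key=value pairs."""
    return (f"{e['ts']}|level={LEVELS[e['level']]}|service={SERVICES[e['svc']]}"
            f"|message={MESSAGES[e['msg']]}|order_id={e['order']}"
            f"|duration_ms={e['ms']}")

def compressed(e: dict) -> str:
    """Format C style: positional fields and abbreviated codes."""
    return f"{e['ts']}|{e['level']}|{e['svc']}|{e['msg']}|{e['order']}|{e['ms']}"

def decode(line: str) -> dict:
    """Format D style helper: expand one compressed line via the codebooks."""
    ts, level, svc, msg, order, ms = line.split("|")
    return {"ts": int(ts), "level": LEVELS[level], "service": SERVICES[svc],
            "message": MESSAGES[msg], "order_id": int(order),
            "duration_ms": int(ms)}

for render in (human_readable, structured, compressed):
    print(render(event))
print(decode(compressed(event)))
```

The compressed line is the shortest, but nothing in it says that `T01` means a gateway timeout. A reader without the codebook has to infer that, which is the interpretive work the experiment found moving into the model's reasoning phase.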

Implications for AI Economics

  • Cost modeling must move beyond input-token minimization:
    • Total task cost = input tokens + reasoning tokens + output tokens + tool-call overhead + human-review labor. Compression can reduce input but inflate reasoning/output tokens, raising operating expenditure (OPEX) per task.
    • Pricing by input tokens alone underestimates true cost exposure when compression forces more model reasoning.
  • Token budgets and context management are scarce economic resources:
    • Context window limits (200K–1M tokens typical; performance degradation beyond ~40% utilization reported) make semantic density a constrained-resource optimization problem. Investments that increase semantic density per context slot have diminishing marginal cost and can delay expensive scaling (bigger models / more context).
  • Tool-call economics and latency:
    • Tool-assisted decompression or skeleton lookups help but add fixed-per-call overhead and failure modes. At small-to-moderate artifact sizes, per-call overhead can negate token savings; at very large scales, tool-assisted selective access may become cost-effective. Estimating the crossover point is critical for cost-benefit decisions.
  • Labor and cognitive-debt externalities:
    • Optimizations for agents transfer cognitive load to human reviewers; total system cost must include human review time and error-correction costs. Economic decisions (e.g., file consolidation) should weigh agent efficiency gains against increased human-review costs and potential slower onboarding.
  • Training-distribution and model performance risk:
    • Changing conventions (e.g., consolidated large files, God objects, new skeleton artifact formats) can induce distributional drift relative to models' training data, raising hallucination risk and potentially increasing the need for retraining, fine-tuning, or domain-adaptive layers — all substantial capital/operational expenses.
  • Product and market opportunities:
    • IDE vendors, platform teams, and tooling companies can capture value by producing and standardizing skeleton artifacts, projection layers (dual human/machine views), token-efficient serialization formats (semantic-density preserving), and robust tool-invocation infrastructures to amortize per-call costs.
    • Pricing models for LLM services and observability platforms should consider per-session reasoning costs and tool-call overheads, possibly offering bundles (skeleton storage + efficient retrieval) or new SLAs tied to semantic-density optimizations.
  • Recommendations for practitioners and decision-makers:
    • Prioritize semantic density over raw token minimization; keep descriptive names, diagnostics, and docstrings.
    • Instrument and measure total-session token usage (not just input) and wall-clock/tool-call costs to inform engineering trade-offs and ROI.
    • Run controlled experiments to find the scale crossover where compressed + tool (or other selective-access) becomes net-cost-effective.
    • Account for human-review and retraining costs when changing conventions; plan for mitigation (projection layers, reviewer tooling).
  • Research & policy investment priorities:
    • Fund empirical studies to compute break-even/crossover points across workloads and models.
    • Standardize skeleton/metadata formats (persistent across sessions) to reduce repeated comprehension costs and thereby OPEX.
    • Evaluate macro-level impacts: if industry adopts semantic-density-first conventions, model providers will see shifts in training data distribution and in downstream inference loads — this could affect model architecture, pricing, and market dynamics.
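
The whole-session cost identity above can be written as a small accounting function. This is a sketch under stated assumptions: the class names, the choice to bill reasoning and output tokens at one rate, and all prices and token counts below are hypothetical placeholders, not figures from the paper or from any provider's price list.

```python
from dataclasses import dataclass

@dataclass
class Session:
    input_tokens: int
    reasoning_tokens: int
    output_tokens: int
    tool_calls: int
    review_minutes: float

@dataclass
class Prices:
    per_input_token: float
    per_generated_token: float  # assumption: reasoning and output billed alike
    per_tool_call: float
    per_review_minute: float

def total_cost(s: Session, p: Prices) -> float:
    """Input + reasoning + output + tool-call overhead + human review."""
    return (s.input_tokens * p.per_input_token
            + (s.reasoning_tokens + s.output_tokens) * p.per_generated_token
            + s.tool_calls * p.per_tool_call
            + s.review_minutes * p.per_review_minute)

# Hypothetical numbers chosen only to show the mechanism.
prices = Prices(per_input_token=3e-6, per_generated_token=15e-6,
                per_tool_call=0.001, per_review_minute=0.5)
readable = Session(input_tokens=8000, reasoning_tokens=1500,
                   output_tokens=500, tool_calls=1, review_minutes=2.0)
compressed = Session(input_tokens=6700, reasoning_tokens=9000,
                     output_tokens=1500, tool_calls=1, review_minutes=2.0)

for name, s in (("readable", readable), ("compressed", compressed)):
    print(f"{name}: input-only {s.input_tokens * prices.per_input_token:.4f}, "
          f"total {total_cost(s, prices):.4f}")
```

With these placeholder values the compressed session looks cheaper when only input tokens are counted and more expensive once generated tokens are included, which is the accounting gap the bullets above describe.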

Summary takeaway: For agentic development, optimize for semantic density (high-information tokens retained, zero-information ceremony removed). Doing so reduces total economic cost only if decisions are informed by whole-session accounting (input + reasoning + tool calls + human review). Short-term compression can be a false economy; tooling and investment decisions should be guided by measured crossover points and by internalizing the human-review and retraining externalities.
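
One way to reason about a crossover point is a deliberately simple linear model: reading a log in full costs tokens proportional to its size, while tool-assisted access pays a fixed overhead per call but reads only a fraction of the events. The sketch below implements that model. It ignores reasoning tokens entirely, and every parameter value is an assumption, except that roughly 40 tokens per event is about what 8,072 tokens over 200 events works out to for format A.

```python
import math
from typing import Optional

def full_read_tokens(n_events: int, tokens_per_event: float) -> float:
    """Tokens consumed when the whole log is placed in context."""
    return n_events * tokens_per_event

def tool_assisted_tokens(n_events: int, tokens_per_event: float, calls: int,
                         overhead_per_call: float,
                         selected_fraction: float) -> float:
    """Fixed per-call overhead plus the tokens of the events actually fetched."""
    return (calls * overhead_per_call
            + n_events * tokens_per_event * selected_fraction)

def crossover_events(tokens_per_event: float, calls: int,
                     overhead_per_call: float,
                     selected_fraction: float) -> Optional[int]:
    """Smallest event count at which tool-assisted access is no more
    expensive than a full read, or None if it never breaks even."""
    saved_per_event = tokens_per_event * (1.0 - selected_fraction)
    if saved_per_event <= 0:
        return None
    return math.ceil(calls * overhead_per_call / saved_per_event)

# Hypothetical parameters: 6 decoder calls at 1,500 tokens of overhead each,
# with each question needing about 10% of the events.
params = dict(tokens_per_event=40, calls=6,
              overhead_per_call=1500, selected_fraction=0.1)
n_star = crossover_events(**params)
print(n_star)  # 250
```

Under these made-up parameters, a 200-event log sits below the break-even size, which is at least directionally consistent with the limited net gains reported for the tool-assisted condition at the tested scale. Real crossover points would have to be measured, since reasoning cost and failure retries are not in this model.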

Assessment

  • Paper Type: quasi_experimental
  • Evidence Strength: medium. The paper uses an experimental manipulation that yields direct causal leverage on how log format affects agent cost and behavior, and reports large effect sizes (67% higher session cost). However, key details that determine external credibility (sample size, number and variety of models, task diversity, randomization/blinding procedures, statistical inference and robustness checks) are not specified, limiting confidence in the generality and precision of the estimates.
  • Methods Rigor: medium. The study design (multiple controlled conditions) is appropriate to test the stated hypothesis and the finding is internally plausible, but rigor is uncertain because the abstract omits crucial methodological details (sample sizes, randomization protocol, pre-registration, statistical tests, multiple-model replication, and sensitivity analyses). Without those, potential threats (selection bias, model-specific effects, task heterogeneity, pricing assumptions) are not fully addressed.
  • Sample: LLM-based agent sessions interacting with codebases under four log-format conditions, measuring input tokens, reasoning-phase tokens, and total session monetary cost; the abstract does not report the number of sessions, specific models or model versions used, task sets, or diversity of codebases.
  • Themes: human_ai_collab, productivity, org_design
  • Identification: Controlled laboratory experiment that manipulates log format (four treatment arms: human-readable, structured, compressed, tool-assisted compressed) and compares downstream outcomes (input tokens, reasoning tokens, total session cost) across arms; causal claims rest on the experimental treatment assignment (unclear from the abstract whether assignment was randomized or whether blocking/stratification, blinding, or clustering adjustments were used).
  • Generalizability:
    • Results may be specific to the particular model(s) and versions tested (model architecture and inference/pricing details can change outcomes).
    • Findings derive from a narrow experimental task (log-format/tokenization for agentic code navigation) and may not generalize to other software engineering tasks or domains.
    • Monetary cost conclusions depend on current API pricing and tokenization schemes, which vary across providers and over time.
    • Lab-style agent sessions may not reflect production, multi-agent, or long-running workflows where amortization and caching change cost dynamics.
    • Unclear whether the study covers multilingual codebases, different programming languages, or varied developer conventions, limiting generalizability across ecosystems.

Claims (9)

  • Claim: For six decades, software engineering principles have been optimized for a single consumer: the human developer.
    • Developer Productivity · Direction: null_result · Confidence: high
    • Outcome: orientation of software engineering design towards human developers
    • 0.08
  • Claim: The rise of agentic AI development, where LLM-based agents autonomously read, write, navigate, and debug codebases, introduces a new primary consumer with fundamentally different constraints.
    • Developer Productivity · Direction: mixed · Confidence: high
    • Outcome: who/what is the primary consumer of software engineering artifacts (human developer vs. agentic AI)
    • 0.08
  • Claim: We propose a key design principle: semantic density optimization, eliminating tokens that carry zero information while preserving tokens that carry high semantic value.
    • Developer Productivity · Direction: positive · Confidence: high
    • Outcome: information/content efficiency of token representations for agentic consumers
    • 0.08
  • Claim: We validate this principle through a controlled experiment on log format token economy across four conditions (human-readable, structured, compressed, and tool-assisted compressed).
    • Organizational Efficiency · Direction: null_result · Confidence: high
    • Outcome: performance on log-format token economy under different formatting conditions
    • 0.48
  • Claim: Aggressive compression increased total session cost by 67% despite reducing input tokens by 17%, because it shifted interpretive burden to the model's reasoning phase.
    • Organizational Efficiency · Direction: negative · Confidence: high
    • Outcome: total session cost (primary) and input token count (secondary)
    • Details: 67% increase (total session cost); 17% reduction (input tokens)
    • 0.48
  • Claim: Aggressive compression reduced input tokens by 17%.
    • Organizational Efficiency · Direction: positive · Confidence: high
    • Outcome: input token count
    • Details: 17% reduction
    • 0.48
  • Claim: Because aggressive compression shifts interpretive burden to the model's reasoning phase, aggressive token compression can paradoxically increase overall cost.
    • Organizational Efficiency · Direction: negative · Confidence: medium
    • Outcome: distribution of computational/interpretive workload between input processing and model reasoning; overall cost
    • 0.14
  • Claim: We extend the semantic density principle to propose rehabilitation of classical anti-patterns and introduce the program skeleton concept for agentic code navigation.
    • Developer Productivity · Direction: positive · Confidence: high
    • Outcome: suitability of classical anti-patterns and program skeletons for agentic navigation
    • 0.08
  • Claim: The paper argues for a fundamental decoupling of semantic intent from human-readable representation.
    • Developer Productivity · Direction: positive · Confidence: high
    • Outcome: alignment between semantic intent encoding and human-readable formats
    • 0.08
