A four-month case study of a persistent research agent shows workflows are cache-dominated—82.9% of recorded tokens were cache reads—implying the economic unit may shift from cost-per-token to cost-per-completed-artifact.
Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
Summary
Main Finding
A single-investigator, 115-day implementation case shows that embedding a persistent AI agent into an academic research workspace produces a cache-dominant, stateful research infrastructure that expands capacity and scope of activity. Token telemetry (May 1–25 subset) was 82.9% cache reads, implying the economic unit of interest may shift from "cost per token" to "cost per completed artifact" (or cost per verified action) for persistent agentic workflows.
Key Points
- Study design
- Structured self-observed implementation case (Jan 31–May 25, 2026). Unit of analysis = the human + agent runtime + memory + tools + repos + scheduled jobs + governance rules.
- Introduces PARE-M v0.1, a measurement framework separating utilization, output, resource, reproducibility, and governance domains.
- Utilization and activity (recoverable telemetry, Feb 2–May 25)
- De-duplicated records (DRC): 75,671 across 96 active days (Active-day fraction, ADF = 0.835).
- User-role messages: 8,059; Assistant-role messages: 23,710.
- Tool-result messages: 18,596; Tool-call events: 2,385.
- Model-completed events (broad telemetry): 1,286.
- Active system time (ATE): 579.7 hours (30-min capped-gap rule); sensitivity 674.1 hours (60-min cap). These are system-activity estimates, not human labor hours.
- Outputs and governance
- Memory-recorded output-proxy events: 482 (output-proxy rate, OPR = 5.02 per active day).
- Governance/failure/verification/correction proxies: 889 (governance-event rate, GER = 9.26 per active day).
- Artifact-surface breadth (ASB): 10 categories spanning manuscripts, teaching artifacts, software, operations, etc.
- Resource & token profile (strict May 1–25 model-completed trajectory subset)
- Model-completed events (strict subset): 627.
- Total recorded tokens: 73,950,305.
- Cache-read tokens: 61,278,669 (82.9% → cache-dominance ratio, CDR = 0.829).
- Input tokens: 10,697,394; Output tokens: 754,633; Cache-write tokens: 1,219,609.
- Interpretation: majority of token activity was cache reads (reused context), not fresh generation.
- Infrastructure & spend
- Workspace inventory highlights: 502 memory-related files; 17 configured agent directories; 57 skill files; thousands of session/JSONL-like files.
- Runtime: DigitalOcean Droplet (4 vCPU; 7.8 GiB RAM; 154 GiB disk). Verified local compressed backup ~5.4 GB.
- Preliminary observed direct spend ≈ US$1,961 (invoice reconciliation incomplete).
- Behavioral/organizational observation
- Persistent memory accumulated safety rules, lessons, and protocol checks; governance became embedded in the workspace rather than an external addendum.
- Evidence of capacity expansion (scope of delegated work increased) rather than demonstrated labor substitution or measured productivity gains.
- Limitations (major)
- Single-investigator, self-observed case; the author was also system designer, user, data source, and analyst.
- No control group, no baseline, partial telemetry windows (e.g., token telemetry limited to May 1–25), and author-coded governance events lacking independent validation.
- File counts and session artifacts heterogeneous and not directly commensurable with completed outputs.
Data & Methods
- Study type: Structured implementation case study, Jan 31–May 25, 2026; recoverable telemetry begins Feb 2.
- Unit of analysis: Persistent agentic research environment (human + runtime + memory + tools + repos + jobs + governance).
- Data sources: recoverable session telemetry (JSONL-like), memory files, repository & file-system inventories, model-use logs, decision logs, documented protocols, backup artifacts.
- Key operational definitions and processing rules
- De-duplicated record: unique event after de-duplication (hashing timestamp, role, event type, content prefix, tool name when needed).
- Active day: calendar day with ≥1 recoverable main-agent event.
- Output-proxy event: dated memory entry indicating completion/delivery of a deliverable.
- Governance-proxy events: failure, verification, correction, or protocol lessons recorded in memory.
- Active-time estimate (ATE): sum of consecutive timestamp gaps capped at 30 minutes (primary) with a 60-minute cap sensitivity.
- Cache-dominance ratio (CDR): cache-read tokens / total recorded tokens (input + output + cache-read + cache-write) over the strict trajectory subset.
- Measurement framework: PARE-M v0.1 requires numerator, denominator, time window, and computation rule for each metric (examples: ADF, DRC, ATE, CDR, OPR, GER, ASB).
- Reproducibility: Data schema, parsing rules, de-duplication logic, active-time algorithm, and file-classification rules are reproducible in principle; raw conversations and sensitive materials withheld for privacy. A de-identified event ledger and parsing scripts are being prepared.
Implications for AI Economics
- Shifts in the meaningful pricing denominator
- Cache-dominant persistent workflows imply token-based pricing/metrics become less informative; the economically relevant unit may be cost per completed artifact, per verified workflow, or per avoided correction.
- Billing/contracting that charges strictly per generated token may misalign incentives for persistent, memory-heavy work where cache reads dominate.
- Provider and platform design implications
- Providers and tooling vendors may need to offer primitives for stable, auditable state/caches, predictable provider-routing, and cost-accounting primitives oriented to artifacts or sessions rather than raw tokens.
- Marketplace offerings could include pricing plans tuned to persistent environments (e.g., archive/cache storage, artifact-oriented SLAs, governance primitives).
- Measurement and evaluation recommendations for economic analyses
- Future empirical studies should use artifact-level denominators (cost per manuscript, per verified analysis, per deployment), reproducible parsing rules, and standardized correction/governance taxonomies.
- Independent coding of governance events and off-platform cost reconciliation (invoices) are needed to translate telemetry into cash-flow estimates.
- Organizational and policy considerations
- Governance, privacy, and reproducibility costs are central and accrue with persistence; institutions should account for integration, auditability, data residency, and restore/backup maturity when assessing total cost of ownership.
- Economic assessments should include non-token costs: engineering to integrate and maintain state, human oversight/corrections, and compliance overhead.
- Short-term inference for researchers and institutions
- Expect capacity expansion and broader scope of agent-delegated tasks, not immediate labor substitution; ROI estimates should consider long-run marginal cost reductions after memory and reusable procedures are established.
- Budgeting should anticipate spend beyond API tokens (runtime, backups, governance, staffing for oversight), and track cost per artifact rather than raw token volume.
Suggested next steps (for researchers/economists) - Replicate with multiple investigators/teams and independent coders. - Estimate cost-per-artifact using reconciled invoices and artifact outcome registers. - Design experiments comparing episodic vs. persistent agent workflows to quantify marginal productivity and true cost differentials.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. Adoption Rate | null_result | high | number of telemetry records and role-specific messages |
n=96
75,671 de-duplicated records; 8,059 user-role messages; 23,710 assistant-role messages
0.18
|
| The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Adoption Rate | null_result | high | counts of workspace memory files, agent directories, and skill files |
502 memory-related files; 17 configured agent directories; 57 skill files
0.18
|
| Active system time was 579.7 hours (30-minute capped-gap estimate). Organizational Efficiency | null_result | high | active system runtime (hours) |
579.7 hours
0.18
|
| Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. Error Rate | null_result | high | counts of output-proxy events and counts of failure/verification/correction/protocol-proxy events |
482 output-proxy events; 889 failure/verification/correction/protocol-proxy events
0.18
|
| A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Organizational Efficiency | null_result | high | model-completed events, total recorded tokens, proportion of tokens served from cache |
n=627
627 model-completed events; 73.95 million recorded tokens; 82.9% cache reads
0.18
|
| The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Organizational Efficiency | mixed | high | dominance of cache reads (resource-cost implication) and predicted change in cost-denominator from token to artifact |
82.9% cache reads (used to justify 'cache-dominant' characterization)
0.03
|
| Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events. Research Productivity | positive | high | recommended methodological practices for future evaluations (artifact-level denominators, parsing rules, taxonomies, independent coding) |
0.03
|