The Commonplace

A 12,510-run study finds current legal LLM agents frequently fail to close matters in a single pass, but a modular 'Parthenon' framework and an anti-leakage learning loop substantially raise task accuracy and compliance without retraining models.

Parthenon Law: A Self-Evolving Legal-Agent Framework
Hejia Geng, Leo Liu · June 03, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
Using 12,510 agent trajectories on Harvey LAB, the paper shows frontier legal LLM agents often fail end-to-end but that the Parthenon modular framework plus an anti-leakage learning loop materially improves task accuracy and matter-level performance without changing model weights.

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products, yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB (12,510 agent trajectories) shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience (as a firm refines its checklists and playbooks after each matter) without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

Summary

Main Finding

PARTHENON is a six-layer, auditable legal-agent framework plus an anti-leakage self‑evolving loop that edits harness artifacts (Knowledge, Tools, Skills) rather than model weights. On a large-scale evaluation (12,510 agent trajectories on Harvey LAB), PARTHENON meaningfully improves state-of-the-art workspace-harnessed LLM performance: pooled per-criterion accuracy rises by roughly +7–14 percentage points (to 82.0 / 89.9 / 90.2% across solver tiers) and strict all-pass matter completion roughly triples for weaker solvers (e.g., 14 → 42 and 47 → 137 matters), showing that structured, editable harnesses and deterministic audits yield gains comparable to model upgrades.
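
As a quick check on the "roughly triples" wording, the two pairs of matter counts quoted above can be turned into multipliers. This is only illustrative arithmetic on the numbers reported in this summary; the names `all_pass_counts` and `uplift` are hypothetical, and the summary does not say which solver tier each pair belongs to.

```python
# Matter counts quoted in the summary: (without PARTHENON, with PARTHENON).
# The solver tier behind each pair is not stated here, so pairs stay unlabeled.
all_pass_counts = [(14, 42), (47, 137)]


def uplift(before: int, after: int) -> float:
    """Return the multiplicative change in strict all-pass matters."""
    return after / before


multipliers = [round(uplift(b, a), 2) for b, a in all_pass_counts]
print(multipliers)  # [3.0, 2.91]
```

Both pairs come out at about 3x, which matches the summary's description.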

Key Points

  • Problem: Frontier LLMs paired with general-purpose harnesses still fail common, material legal tasks in a single pass; per-criterion accuracy improves with stronger models but strict all-pass matter completion remains low.
  • Core idea: Factor legal-agent behavior into six replaceable, auditable surfaces — Model, Harness, Agent, Knowledge, Tools, Skills — so failures can be traced and fixed at the appropriate surface instead of by tuning model weights.
  • Self-evolving loop: Three-role separation of SOLVER (drafts), EVALUATOR (external, rubric-based scoring), and LEARNER (proposes harness edits), with structural redaction and anti-leakage controls so that learner edits are task-agnostic and cannot memorize rubrics or private facts.
  • Deterministic enforcement: Convert recurring legal invariants (deadlines, citations, numeric/date reconciliation, deliverable shape, issue-closure) into executable Tools and mandatory Release Gates to enforce auditable contracts.
  • Knowledge as data: Store statutes, deadline windows, deliverable schemas, holiday calendars, synonyms, and inference rules as versioned artifacts (2,300+ entries) editable by the learner but strictly prohibited from encoding matter-specific secrets or rubric answers.
  • Skills as procedure: 1,251 versioned, rubric-blind procedural skills encode triage, mandatory tools, coverage, issue lifecycles, and finalization checks; edits are diffs that must pass automated and empirical acceptance gates.
  • Anti-leakage and gated commits: Learner proposals are heavily redacted, must compile and pass static safety checks, and are accepted only if they improve per-task pass rates; rejected candidates are logged.
  • Empirical failure modes identified: incomplete source coverage, lost numerical/temporal detail, malformed deliverables, unfinished issue analysis, and weak grounding — many mechanical and addressable via Tools/Skills.
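
The "deterministic enforcement" and "knowledge as data" points can be pictured with a small sketch. It assumes a hypothetical holiday calendar and a hypothetical rule that a deadline landing on a weekend or holiday rolls forward to the next business day. Neither the rule nor the names (`HOLIDAYS`, `compute_deadline`, `release_gate`) come from the paper, and real deadline rules vary by jurisdiction.

```python
from datetime import date, timedelta

# Hypothetical knowledge artifact: a versioned holiday calendar kept as data.
HOLIDAYS = {date(2026, 7, 3)}  # illustrative entry, not taken from the paper


def is_business_day(d: date) -> bool:
    """Monday to Friday, excluding dates in the holiday calendar."""
    return d.weekday() < 5 and d not in HOLIDAYS


def compute_deadline(trigger: date, window_days: int) -> date:
    """Add a calendar-day window, then roll forward to a business day.

    The roll-forward rule is an assumption made for illustration only.
    """
    due = trigger + timedelta(days=window_days)
    while not is_business_day(due):
        due += timedelta(days=1)
    return due


def release_gate(stated: date, trigger: date, window_days: int) -> bool:
    """Pass only if the date written in the draft matches the recomputed one."""
    return stated == compute_deadline(trigger, window_days)


trigger = date(2026, 6, 3)
print(compute_deadline(trigger, 30))                # 2026-07-06
print(release_gate(date(2026, 7, 3), trigger, 30))  # False
```

The design point is that the date is recomputed from data on every run instead of being trusted from the model's draft, so a wrong date blocks release rather than reaching a reviewer.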

Data & Methods

  • Dataset: Harvey LAB — 1,251 “matters” spanning 24 practice areas; median 7 source documents and 57 rubric criteria per matter; a range of tasks (drafting, analysis, extraction, comparison, review).
  • Execution families compared:
    • Direct API prompting (baseline)
    • Basic legal-native harness
    • Workspace harnesses (Codex-style and Claude Code)
    • Each paired with and without PARTHENON to isolate harness effects
  • Experiments:
    • 12,510 agent trajectories across the harnesses and three model tiers.
    • Solver settings fixed per paired comparison; evaluator uses withheld rubrics (solver never sees rubric or judge feedback).
    • Metrics:
      • Criterion accuracy: percent of individual rubric criteria passed across all matters.
      • All-pass matter completion: percent of matters passing every rubric criterion (strict legal-review-style standard).
  • PARTHENON architecture components:
    • Model layer: pluggable LLM providers (no binding to one family).
    • Harness layer: pluggable workspace runtimes exposing files, tools, traces.
    • Agent layer: role separation (Solver, Evaluator, Learner) to prevent leakage.
    • Knowledge layer: six families (statute catalog, window catalog, deliverable catalog, holiday calendars, legal synonyms, inference rules).
    • Tools layer: 14 deterministic tools across stages (source inspection, retrieval, audit/computation, release gates).
    • Skills layer: 1,251 task-routed procedural skills (seven-part scaffold per skill).
  • Self-evolving protocol:
    • For a minibatch, evaluator produces redacted failure signals; learner proposes task-agnostic edits (batches of diffs) to K/T/G/A; acceptance requires static checks and measured empirical improvement; edits are versioned, reviewable harness commits, not weight updates.
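
The two metrics and the empirical half of the acceptance gate can be sketched on toy data. Everything here is an assumption for illustration: the data, the function names, and in particular the choice to gate on pooled criterion accuracy (the summary says only that edits must pass static checks and show measured improvement).

```python
from typing import Dict, List

# Toy data: each matter maps to one boolean per rubric criterion.
Results = Dict[str, List[bool]]


def criterion_accuracy(results: Results) -> float:
    """Share of individual rubric criteria passed, pooled over all matters."""
    flat = [ok for criteria in results.values() for ok in criteria]
    return sum(flat) / len(flat)


def all_pass_rate(results: Results) -> float:
    """Share of matters in which every rubric criterion passed."""
    return sum(all(c) for c in results.values()) / len(results)


def accept_edit(baseline: Results, candidate: Results, static_ok: bool) -> bool:
    """Hypothetical gate: static checks pass and pooled accuracy strictly improves."""
    return static_ok and criterion_accuracy(candidate) > criterion_accuracy(baseline)


baseline = {"m1": [True, True, False, True], "m2": [True, False, True, True]}
candidate = {"m1": [True, True, True, True], "m2": [True, False, True, True]}

print(criterion_accuracy(baseline), all_pass_rate(baseline))    # 0.75 0.0
print(criterion_accuracy(candidate), all_pass_rate(candidate))  # 0.875 0.5
print(accept_edit(baseline, candidate, static_ok=True))         # True
```

The baseline row shows how 75% criterion accuracy can coexist with 0% all-pass completion: a single failed criterion per matter is enough to fail the strict standard.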

Implications for AI Economics

  • Returns to domain engineering vs model improvements:
    • Measurable uplift from PARTHENON is comparable to a model upgrade, implying substantial returns to investment in domain-specific harness engineering, deterministic tools, and curated legal knowledge, and not only to larger or more capable LLMs.
    • Vendors and firms can capture value by building and versioning legal Knowledge/Tools/Skills; these assets may be more durable and cheaper to maintain than continual model fine-tuning.
  • Business model and competitive dynamics:
    • Competitive differentiation may shift toward proprietary, auditable harness artifacts (tool libraries, procedures, knowledge bases) rather than model access alone, reducing model lock-in risk and increasing value of firm-specific playbooks.
    • First movers who accumulate high-quality, reusable skills and tools can scale matter throughput and quality, potentially creating winner-take-most effects in legaltech for routine matter classes.
  • Labor and productivity effects:
    • Deterministic audits and procedural skills reduce routine errors and rework, raising effective productivity for mid/low-complexity legal tasks; this could depress demand for junior drafting work while increasing demand for oversight, skill engineering, and exception handling.
    • Legal firms may reallocate labor toward higher-value, judgment-intensive tasks and harness maintenance (curating knowledge, approving learner edits).
  • Cost structure and deployment economics:
    • Non-parametric improvement (editing harness vs retraining) lowers marginal cost of improving agent performance and reduces compute/labeling costs tied to model fine-tuning, changing investment priorities toward software engineering and governance.
    • Ongoing maintenance costs: keeping Knowledge (statutes, windows, calendars) current and reconciling jurisdictional differences remains necessary and creates recurring operational expense.
  • Liability, regulation, and privacy:
    • PARTHENON’s auditable surfaces and anti-leakage constraints align with regulatory demands for provenance, explainability, and data privacy; this reduces compliance risk and may ease regulators’ acceptance of deployed legal agents.
    • Firms that can demonstrate deterministic audits and redaction rules may face lower malpractice exposure for routine tasks, but legal liability models will need updating to reflect harness vs model responsibilities.
  • Market for auxiliary services:
    • Demand will grow for tooling providers (deterministic legal tools, release-gate frameworks), skill-playbook marketplaces, and independent evaluators/auditors — creating new layers in the legal AI stack.
  • Externalities and risks:
    • Concentration of high-quality knowledge/tool artifacts could restrict competition if artifacts are proprietary and non-interoperable; conversely, PARTHENON’s pluggable model/harness design could foster modular markets.
    • Anti-leakage reduces benchmark/data leakage but also limits rapid adaptation to single-matter peculiarities; economic value depends on the balance between generalizable procedures and matter-specific expertise.

Overall, PARTHENON suggests that in regulated, information-dense verticals like law, engineering harnesses, deterministic audits, and curated knowledge/procedure libraries can deliver substantial economic value and quality improvements — often at lower marginal cost and regulatory risk than continual model retraining.

Assessment

  • Paper Type: descriptive
  • Evidence Strength: medium. The paper reports a large-scale empirical dataset (12,510 agent trajectories) and comparative performance improvements from the Parthenon framework across state-of-the-art models and harnesses, giving substantive descriptive evidence of behavior and gains; however, there is no randomized or quasi-experimental identification of causal effects on economic outcomes, findings are measured on a single (likely proprietary) platform and task suite, and external validation / real-world productivity impacts are not demonstrated.
  • Methods Rigor: medium. Rigorous engineering and systematic evaluation are evident (large sample, cross-model comparisons, modular decomposition of agent components, and an audit-friendly design), but the study lacks causal identification, randomized interventions, clear out-of-sample validation or replication on independent datasets/platforms, and details about dataset curation, labeling protocols, and statistical uncertainty are not provided in the summary.
  • Sample: 12,510 agent trajectories run on the Harvey LAB platform covering end-to-end legal-matter tasks; evaluations compare multiple state-of-the-art model-and-harness combinations and measure per-criterion accuracy, matter completion, and subsequent performance after applying the Parthenon framework and an anti-leakage learning loop.
  • Themes: human_ai_collab, productivity
  • Generalizability:
    • Single platform (Harvey LAB) and proprietary harnesses; results may not transfer to other platforms or open-source pipelines.
    • Task suite likely focused on particular legal matter types and document-heavy workflows; performance may differ on other legal domains or jurisdictions.
    • Models and harnesses reflect contemporary frontier systems; findings may change as base models evolve.
    • Evaluation uses system-internal success criteria; external measures (e.g., lawyer time saved, client outcomes, compliance/legal risk) are not shown.
    • Potential dataset curation or selection biases (which matters were run) may limit representativeness.

Claims (8)

  • Claim: A large-scale empirical study on Harvey LAB used 12,510 agent trajectories.
    • Category: Other · Direction: null_result · Confidence: high
    • Outcome: agent trajectories (dataset size)
    • Details: n=12510; 0.3
  • Claim: Even frontier agents remain far from completing matters in a single pass.
    • Category: Task Completion Time · Direction: negative · Confidence: high
    • Outcome: matter completion in a single pass (strict end-to-end completion)
    • Details: n=12510; 0.18
  • Claim: Per-criterion accuracy climbs with stronger models.
    • Category: Output Quality · Direction: positive · Confidence: high
    • Outcome: per-criterion accuracy
    • Details: n=12510; 0.18
  • Claim: Strict matter completion stalls (does not improve) despite stronger models.
    • Category: Task Completion Time · Direction: negative · Confidence: high
    • Outcome: strict matter completion rate
    • Details: n=12510; 0.18
  • Claim: We introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure.
    • Category: Regulatory Compliance · Direction: positive · Confidence: high
    • Outcome: auditability (source traceability), date/number grounding, deliverable compliance, issue closure
    • Details: 0.18
  • Claim: An anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience without touching model weights.
    • Category: Training Effectiveness · Direction: positive · Confidence: high
    • Outcome: system improvement via edits to skills/tools/knowledge (no model weight changes)
    • Details: 0.18
  • Claim: Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
    • Category: Output Quality · Direction: positive · Confidence: high
    • Outcome: performance on legal-matter tasks (aggregate metric unspecified in abstract)
    • Details: n=12510; 0.18
  • Claim: Reliable deployment faces three obstacles: (1) no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; (2) no agent architecture adapted to the legal vertical, only general-purpose harnesses; and (3) no mechanism for systems to learn from their own outcomes in a changing setting.
    • Category: Adoption Rate · Direction: negative · Confidence: high
    • Outcome: availability of prior large-scale evidence, existence of legal-specific agent architectures, existence of closed-loop learning mechanisms
    • Details: 0.03
