A self-hosted legal small language model matches frontier LLMs on contract extraction and cuts inference costs by up to 97%, while producing fewer of the hallucinations that drive review burden.

A Few Good Clauses: Comparing LLMs vs Domain-Trained Small Language Models on Structured Contract Extraction

Nicole Lincoln, Nick Whitehouse, Jaron Mar, Rivindu Perera · May 07, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 8/10

A domain-trained, self-hosted small MoE model (Olava Extract) matches or exceeds five frontier LLMs on structured legal contract extraction while cutting inference costs by 78–97% and producing fewer hallucinated extractions.

This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings show that high-performing, human-comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.

Summary

Main Finding

A domain-trained Small Language Model (Olava Extract — a self-hosted Mixture-of-Experts SLM fine-tuned for contracts) matched or slightly outperformed five commercially available frontier LLMs on a structured contract-extraction task while operating at a small fraction of their inference cost. Aggregate results: micro F1 = 0.842, macro F1 = 0.812; Olava also achieved the highest precision (micro precision 0.812, macro precision 0.780) and produced fewer unsupported/hallucinated extractions. Under batched inference, extraction cost was 78%–97% lower than the frontier models tested.

Key Points

Task and scope
- Full-document structured extraction of 26 target fields (clauses, dates, durations, currency values, boolean/categorical picks, short text identifiers).
- Evaluation: 24 held-out public SEC EDGAR contracts, 508 human-labelled field instances annotated by lawyers (each contract reviewed by ≥2 lawyers; disagreements adjudicated).
- Output required both a normalized display answer and supporting verbatim contract span(s).
Models compared
- Olava Extract (domain-trained SLM, MoE-style active footprint) vs Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, GPT-5.4.
- Frontier baselines evaluated zero-shot with the same system/user prompts and structured output spec.
Performance highlights
- Olava Extract ranked highest on both micro and macro F1, though margins over top frontier models were small.
- Strongest category performance: short-text identifiers and extracted text concepts; weakest: currency fields (aggregation/normalization issues).
- Olava showed a precision-oriented profile (fewer hallucinations/unsupported answers), valuable for legal workflows where false positives impose review burden and risk.
Cost & speed
- Olava inference run on 2 H200 SXM GPUs (vLLM), priced at $4.01/H200-hour for cost estimates.
- Batched inference cost reductions relative to frontier APIs: 78%–97%.
- Two self-hosted cost figures reported: parallel-batched (wall-clock) and unbatched-serial (sum of per-document runtime).
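
The two self-hosted cost figures can be illustrated with a minimal sketch. It uses the 2-GPU setup and the $4.01 per H200-hour price from the summary; the function names are hypothetical, and the sketch assumes both GPUs are billed for the full duration of a run, which the summary does not confirm.

```python
GPU_COUNT = 2               # from the summary: 2 H200 SXM GPUs
PRICE_PER_GPU_HOUR = 4.01   # from the summary: USD per H200-hour


def gpu_cost(seconds: float) -> float:
    """USD cost of occupying all GPUs for the given number of seconds (assumed billing model)."""
    return GPU_COUNT * PRICE_PER_GPU_HOUR * seconds / 3600.0


def batched_cost_per_doc(wall_clock_seconds: float, n_docs: int) -> float:
    """Parallel-batched figure: one wall-clock run shared across all documents."""
    return gpu_cost(wall_clock_seconds) / n_docs


def serial_cost_per_doc(per_doc_seconds: list[float]) -> float:
    """Unbatched-serial figure: summed per-document runtimes, averaged per document."""
    return gpu_cost(sum(per_doc_seconds)) / len(per_doc_seconds)


def cost_reduction(self_hosted: float, api: float) -> float:
    """Fractional saving of a self-hosted cost relative to an API cost per document."""
    return 1.0 - self_hosted / api
```

Once measured runtimes and an observed API cost per document are supplied, `cost_reduction(batched_cost_per_doc(wall_clock, 24), api_cost)` yields the kind of percentage saving reported above. Batching lowers the per-document figure because many contracts share the same GPU time.
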
Evaluation protocol specifics that affect interpretation
- No chunking/retrieval/vector-store augmentation — full contract inserted into one model call (max context 262,144 tokens for Olava).
- Field-specific matching rules (no date/currency normalization) — scoring used display-answer and/or span overlap per field-type.
- Training labels were synthetically generated by a frontier LLM and filtered with an LLM-as-judge pipeline (fuzzy-span matching).
- Fine-tuning: LoRA (rank 32), one epoch, bfloat16; training labels = 89,517, validation = 5,453.
- No statistical significance testing reported.
Limitations noted in paper
- Small held-out evaluation (24 contracts); results may not generalize across broader contract types or private client documents.
- Training labels synthetic (quality depends on judge-filtering); frontier baselines were zero-shot (no few-shot examples).
- Cost/latency comparisons depend on specific hardware, batching strategy, API pricing, and observed token use — not claimed as universal.
- No normalization of dates/currency at scoring time may affect numeric/date fields.
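
The field-specific matching described under the evaluation protocol could look roughly like the sketch below. The field-type names, the token-overlap measure, the 0.5 threshold, and the case/whitespace folding are all assumptions for illustration; the summary does not give the paper's exact rules or thresholds.

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace; deliberately no date or currency normalization."""
    return " ".join(text.lower().split())


def token_overlap(pred: str, gold: str) -> float:
    """Fraction of gold tokens that also appear in the prediction (hypothetical measure)."""
    gold_tokens = _norm(gold).split()
    pred_tokens = set(_norm(pred).split())
    if not gold_tokens:
        return 0.0
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)


def is_match(field_type: str, pred: str, gold: str, threshold: float = 0.5) -> bool:
    """Decide whether a predicted answer counts as correct for a given field type."""
    if field_type == "clause":
        # Clause fields: span overlap against the gold span.
        return token_overlap(pred, gold) >= threshold
    if field_type in {"date", "duration", "currency"}:
        # Substring check in either direction, on the raw (un-normalized) values.
        p, g = _norm(pred), _norm(gold)
        return bool(p) and bool(g) and (p in g or g in p)
    if field_type in {"category", "identifier"}:
        # Closed-set categories and short-text identifiers: equality.
        return _norm(pred) == _norm(gold)
    raise ValueError(f"unknown field type: {field_type}")
```

Under rules like these, `is_match("date", "January 1, 2024", "1/1/2024")` is false even though both strings name the same day, which illustrates how the absence of normalization can penalize date and currency fields.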

Data & Methods

Data sources
- Model development and evaluation used public SEC EDGAR exhibits only. Training pool limited to documents of 10k–100k tokens and those with at least 22/26 target fields on a preliminary pass.
- Training/validation split: 89,517 training labels / 5,453 validation labels (stratified by contract type; held-out evaluation contracts excluded by ID).
Labeling
- Training labels synthesized using a frontier LLM, then filtered and scored by an LLM-as-judge panel; cited spans fuzzy-matched back to sources to produce display answers + supporting spans.
- Held-out evaluation labels created by legal professionals in Label Studio (dual review + senior adjudication).
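
Fuzzy-matching a cited span back to its source text can be sketched with Python's standard `difflib`. This illustrates the general technique only and is not the paper's pipeline; the sliding-window approach, the function name `locate_span`, and the 0.85 threshold are assumptions.

```python
from difflib import SequenceMatcher


def locate_span(cited: str, source: str, min_ratio: float = 0.85):
    """Return (start_word, end_word, ratio) of the best-matching source window, or None.

    Slides a window with as many words as the cited span over the source
    and keeps the window with the highest similarity ratio.
    """
    cited_words = cited.split()
    source_words = source.split()
    n = len(cited_words)
    if n == 0 or len(source_words) < n:
        return None
    target = " ".join(cited_words).lower()
    best = None
    for i in range(len(source_words) - n + 1):
        window = " ".join(source_words[i:i + n]).lower()
        ratio = SequenceMatcher(None, target, window, autojunk=False).ratio()
        if best is None or ratio > best[2]:
            best = (i, i + n, ratio)
    # Reject the label if even the best window is too dissimilar.
    if best is not None and best[2] >= min_ratio:
        return best
    return None
```

A span that fails to clear the threshold would be dropped from the training set rather than kept with an unverifiable citation, which is one plausible way a filter of this kind limits label noise.
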
Model development
- Olava Extract: domain-adapted SLM implemented as MoE with an active inference footprint comparable to smaller dense models.
- Fine-tuning: LoRA parameter-efficient tuning, bfloat16, single epoch; best checkpoint chosen by validation loss.
Inference & evaluation
- Single-turn, full-document extraction: all 26 fields extracted in one call per contract, same prompts and expected structured schema for all models.
- Matching rules: data-type-specific (span overlap for clause fields, substring/span checks for durations/dates/currency, exact equality for closed-set categories and short-text identifiers).
- Metrics: per-field precision/recall/F1, aggregated as macro (equal-weight per field) and micro (pooled instances). No CI or significance tests.
- Cost model: Olava on 2 H200 GPUs; frontier model cost from observed token usage × published API rates. Reported batched and serial self-hosted cost estimates.
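
The difference between the macro and micro aggregates above can be made concrete with a short sketch. The field names and counts in the usage note are invented for illustration and are not results from the paper.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_micro_f1(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float]:
    """counts maps field name -> (tp, fp, fn). Returns (macro F1, micro F1)."""
    # Macro: compute F1 per field, then average with equal weight per field.
    per_field = [prf(*c)[2] for c in counts.values()]
    macro = sum(per_field) / len(per_field)
    # Micro: pool the counts across all fields, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)[2]
    return macro, micro
```

With the invented counts `{"governing_law": (9, 1, 0), "total_value": (1, 1, 3)}`, the weak, low-volume field pulls the macro score down more than the micro score. The same mechanism would explain a macro F1 below the micro F1, as in the reported 0.812 versus 0.842.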

Implications for AI Economics

Enterprise AI can decouple capability from sheer model scale
- Domain-specialized, parameter-efficient SLMs (including MoE designs) can achieve frontier-level performance on a commercially meaningful task at much lower inference costs and with self-hosting benefits (data control, versioning, deployment stability).
- This undermines the assumption that best-in-class enterprise AI must be delivered via the largest centrally hosted LLMs; economic value can accrue from targeted, smaller models tailored to domain needs.
Cost structure and total-cost-of-ownership (TCO) trade-offs
- Large reductions in per-document inference cost (78%–97% in batched settings here) imply substantial savings at scale versus API-hosted frontier models — potentially changing the ROI calculus for automating legal review workflows.
- However, enterprises must weigh hardware + ops + maintenance + model-update costs against API fees; self-hosting shifts costs from variable per-token fees to capital & operational expenses and engineering support.
Labor-market and process impacts
- Higher-throughput, lower-cost contract extraction raises economic pressure on labor models that rely on junior lawyers or LPOs for high-volume extraction/review tasks — accelerating substitution and workflow redesign.
- The precision-first behavior of the domain SLM (fewer hallucinations) reduces downstream human verification burden and risk — improving operational viability in regulated or high-liability settings.
Market structure and vendor dynamics
- Demand may grow for specialist domain models and fine-tuning/ops vendors rather than pure-play API LLM providers for certain enterprise use cases (legal, healthcare, finance).
- Cloud/API providers will still be attractive for rapid experimentation and edge cases; hybrid models (domain SLMs for core tasks + frontier LLMs for exploratory/edge reasoning) are plausible.
Caveats for economic decision-making
- Results are task- and dataset-specific. Small evaluation size, synthetic training labels, and zero-shot frontier baselines mean buyers should require larger-scale, replicated evaluations and production pilots before committing fully to a self-hosted SLM replacement.
- Legal/regulatory risk, auditing, and long-term maintenance (retraining, label drift, model governance) add recurring costs not fully captured by per-document inference comparisons.
- Domain SLMs may underperform outside the trained contract distributions; enterprises must assess coverage and failure modes carefully.
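
The trade-off between variable API fees and fixed self-hosting costs described above can be framed as a simple breakeven volume. A rough sketch follows, in which every input is a placeholder to be replaced with an organisation's own figures (none come from the paper) and the function name is hypothetical.

```python
import math


def breakeven_docs_per_month(
    fixed_monthly_cost: float,        # hardware, ops, maintenance, governance (USD/month)
    self_hosted_cost_per_doc: float,  # marginal inference cost when self-hosting (USD)
    api_cost_per_doc: float,          # observed API cost per document (USD)
) -> float:
    """Monthly document volume at which self-hosting and API usage cost the same.

    Returns math.inf when self-hosting has no per-document saving to recoup
    its fixed costs.
    """
    saving_per_doc = api_cost_per_doc - self_hosted_cost_per_doc
    if saving_per_doc <= 0:
        return math.inf
    return fixed_monthly_cost / saving_per_doc
```

Below the breakeven volume the API remains cheaper overall despite its higher per-document price; above it, the per-document saving outweighs the fixed overhead. Recurring costs such as retraining and auditing belong in `fixed_monthly_cost`.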

Suggested practical next steps for decision-makers
- Run a production pilot/A-B test on in-scope contract types to measure real-world accuracy, human review time saved, and end-to-end TCO.
- Expand evaluation dataset to a broader set of contract types (including private contracts) and perform statistical testing and error analysis, especially on low-performing fields (currency aggregation, some date fields).
- Model governance: set verification workflows for low-confidence fields and implement logging/auditing for legal compliance.


Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports direct evaluation metrics (macro/micro F1, precision) and cost estimates comparing a domain-trained self-hosted Small Language Model (Olava Extract) to five frontier models, which provides concrete empirical evidence for the claims. However, key details that affect confidence are missing or unclear in the summary: dataset size and representativeness, labeling and test-train splits, whether evaluations used multiple random seeds or statistical tests, exact prompt/hyperparameter parity across models, and how inference costs were measured (hardware, batching, throughput). These omissions leave open concerns about overfitting to a narrow domain, evaluation bias, and reproducibility.

Methods Rigor: medium. The study uses standard task metrics (macro/micro F1, precision) and compares multiple models, which is appropriate for benchmarking. It also reports cost comparisons and hallucination-related outcomes that are practically important in legal workflows. However, rigor is reduced by likely absence of (or unspecified) controls for prompt and tuning differences across models, lack of reported statistical significance or confidence intervals, no detailed cost-accounting methodology in the summary (e.g., hardware, amortization, throughput), and unclear dataset provenance/size, all factors that can materially affect benchmark outcomes.

Sample: Evaluation of Olava Extract (a self-hosted legal-domain Mixture-of-Experts Small Language Model) on structured contract-extraction tasks, compared against five frontier externally hosted models; reported metrics include macro F1 (0.812), micro F1 (0.842), precision, hallucination/unsupported extraction counts, and inference cost reductions (78%–97%); dataset described as legal contracts (details on size, splits, annotation process, languages, and diversity not provided in the summary).

Themes: productivity, adoption, human_ai_collab

Generalizability
- Results are specific to structured contract-extraction in the legal domain and may not transfer to other NLP tasks (e.g., open-ended generation, summarization, litigation prediction).
- Performance may depend on the particular contracts, contract clauses, jurisdictions, and languages represented in the (possibly proprietary) test set.
- Cost estimates depend on deployment choices (hardware, batching, latency requirements, volume) and may not generalize to other infrastructure or usage patterns.
- Comparative outcomes may change with different prompt engineering, retrieval augmentation, or tuning applied to frontier models.
- Self-hosted MoE architectures and access to domain training data may not be available to all organizations, limiting external applicability.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details	
Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842.	Output Quality	positive	high	F1 score (macro and micro)	macro F1 of 0.812; micro F1 of 0.842	0.18
Olava Extract reduced inference cost by 78% to 97% compared with the frontier models tested.	Organizational Efficiency	positive	high	inference cost	78% to 97% reduction in inference cost	0.18
Olava Extract achieved the highest precision scores, producing fewer hallucinated and unsupported extractions.	Output Quality	positive	medium	precision score and hallucination (unsupported extraction) rate		0.11
Fewer hallucinations and unsupported extractions reduce operational risk and downstream review burden in legal workflows.	Organizational Efficiency	positive	medium	operational risk and downstream review burden		0.02
High-performing, human-comparable legal AI no longer requires the largest externally hosted models.	Adoption Rate	positive	medium	requirement of large externally hosted models for high-performing legal AI (implication for adoption/architecture)		0.02
The study tested Olava Extract against five frontier models.	Other	null_result	high	number of comparator models	n=5	0.18
The findings challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.	Market Structure	negative	medium	relationship between model size/hosting/infrastructure and commercial enterprise AI capability (market structure implication)		0.02