A self‑hosted legal small language model matches frontier LLMs on contract extraction and slashes inference costs by up to 97%, reducing hallucinations that drive review burden.
This paper evaluates whether a domain-trained Small Language Model (SLM) can outperform frontier Large Language Models on structured contract extraction at radically lower cost. We test Olava Extract, a self-hosted legal-domain Mixture of Experts model, against five frontier models. Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842, while reducing inference cost by 78% to 97% compared with the frontier models tested. It also achieved the highest precision scores, producing fewer hallucinated and unsupported extractions, an important distinction in legal workflows where hallucinations create operational risk and downstream review burden. The findings show that high-performing, human-comparable legal AI no longer requires the largest externally hosted models. More broadly, they challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers.
Summary
Main Finding
A domain-trained Small Language Model (Olava Extract — a self-hosted Mixture-of-Experts SLM fine-tuned for contracts) matched or slightly outperformed five commercially available frontier LLMs on a structured contract-extraction task while operating at a small fraction of their inference cost. Aggregate results: micro F1 = 0.842, macro F1 = 0.812; Olava also achieved the highest precision (micro precision 0.812, macro precision 0.780) and produced fewer unsupported/hallucinated extractions. Under batched inference, extraction cost was 78%–97% lower than the frontier models tested.
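To make the two aggregates concrete, here is a minimal Python sketch of how micro and macro F1 are conventionally computed from per-field counts. The field names and counts are invented for illustration and are not taken from the paper, and the paper's own scoring code may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class FieldCounts:
    """True-positive, false-positive, and false-negative counts for one field."""
    tp: int
    fp: int
    fn: int


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(counts: dict[str, FieldCounts]) -> float:
    """Equal weight per field: average the per-field F1 scores."""
    scores = [prf(c.tp, c.fp, c.fn)[2] for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts: dict[str, FieldCounts]) -> float:
    """Pooled instances: sum the counts across fields, then compute a single F1."""
    tp = sum(c.tp for c in counts.values())
    fp = sum(c.fp for c in counts.values())
    fn = sum(c.fn for c in counts.values())
    return prf(tp, fp, fn)[2]


# Hypothetical counts (NOT from the paper): a frequent, easy field and a rare, hard one.
example = {
    "governing_law": FieldCounts(tp=18, fp=2, fn=2),
    "liability_cap_amount": FieldCounts(tp=2, fp=2, fn=2),
}
print(f"macro F1 = {macro_f1(example):.3f}")
print(f"micro F1 = {micro_f1(example):.3f}")
```

In this toy example micro F1 comes out higher than macro F1 because the weaker field contributes fewer instances to the pooled counts, while macro F1 weights both fields equally.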
Key Points
- Task and scope
  - Full-document structured extraction of 26 target fields (clauses, dates, durations, currency values, boolean/categorical picks, short text identifiers).
  - Evaluation: 24 held-out public SEC EDGAR contracts, 508 human-labelled field instances annotated by lawyers (each contract reviewed by ≥2 lawyers; disagreements adjudicated).
  - Output required both a normalized display answer and supporting verbatim contract span(s).
- Models compared
  - Olava Extract (domain-trained SLM, MoE-style active footprint) vs Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, GPT-5.4.
  - Frontier baselines evaluated zero-shot with the same system/user prompts and structured output spec.
- Performance highlights
  - Olava Extract ranked highest on both micro and macro F1, though margins over top frontier models were small.
  - Strongest category performance: short-text identifiers and extracted text concepts; weakest: currency fields (aggregation/normalization issues).
  - Olava showed a precision-oriented profile (fewer hallucinations/unsupported answers), valuable for legal workflows where false positives impose review burden and risk.
- Cost & speed
  - Olava inference run on 2 H200 SXM GPUs (vLLM), priced at $4.01/H200-hour for cost estimates.
  - Batched inference cost reductions relative to frontier APIs: 78%–97%.
  - Two self-hosted cost figures reported: parallel-batched (wall-clock) and unbatched-serial (sum of per-document runtime).
- Evaluation protocol specifics that affect interpretation
  - No chunking/retrieval/vector-store augmentation: the full contract was inserted into one model call (max context 262,144 tokens for Olava).
  - Field-specific matching rules (no date/currency normalization): scoring used display-answer and/or span overlap per field type.
  - Training labels were synthetically generated by a frontier LLM and filtered with an LLM-as-judge pipeline (fuzzy-span matching).
  - Fine-tuning: LoRA (rank 32), one epoch, bfloat16; training labels = 89,517, validation = 5,453.
  - No statistical significance testing reported.
- Limitations noted in paper
  - Small held-out evaluation (24 contracts); results may not generalize across broader contract types or private client documents.
  - Training labels synthetic (quality depends on judge-filtering); frontier baselines were zero-shot (no few-shot examples).
  - Cost/latency comparisons depend on specific hardware, batching strategy, API pricing, and observed token use; they are not claimed as universal.
  - No normalization of dates/currency at scoring time may affect numeric/date fields.
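The cost comparison in the points above reduces to simple arithmetic, sketched below in Python. The GPU count and the $4.01 per H200-hour price are the figures stated above; the runtimes, token counts, and API rates are placeholder values chosen for illustration and are not taken from the paper.

```python
GPU_COUNT = 2               # stated above: 2 H200 SXM GPUs
PRICE_PER_GPU_HOUR = 4.01   # stated above: USD per H200-hour


def self_hosted_cost(runtime_seconds: float) -> float:
    """USD cost of occupying all GPUs for the given runtime."""
    return GPU_COUNT * PRICE_PER_GPU_HOUR * runtime_seconds / 3600


def api_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float, usd_per_m_output: float) -> float:
    """USD cost from observed token usage times per-million-token rates."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def reduction(self_hosted: float, api: float) -> float:
    """Fractional cost reduction of self-hosting relative to the API."""
    return 1 - self_hosted / api


# Hypothetical inputs (NOT from the paper).
per_doc_seconds = [60.0, 90.0, 120.0]  # runtimes when documents run one at a time
batch_wall_clock_seconds = 150.0        # wall-clock time when they run in parallel

serial = self_hosted_cost(sum(per_doc_seconds))       # unbatched-serial estimate
batched = self_hosted_cost(batch_wall_clock_seconds)  # parallel-batched estimate
api_total = api_cost(input_tokens=150_000, output_tokens=12_000,
                     usd_per_m_input=3.0, usd_per_m_output=15.0)

print(f"serial ${serial:.3f}, batched ${batched:.3f}, API ${api_total:.3f}")
print(f"batched reduction vs API: {reduction(batched, api_total):.0%}")
```

The two self-hosted figures differ only in which runtime is billed: the serial estimate sums per-document runtimes, while the batched estimate bills the shared wall-clock time once.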
Data & Methods
- Data sources
  - Model development and evaluation used public SEC EDGAR exhibits only. The training pool was limited to documents of 10k–100k tokens with at least 22 of the 26 target fields found on a preliminary pass.
  - Training/validation split: 89,517 training labels / 5,453 validation labels (stratified by contract type; held-out evaluation contracts excluded by ID).
- Labeling
  - Training labels synthesized using a frontier LLM, then filtered and scored by an LLM-as-judge panel; cited spans fuzzy-matched back to sources to produce display answers + supporting spans.
  - Held-out evaluation labels created by legal professionals in Label Studio (dual review + senior adjudication).
- Model development
  - Olava Extract: domain-adapted SLM implemented as MoE with an active inference footprint comparable to smaller dense models.
  - Fine-tuning: LoRA parameter-efficient tuning, bfloat16, single epoch; best checkpoint chosen by validation loss.
- Inference & evaluation
  - Single-turn, full-document extraction: all 26 fields extracted in one call per contract, same prompts and expected structured schema for all models.
  - Matching rules: data-type-specific (span overlap for clause fields, substring/span checks for durations/dates/currency, exact equality for closed-set categories and short-text identifiers).
  - Metrics: per-field precision/recall/F1, aggregated as macro (equal-weight per field) and micro (pooled instances). No CI or significance tests.
  - Cost model: Olava on 2 H200 GPUs; frontier model cost from observed token usage × published API rates. Reported batched and serial self-hosted cost estimates.
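The data-type-specific matching rules described above can be pictured as a dispatch on field type. The sketch below is illustrative only: the field-type labels, the token-overlap measure, and the 0.5 threshold are assumptions, since this summary does not say how span overlap was measured or thresholded.

```python
def token_overlap(predicted: str, gold: str) -> float:
    """Fraction of gold tokens also present in the prediction (assumed overlap measure)."""
    gold_tokens = gold.lower().split()
    pred_tokens = set(predicted.lower().split())
    if not gold_tokens:
        return 0.0
    return sum(t in pred_tokens for t in gold_tokens) / len(gold_tokens)


def is_match(field_type: str, predicted: str, gold: str,
             threshold: float = 0.5) -> bool:
    """Apply a matching rule chosen by field type, with no date/currency normalization."""
    pred, ref = predicted.strip(), gold.strip()
    if field_type == "clause":
        # Clause fields: span overlap at or above an assumed threshold.
        return token_overlap(pred, ref) >= threshold
    if field_type in {"duration", "date", "currency"}:
        # Substring check in either direction on the raw, unnormalized strings.
        p, r = pred.lower(), ref.lower()
        return bool(p) and bool(r) and (p in r or r in p)
    if field_type in {"category", "identifier"}:
        # Closed-set categories and short-text identifiers: exact equality.
        return pred == ref
    raise ValueError(f"unknown field type: {field_type}")


print(is_match("date", "January 1, 2024", "effective as of January 1, 2024"))
print(is_match("date", "2024-01-01", "January 1, 2024"))
```

The second call shows why the lack of normalization matters: two renderings of the same date do not match under a raw substring rule, so a semantically correct answer can be scored as an error.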
Implications for AI Economics
- Enterprise AI can decouple capability from sheer model scale
  - Domain-specialized, parameter-efficient SLMs (including MoE designs) can achieve frontier-level performance on a commercially meaningful task at much lower inference costs and with self-hosting benefits (data control, versioning, deployment stability).
  - This undermines the assumption that best-in-class enterprise AI must be delivered via the largest centrally hosted LLMs; economic value can accrue from targeted, smaller models tailored to domain needs.
- Cost structure and total-cost-of-ownership (TCO) trade-offs
  - Large reductions in per-document inference cost (78%–97% in batched settings here) imply substantial savings at scale versus API-hosted frontier models, potentially changing the ROI calculus for automating legal review workflows.
  - However, enterprises must weigh hardware + ops + maintenance + model-update costs against API fees; self-hosting shifts costs from variable per-token fees to capital & operational expenses and engineering support.
- Labor-market and process impacts
  - Higher-throughput, lower-cost contract extraction raises economic pressure on labor models that rely on junior lawyers or LPOs for high-volume extraction/review tasks, accelerating substitution and workflow redesign.
  - The precision-first behavior of the domain SLM (fewer hallucinations) reduces downstream human verification burden and risk, improving operational viability in regulated or high-liability settings.
- Market structure and vendor dynamics
  - Demand may grow for specialist domain models and fine-tuning/ops vendors rather than pure-play API LLM providers for certain enterprise use cases (legal, healthcare, finance).
  - Cloud/API providers will still be attractive for rapid experimentation and edge cases; hybrid models (domain SLMs for core tasks + frontier LLMs for exploratory/edge reasoning) are plausible.
- Caveats for economic decision-making
  - Results are task- and dataset-specific. Small evaluation size, synthetic training labels, and zero-shot frontier baselines mean buyers should require larger-scale, replicated evaluations and production pilots before committing fully to a self-hosted SLM replacement.
  - Legal/regulatory risk, auditing, and long-term maintenance (retraining, label drift, model governance) add recurring costs not fully captured by per-document inference comparisons.
  - Domain SLMs may underperform outside the trained contract distributions; enterprises must assess coverage and failure modes carefully.
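The shift from variable per-token fees to fixed operating costs can be framed as a break-even volume. The sketch below assumes a deliberately simple linear cost model, and every figure in it is hypothetical rather than drawn from the paper.

```python
import math


def breakeven_documents(fixed_monthly_cost: float,
                        api_cost_per_doc: float,
                        self_hosted_cost_per_doc: float) -> float:
    """Monthly document volume above which self-hosting is cheaper than the API.

    Assumes self-hosting costs a fixed amount plus a per-document cost, while
    the API costs only a per-document fee. Returns math.inf if the per-document
    saving is zero or negative, because self-hosting then never breaks even.
    """
    saving_per_doc = api_cost_per_doc - self_hosted_cost_per_doc
    if saving_per_doc <= 0:
        return math.inf
    return fixed_monthly_cost / saving_per_doc


# Hypothetical figures (NOT from the paper).
volume = breakeven_documents(
    fixed_monthly_cost=6_000.0,     # engineering support, monitoring, reserved hardware
    api_cost_per_doc=0.50,
    self_hosted_cost_per_doc=0.10,  # an 80% per-document reduction
)
print(f"break-even at about {volume:,.0f} documents per month")
```

Under this model, a large percentage reduction in per-document cost only pays off once monthly volume exceeds the break-even point, which is why the fixed costs listed above belong in any comparison.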
Suggested practical next steps for decision-makers
- Run a production pilot/A-B test on in-scope contract types to measure real-world accuracy, human review time saved, and end-to-end TCO.
- Expand evaluation dataset to a broader set of contract types (including private contracts) and perform statistical testing and error analysis, especially on low-performing fields (currency aggregation, some date fields).
- Model governance: set verification workflows for low-confidence fields and implement logging/auditing for legal compliance.
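Because the paper reports no confidence intervals, one common way to add the statistical testing suggested above is a percentile bootstrap that resamples whole contracts. This is a sketch of that general technique, not the paper's method, and the per-contract counts are invented for illustration.

```python
import random


def micro_f1(counts: list[tuple[int, int, int]]) -> float:
    """Micro F1 from pooled (tp, fp, fn) counts; 0.0 if there is nothing to score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts: list[tuple[int, int, int]], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for micro F1, resampling contracts with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        micro_f1(rng.choices(counts, k=len(counts))) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical (tp, fp, fn) counts for 24 contracts (NOT from the paper).
per_contract = [(18, 3, 4)] * 12 + [(15, 5, 6)] * 12
low, high = bootstrap_ci(per_contract)
print(f"micro F1 = {micro_f1(per_contract):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling at the contract level rather than the field level keeps correlated fields from the same document together, which matters when the evaluation set has only 24 contracts.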
Assessment
Claims (7)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Olava Extract achieved the strongest aggregate performance in the study, with a macro F1 of 0.812 and a micro F1 of 0.842. | Output Quality | positive | high | F1 score (macro and micro) | macro F1 of 0.812; micro F1 of 0.842<br>0.18 |
| Olava Extract reduced inference cost by 78% to 97% compared with the frontier models tested. | Organizational Efficiency | positive | high | inference cost | 78% to 97% reduction in inference cost<br>0.18 |
| Olava Extract achieved the highest precision scores, producing fewer hallucinated and unsupported extractions. | Output Quality | positive | medium | precision score and hallucination (unsupported extraction) rate | 0.11 |
| Fewer hallucinations and unsupported extractions reduce operational risk and downstream review burden in legal workflows. | Organizational Efficiency | positive | medium | operational risk and downstream review burden | 0.02 |
| High-performing, human-comparable legal AI no longer requires the largest externally hosted models. | Adoption Rate | positive | medium | requirement of large externally hosted models for high-performing legal AI (implication for adoption/architecture) | 0.02 |
| The study tested Olava Extract against five frontier models. | Other | null_result | high | number of comparator models | n=5<br>0.18 |
| The findings challenge the assumption that commercially valuable enterprise AI capability must remain tied to ever larger models, massive infrastructure expenditure, and centrally hosted providers. | Market Structure | negative | medium | relationship between model size/hosting/infrastructure and commercial enterprise AI capability (market structure implication) | 0.02 |