A self-hosted, 8B-parameter model for Dream11's CriQ reduces text-to-SQL input size by over 99%, eliminates costly external API calls, and improves execution success to 98.4% and semantic accuracy to 92.5%, outperforming a Gemini Flash 2.0 baseline; the approach offers a practical route to lower-latency, lower-cost production deployment for domain-specific conversational SQL applications.

Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

Chinmay Soni, Shivam Chourasia, Gaurav Kumar, Hitesh Kapoor · March 25, 2026

arxiv descriptive medium evidence 7/10 relevance Source PDF

A domain-specialized, self-hosted 8B model fine-tuned with a two-phase procedure internalizes a database schema to cut input tokens by >99% and deliver higher text-to-SQL execution (98.4%) and semantic accuracy (92.5%) than a prompt-engineered Gemini Flash 2.0 baseline.

Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.

Summary

Main Finding

A novel two-phase supervised fine-tuning recipe enables an 8B-parameter, self-hosted model (Qwen 3 8B + LoRA) to internalize a static database schema so that text-to-SQL generation requires only a tiny prompt (a DB ID + user question). This reduces input tokens by >99% (from ~17k to <100), replaces costly per-token API inference with local serving, and yields higher accuracy (98.4% execution success, 92.5% semantic accuracy) versus a prompt-engineered Gemini Flash baseline (95.6% exec, 89.4% semantic).

Key Points

Problem: Third-party LLMs require huge schema-heavy prompts (thousands of tokens) for reliable text-to-SQL, causing high per-query API costs and latency — impractical at scale for production.
Solution: Two-phase fine-tuning that explicitly teaches (1) the text→SQL mapping in full context and (2) explicit schema memorization associated with a DB ID, then (3) enforces recall/application from a minimal prompt.
Concrete gains:
- Prompt length: ~17k tokens → <100 tokens (>99% reduction).
- Execution success: 98.4% (fine-tuned) vs 95.6% (Gemini prompt-engineered baseline).
- Semantic accuracy: 92.5% vs 89.4%.
Ablations:
- Training with minimal prompts only (no full-context phase) → 74.96% execution success.
- Omitting explicit schema-memorization examples in Phase 1 → drop to 87.2% exec / 79.5% semantic.
- Demonstrates necessity of both phases and explicit memorization for robust recall.
Practical deployment: model self-hosted and served with vLLM (Paged-Attention) to achieve lower latency and higher throughput than API-driven approach.

Data & Methods

Application context: CriQ (cricket statistics conversational assistant) querying a static, richly relational cricket database used by Dream11 ecosystem.
Ground-truth generation:
- Sampled 11k real user queries from logs.
- Used a self-hosted reasoning LLM (Qwen 32B) in an iterative generate-and-correct loop to produce high-quality question→SQL pairs; successful generations saved; failures retried with DB error feedback.
Training datasets & curriculum:
- Phase 1 (schema learning & association)
  - 80k examples: SQL generation using the full 17k-token prompt (schema + context).
  - 15k examples: explicit schema memorization tasks (prompt: “Give full schema for database id = 'cricket stats db v1'”; output: the full schema).
  - Training context length: up to 20,480 tokens.
- Phase 2 (schema recall & application)
  - 150k examples: minimal prompt only (user question + database ID trigger) forcing model to use internalized schema.
  - Reduced context length to 1,024 tokens.
Model & fine-tuning:
- Base model: Qwen 3 8B selected after initial benchmark (Qwen 3 8B had 92.61% initial exec accuracy vs other 7–8B models).
- PEFT using LoRA: rank r=64, alpha=128 applied to q,k,v projections.
- Training infra: 8× NVIDIA H100 80GB GPUs, Distributed Data Parallel.
- Optimizer & schedule: AdamW, lr=2e-5, cosine scheduler.
- Reused LoRA adapters between phases (Phase 1 → Phase 2 continuation).
Evaluation:
- Holdout test set: 30k real-world user queries unseen in training.
- Metrics:
  - Execution Success: % generated SQL that executes without DB errors.
  - Semantic Accuracy: % executed queries that return correct answers (evaluated by an LLM via batch API).
  - Total prompt length (tokens).
Serving: self-hosted inference via vLLM for lower latency and higher throughput; moved cost model from per-query API billing to fixed infra + ops.

Implications for AI Economics

Cost structure transformation:
- Large reduction in per-query variable costs by eliminating token-heavy API calls. For high-volume services, moving from pay-per-token APIs to self-hosted inference can yield substantial ongoing savings despite up-front training/infrastructure costs.
- PEFT (LoRA) reduces fine-tuning compute and storage cost compared to full-model retraining, making specialized, self-hosted models economically feasible for domain-specific tasks.
Latency and user experience:
- Removing large-context prompts reduces inference latency and variability, lowering abandonment and improving UX — important for consumer-scale, interactive features.
Scalability and throughput:
- Self-hosting with efficient serving (vLLM, paged-attention) enables higher RPS and predictable performance under load vs variable API throttling/latency.
Trade-offs and operational costs:
- Up-front engineering and compute: generating ground-truth, running two-phase fine-tuning, and provisioning GPUs is non-trivial; smaller organizations must weigh capex/OPEX vs API expenses.
- Maintenance: approach assumes a mostly static schema. Frequent schema changes require retraining or adapter updates, which incurs recurring costs and operational complexity.
Privacy, compliance, and data control:
- Internalizing schema and running inference on-prem or in controlled cloud reduces data exposure to third-party API providers — economically relevant when regulatory compliance or data sensitivity adds cost to using external services.
Generalizability and market implications:
- The method provides a practical blueprint for other domains with static/high-value schemas (finance, healthcare, enterprise BI). Firms with high query volumes and stable schemas can justify investing in domain-specialized, self-hosted models for superior economics and performance.
Strategic considerations:
- Hybrid approaches remain viable: use external APIs for rapidly-changing or low-volume domains, use schema-internalized self-hosted models where query volume, latency, accuracy, or data sensitivity justify the investment.
- Model size and accuracy trade-offs: larger models improve accuracy but increase hosting costs; PEFT lets teams tune adapters to balance cost vs performance.

Overall, this paper demonstrates a concrete, economically attractive path from costly prompt-heavy API deployments to performant, low-latency, domain-specialized self-hosted models by using a structured two-phase fine-tuning curriculum and PEFT.

Assessment

Paper Typedescriptive Evidence Strengthmedium — The paper provides direct empirical comparisons on text-to-SQL accuracy and execution success between a domain-specialized, self-hosted 8B model and a prompt-engineered Gemini Flash 2.0 baseline, with large reported improvements and clear operational claims (tokens, latency, cost substitution). However, evidence is limited to a proprietary, domain-specific dataset and a single baseline; key details (test set size, diversity, statistical significance, and external replication) are not reported, so the results are persuasive for this deployment but not definitive or broadly causal. Methods Rigormedium — The authors propose a clear two-phase supervised fine-tuning procedure and report relevant performance metrics (execution success, semantic accuracy, token reduction), and they benchmark against a strong closed-source baseline. Missing methodological details weaken rigor: dataset size and split, selection/annotation procedures, hyperparameters and training stability, statistical tests/uncertainty, ablation studies, and reproducibility artifacts are not described in the summary. SampleProprietary text-to-SQL dataset of user queries and SQL interactions for CriQ (a cricket-stats conversational bot tied to Dream11's platform); model trained via a two-phase supervised fine-tuning pipeline to internalize the full database schema; evaluation against a prompt-engineered baseline using Google's Gemini Flash 2.0 on held-out examples (exact dataset size, annotation process, and train/validation/test splits not reported). Themesadoption productivity GeneralizabilityDomain-specific (cricket statistics) and tied to a single application's database schema, so results may not transfer to other domains or heterogeneous schemas, Proprietary dataset and evaluation details (size, diversity, annotation) are not disclosed, limiting external validation, Comparison uses one prompt-engineered baseline (Gemini Flash 2.0); performance vs other commercial or open models may differ, Model size (8B) and self-hosted infra assumptions may not generalize to firms with different compute/cost constraints, Potential overfitting to a single schema or query distribution; unclear robustness to schema changes or novel question types, Operational trade-offs (training cost, maintenance, security, compliance) and lifecycle costs are not fully quantified

Claims (8)

Claim	Direction	Confidence	Outcome	Details
Reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. Task Completion Time	negative	high	latency and per-token API cost	0.09
We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11 that answers user queries about cricket statistics. Other	positive	high	model specification and deployment	8B-parameter model 0.18
CriQ is a sister app to Dream11, India's largest fantasy sports platform with over 250 million users. Other	neutral	high	user base size	n=250000000 over 250 million users 0.09
A novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. Output Quality	positive	high	need for long-context prompts / model internalization of schema	0.18
This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100. Task Completion Time	positive	high	input token count	over 99%, from a 17k-token baseline to fewer than 100 0.18
The approach replaces costly external API calls with efficient local inference. Organizational Efficiency	positive	high	use of external API calls vs local inference (cost/efficiency implication)	0.18
The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). Output Quality	positive	high	execution success rate; semantic accuracy	98.4% execution success and 92.5% semantic accuracy (vs 95.6% execution, 89.4% semantic accuracy for Gemini Flash 2.0) 0.18
These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments. Adoption Rate	positive	high	feasibility of production-grade text-to-SQL (precision and latency)	0.03