A self-hosted, 8B-parameter model for Dream11's CriQ reduces text-to-SQL input size by over 99%, eliminates costly external API calls, and improves execution success to 98.4% and semantic accuracy to 92.5%, outperforming a Gemini Flash 2.0 baseline; the approach offers a practical route to lower-latency, lower-cost production deployment for domain-specific conversational SQL applications.
Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
Summary
Main Finding
A novel two-phase supervised fine-tuning recipe enables an 8B-parameter, self-hosted model (Qwen 3 8B + LoRA) to internalize a static database schema so that text-to-SQL generation requires only a tiny prompt (a DB ID + user question). This reduces input tokens by >99% (from ~17k to <100), replaces costly per-token API inference with local serving, and yields higher accuracy (98.4% execution success, 92.5% semantic accuracy) versus a prompt-engineered Gemini Flash baseline (95.6% exec, 89.4% semantic).
Key Points
- Problem: Third-party LLMs require huge schema-heavy prompts (thousands of tokens) for reliable text-to-SQL, causing high per-query API costs and latency — impractical at scale for production.
- Solution: Two-phase fine-tuning that explicitly teaches (1) the text→SQL mapping in full context and (2) explicit schema memorization associated with a DB ID, then (3) enforces recall/application from a minimal prompt.
- Concrete gains:
- Prompt length: ~17k tokens → <100 tokens (>99% reduction).
- Execution success: 98.4% (fine-tuned) vs 95.6% (Gemini prompt-engineered baseline).
- Semantic accuracy: 92.5% vs 89.4%.
- Ablations:
- Training with minimal prompts only (no full-context phase) → 74.96% execution success.
- Omitting explicit schema-memorization examples in Phase 1 → drop to 87.2% exec / 79.5% semantic.
- Demonstrates necessity of both phases and explicit memorization for robust recall.
- Practical deployment: model self-hosted and served with vLLM (Paged-Attention) to achieve lower latency and higher throughput than API-driven approach.
Data & Methods
- Application context: CriQ (cricket statistics conversational assistant) querying a static, richly relational cricket database used by Dream11 ecosystem.
- Ground-truth generation:
- Sampled 11k real user queries from logs.
- Used a self-hosted reasoning LLM (Qwen 32B) in an iterative generate-and-correct loop to produce high-quality question→SQL pairs; successful generations saved; failures retried with DB error feedback.
- Training datasets & curriculum:
- Phase 1 (schema learning & association)
- 80k examples: SQL generation using the full 17k-token prompt (schema + context).
- 15k examples: explicit schema memorization tasks (prompt: “Give full schema for database id = 'cricket stats db v1'”; output: the full schema).
- Training context length: up to 20,480 tokens.
- Phase 2 (schema recall & application)
- 150k examples: minimal prompt only (user question + database ID trigger) forcing model to use internalized schema.
- Reduced context length to 1,024 tokens.
- Phase 1 (schema learning & association)
- Model & fine-tuning:
- Base model: Qwen 3 8B selected after initial benchmark (Qwen 3 8B had 92.61% initial exec accuracy vs other 7–8B models).
- PEFT using LoRA: rank r=64, alpha=128 applied to q,k,v projections.
- Training infra: 8× NVIDIA H100 80GB GPUs, Distributed Data Parallel.
- Optimizer & schedule: AdamW, lr=2e-5, cosine scheduler.
- Reused LoRA adapters between phases (Phase 1 → Phase 2 continuation).
- Evaluation:
- Holdout test set: 30k real-world user queries unseen in training.
- Metrics:
- Execution Success: % generated SQL that executes without DB errors.
- Semantic Accuracy: % executed queries that return correct answers (evaluated by an LLM via batch API).
- Total prompt length (tokens).
- Serving: self-hosted inference via vLLM for lower latency and higher throughput; moved cost model from per-query API billing to fixed infra + ops.
Implications for AI Economics
- Cost structure transformation:
- Large reduction in per-query variable costs by eliminating token-heavy API calls. For high-volume services, moving from pay-per-token APIs to self-hosted inference can yield substantial ongoing savings despite up-front training/infrastructure costs.
- PEFT (LoRA) reduces fine-tuning compute and storage cost compared to full-model retraining, making specialized, self-hosted models economically feasible for domain-specific tasks.
- Latency and user experience:
- Removing large-context prompts reduces inference latency and variability, lowering abandonment and improving UX — important for consumer-scale, interactive features.
- Scalability and throughput:
- Self-hosting with efficient serving (vLLM, paged-attention) enables higher RPS and predictable performance under load vs variable API throttling/latency.
- Trade-offs and operational costs:
- Up-front engineering and compute: generating ground-truth, running two-phase fine-tuning, and provisioning GPUs is non-trivial; smaller organizations must weigh capex/OPEX vs API expenses.
- Maintenance: approach assumes a mostly static schema. Frequent schema changes require retraining or adapter updates, which incurs recurring costs and operational complexity.
- Privacy, compliance, and data control:
- Internalizing schema and running inference on-prem or in controlled cloud reduces data exposure to third-party API providers — economically relevant when regulatory compliance or data sensitivity adds cost to using external services.
- Generalizability and market implications:
- The method provides a practical blueprint for other domains with static/high-value schemas (finance, healthcare, enterprise BI). Firms with high query volumes and stable schemas can justify investing in domain-specialized, self-hosted models for superior economics and performance.
- Strategic considerations:
- Hybrid approaches remain viable: use external APIs for rapidly-changing or low-volume domains, use schema-internalized self-hosted models where query volume, latency, accuracy, or data sensitivity justify the investment.
- Model size and accuracy trade-offs: larger models improve accuracy but increase hosting costs; PEFT lets teams tune adapters to balance cost vs performance.
Overall, this paper demonstrates a concrete, economically attractive path from costly prompt-heavy API deployments to performant, low-latency, domain-specialized self-hosted models by using a structured two-phase fine-tuning curriculum and PEFT.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. Task Completion Time | negative | high | latency and per-token API cost |
0.09
|
| We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11 that answers user queries about cricket statistics. Other | positive | high | model specification and deployment |
8B-parameter model
0.18
|
| CriQ is a sister app to Dream11, India's largest fantasy sports platform with over 250 million users. Other | neutral | high | user base size |
n=250000000
over 250 million users
0.09
|
| A novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. Output Quality | positive | high | need for long-context prompts / model internalization of schema |
0.18
|
| This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100. Task Completion Time | positive | high | input token count |
over 99%, from a 17k-token baseline to fewer than 100
0.18
|
| The approach replaces costly external API calls with efficient local inference. Organizational Efficiency | positive | high | use of external API calls vs local inference (cost/efficiency implication) |
0.18
|
| The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). Output Quality | positive | high | execution success rate; semantic accuracy |
98.4% execution success and 92.5% semantic accuracy (vs 95.6% execution, 89.4% semantic accuracy for Gemini Flash 2.0)
0.18
|
| These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments. Adoption Rate | positive | high | feasibility of production-grade text-to-SQL (precision and latency) |
0.03
|