A new foundation model for tabular data (Schema-1) learns tables natively and outperforms current tabular methods on prediction and imputation benchmarks; if its benchmark gains transfer to production datasets, it could materially reduce preprocessing overhead and accelerate adoption of AI across data-driven firms.

Data Language Models: A New Foundation Model Class for Tabular Data

Eda Erol, Giuliano Pezzoli, Ozer Cem Kelahmet · May 07, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Schema-1 is a 140M-parameter 'Data Language Model' trained on over 2.3M tabular datasets that, according to reported benchmarks, outperforms gradient-boosted trees, AutoML, and prior tabular foundation models on row-level prediction, missing-value imputation, and dataset-sector classification without requiring traditional preprocessing.

Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understand tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. On missing value reconstruction it achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions, establishing that structural understanding of a dataset's own distributional geometry is more useful for imputation than world knowledge encoded in language. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.

Summary

Main Finding

The authors introduce the Data Language Model (DLM), a new foundation-model class that natively understands tabular data without serialization or bespoke preprocessing. They present Schema-1, a 140M-parameter DLM trained on >2.3M synthetic and real-world tabular datasets, and report that it (a) removes the need for human domain specification and a preprocessing pipeline, (b) produces dataset-level domain/sector identification as a co-equal inference output, and (c) outperforms gradient-boosted trees, AutoML stacks, tabular foundation models, and serialized-LLM approaches on the evaluated row-level prediction and imputation benchmarks and on a new blind dataset-sector classification task, where Schema-1 achieves 91.4% top-1 accuracy.

Key Points

Definition of a DLM: a foundation model that simultaneously satisfies three conditions:
- Multi-signal native ingestion: jointly encodes column semantics, per-column distributional properties, raw cell values, and missingness.
- Dataset-level contextual inference: returns a structured identification of the dataset’s domain/sector derived solely from the dataset’s structural and distributional properties (no external metadata).
- Metadata-independent operation: removing column names/metadata degrades predictive performance only slightly (Schema-1 retains 98.8% of its predictive performance; ε = 0.0117).
Schema-1: first practical DLM instantiation (140M parameters; trained on >2.3M datasets, synthetic + real).
New benchmark/task: blind dataset sector classification — identify an unseen dataset’s industry sector from raw cells alone; Schema-1: 91.4% top-1 accuracy (a task prior tabular models could not perform).
Empirical wins: Schema-1 ranks first across evaluated benchmarks — row-level prediction, missing-value reconstruction (lower reconstruction error than classical statistical baselines and LLMs on mean performance), and sector classification.
Column-agnostic experiments attribute imputation advantage to structural co-distributional learning rather than semantic column names (i.e., model learns dataset geometry, not just world knowledge encoded in text).
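
The two metadata-independence figures quoted above (98.8% retention and ε = 0.0117) agree if ε is read as the relative AUC drop when column names are removed. That reading is an assumption, since the summary does not say whether ε is relative or absolute. A minimal Python sketch of the arithmetic, with hypothetical function names and illustrative AUC values:

```python
def auc_retention(auc_with_names: float, auc_without_names: float) -> float:
    """Fraction of predictive performance kept after column names are removed."""
    if auc_with_names <= 0:
        raise ValueError("baseline AUC must be positive")
    return auc_without_names / auc_with_names


def relative_drop(auc_with_names: float, auc_without_names: float) -> float:
    """Relative AUC drop; this is the ASSUMED meaning of epsilon."""
    return 1.0 - auc_retention(auc_with_names, auc_without_names)


# Reported value from the summary, treated here as a relative drop (assumption).
EPSILON = 0.0117
implied_retention = 1.0 - EPSILON  # about 0.9883, i.e. the reported 98.8%

# Hypothetical AUCs, chosen only to illustrate the calculation.
example_drop = relative_drop(0.900, 0.890)
```

Under this reading, a model passes the metadata-independence condition when `relative_drop(...)` stays at or below ε.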

Data & Methods

Model and training data:
- Schema-1: 140M parameters.
- Training corpus: >2.3M tabular datasets (mixture of synthetic and real-world tables).
- Training objectives and exact loss functions are not detailed in the excerpt, but evaluation tasks include supervised prediction, imputation, and sector classification as primary model outputs.
Input encoding / operating contract:
- Joint multi-signal encoder ingests raw cell values, column identifiers (if available), per-column distributional summaries, and missing-value patterns into a unified representation.
- Produces both numerical predictions (row-level outputs) and dataset-level structured domain/sector identification in the same inference pass.
- Explicitly robust to missing/absent metadata: removing semantic column names yields only small loss in predictive performance (measured AUC retention: 98.8%).
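
To make the four input signals concrete, the sketch below computes a simple per-column profile (optional name, missing rate, distinct count, and basic distributional summaries) from raw cells. It only illustrates the kind of information listed above and is not Schema-1's encoder. The missing-value markers, function names, and example table are all assumptions.

```python
import math
from statistics import mean, pstdev

MISSING = (None, "", "NA", "NaN")  # assumed missing-value markers


def is_missing(cell):
    return cell in MISSING or (isinstance(cell, float) and math.isnan(cell))


def to_number(cell):
    """Return the cell as a float if it parses as a finite number, else None."""
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def profile_column(name, cells):
    """Summarise one column: name (may be None), missingness, distribution."""
    present = [c for c in cells if not is_missing(c)]
    numeric = [n for n in map(to_number, present) if n is not None]
    profile = {
        "name": name,  # None when column metadata is absent
        "missing_rate": 1 - len(present) / len(cells) if cells else 0.0,
        "distinct": len(set(map(str, present))),
        "is_numeric": bool(present) and len(numeric) == len(present),
    }
    if profile["is_numeric"]:
        profile["mean"] = mean(numeric)
        profile["std"] = pstdev(numeric)
    return profile


def profile_table(rows, column_names=None):
    """Profile every column of a row-major table of raw cells."""
    n_cols = len(rows[0])
    names = column_names or [None] * n_cols
    return [profile_column(names[j], [r[j] for r in rows]) for j in range(n_cols)]


# Hypothetical raw table with no column names supplied.
rows = [["34", "retail", "120.5"], ["", "retail", "98.0"],
        ["51", "bank", None], ["29", "bank", "101.5"]]
profiles = profile_table(rows)
```

Calling `profile_table(rows)` without `column_names` mirrors the metadata-absent setting: every signal except the name is still available from the cells themselves.
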
Benchmarks and baselines:
- Compared against gradient-boosted trees (XGBoost/LightGBM), AutoML ensembles (e.g., AutoGluon), current tabular foundation models (TabPFN, TabICLv2, ConTextTab), and large language models applied via serialization.
- Tasks: established row-level prediction benchmarks, missing-value reconstruction, and the new blind dataset sector classification benchmark.
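
Missing-value reconstruction is commonly scored by hiding known cells, imputing them, and measuring the error on the hidden cells only. The sketch below shows that loop with column-mean imputation as the classical baseline and RMSE as the error metric, averaged over several missingness rates. The paper's actual protocol, metric, and conditions are not given in the excerpt, so every choice here is an assumption.

```python
import math
import random


def mask_cells(column, fraction, seed=0):
    """Hide a random fraction of cells; return the masked copy and hidden indices."""
    rng = random.Random(seed)
    n_hidden = max(1, round(fraction * len(column)))
    hidden = sorted(rng.sample(range(len(column)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(column)]
    return masked, hidden


def mean_impute(masked):
    """Classical baseline: fill every gap with the mean of the observed cells."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]


def rmse(truth, imputed, hidden):
    """Root-mean-square error over the hidden cells only."""
    return math.sqrt(sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden))


# Hypothetical numeric column and assumed missingness rates ("conditions").
column = [10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 10.5, 12.5, 11.5, 13.5]
conditions = [0.1, 0.3, 0.5]

errors = []
for fraction in conditions:
    masked, hidden = mask_cells(column, fraction, seed=0)
    errors.append(rmse(column, mean_impute(masked), hidden))

mean_error = sum(errors) / len(errors)  # "mean performance across conditions"
```

Any other imputer can be swapped in for `mean_impute` as long as it returns a full column, which keeps the comparison between methods on identical hidden cells.
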
Key empirical results from the paper:
- Schema-1 is first-ranked across all evaluated benchmarks.
- Missing-value reconstruction: lower reconstruction error than classical statistical methods and frontier LLMs (mean performance).
- Blind sector classification: 91.4% top-1 accuracy.
- Metadata-independence quantified: ε = 0.0117 (AUC drop ≤ ε when column semantics removed).
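
For reference, top-1 accuracy counts a dataset as correct only when the single highest-scoring sector matches the true sector. A minimal sketch, assuming the model returns one score per candidate sector for each dataset (sector names and scores below are hypothetical):

```python
def top1_accuracy(score_rows, true_labels):
    """Share of datasets whose highest-scoring sector equals the true sector."""
    if len(score_rows) != len(true_labels) or not true_labels:
        raise ValueError("need one non-empty score dict per true label")
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        predicted = max(scores, key=scores.get)  # argmax over sectors
        hits += predicted == truth
    return hits / len(true_labels)


# Hypothetical per-dataset sector scores.
score_rows = [
    {"finance": 0.7, "healthcare": 0.2, "logistics": 0.1},
    {"finance": 0.1, "healthcare": 0.8, "logistics": 0.1},
    {"finance": 0.5, "healthcare": 0.3, "logistics": 0.2},
    {"finance": 0.2, "healthcare": 0.2, "logistics": 0.6},
]
true_labels = ["finance", "healthcare", "logistics", "logistics"]

accuracy = top1_accuracy(score_rows, true_labels)  # 3 of 4 correct
```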

Implications for AI Economics

Reduced recurring engineering cost and faster deployment:
- By internalizing domain identification and preprocessing, DLMs could eliminate much of the repeated manual data understanding, cleaning, encoding, and feature-engineering work that currently dominates enterprise ML workflows. If the benchmark results carry over to production data, this would reduce time-to-deployment and ongoing maintenance costs for systems that consume many heterogeneous tabular sources.
Labor and task re-allocation:
- Demand for routine data-preprocessing and feature-engineering labor could decline; demand may shift toward higher-level roles (integration, oversight, domain validation, governance, model auditing).
Competitive dynamics and incumbency advantages:
- Firms that control large, diverse tabular datasets will have an edge in training more capable DLMs, potentially increasing winner-take-most dynamics in enterprise AI foundation models for tabular data.
- Conversely, DLMs lower the engineering barrier for startups and vertical AI entrants by reducing bespoke pipeline requirements, potentially accelerating competition in vertical AI markets.
Productivity and sectoral impacts:
- Sectors highly dependent on tabular data (healthcare, finance, manufacturing, logistics, insurance) stand to capture disproportionate productivity gains from quicker, cheaper, and more autonomous model deployment.
- In high-stakes domains (clinical, financial risk), these gains are tempered by the need for regulatory compliance, interpretability, and safety oversight; misclassification of domain or imputation errors can have outsized economic costs.
Market for tooling and services:
- As preprocessing pipelines become less central, value may shift toward DLM provision, model-monitoring, domain-specific fine-tuning, and compliance/auditing services.
Data-network externalities and privacy:
- The effectiveness of a DLM benefits from scale and diversity of training tables, producing positive network effects that favor large data holders. Privacy, data governance, and competition policy implications follow: dataset pooling or proprietary training corpora could be economically decisive.
Risks and regulatory considerations:
- Automatic domain identification and autonomous action raise governance, fairness, and liability questions. Regulators and firms will need standards for validation, certification, and redress when automatic dataset interpretation drives consequential decisions.
Overall economic takeaway:
- DLMs represent a structural change in the AI stack for enterprise data: by turning tabular data itself into a native modality for foundation models, they can materially lower deployment and maintenance costs, reshape labor demand in data teams, and concentrate economic value around large, diverse dataset holders and DLM providers — while introducing new oversight and regulatory requirements for their safe use.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper presents empirical comparisons on established benchmarks and reports broad improvements (prediction, imputation, dataset classification) after pretraining on a large corpus (>2.3M synthetic and real-world tabular datasets), which lends credibility; however, the abstract does not provide details on benchmark selection, baseline tuning, statistical significance, out-of-sample robustness, or the proportion/representativeness of synthetic versus real datasets, leaving open questions about overfitting to benchmarks and real-world transfer.
Methods Rigor: medium. Training at scale (140M parameters, millions of datasets) and evaluation across multiple tasks suggest substantial methodological effort, but the abstract lacks crucial details (data curation, preprocessing choices despite the claim of 'native' ingestion, evaluation protocols, baseline configurations, ablations, and reproducibility artifacts) needed to rate rigor as high.
Sample: Pretraining dataset of more than 2.3 million tabular datasets (a mix of synthetic and real-world tables); model is Schema-1, a 140 million parameter Data Language Model. Evaluation tasks reported include row-level prediction benchmarks (compared to gradient-boosted ensembles, AutoML stacks, and existing tabular foundation models), missing value (imputation) reconstruction across multiple conditions, and dataset-level industry-sector classification from raw cell values.
Themes: innovation, productivity
Generalizability:
- Mixture of synthetic and real datasets may not reflect the distribution of enterprise or government tabular data (unknown proportions and representativeness).
- Benchmarks used may not capture complex real-world tabular issues (time dependence, panel data, high-cardinality categorical variables, nested or relational schemas).
- Performance on domain-specific and privacy-sensitive tables (medical, financial) is unclear and may require domain adaptation or legal/ethical safeguards.
- Claims about eliminating preprocessing pipelines may not hold for real production systems with schema changes, streaming data, or complex feature engineering needs.
- No evidence provided on causal inference tasks, policy-relevant estimates, or economic outcome prediction at firm/worker level, limiting direct policy or macroeconomic applicability.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Tabular data does not have a foundation model that understands it natively; every approach to tabular AI today (from gradient-boosted trees to the latest tabular foundation models) requires a preprocessing pipeline before any model can consume the data.	Other	negative	high	presence/absence of a native tabular foundation model and the need for preprocessing pipelines	0.03
A Data Language Model (DLM) understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values.	Other	positive	high	ability to consume tabular data directly from raw cell values without preprocessing	0.18
Schema-1 is the first DLM: a 140M parameter model trained on more than 2.3M synthetic and real-world tabular datasets.	Other	positive	high	model size and training data scale (number of datasets)	n=2300000; 140M parameters; >2.3M datasets; 0.3
Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks.	Output Quality	positive	high	row-level prediction performance (benchmark predictive performance)	0.18
On missing value reconstruction, Schema-1 achieves lower reconstruction error than all classical statistical methods and frontier large language models on mean performance across conditions.	Error Rate	positive	high	missing value reconstruction error (imputation error)	0.18
Schema-1 identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform.	Output Quality	positive	medium	accuracy/reliability of industry-sector identification from raw tabular data	0.11
A DLM (Schema-1) eliminates the preprocessing pipelines that currently stand between raw tabular data and AI systems that consume it.	Organizational Efficiency	positive	medium	presence/absence or reduction of preprocessing pipeline steps required before modeling	0.02