Optimizing models for agent-facing interpretability substantially boosts autonomous data-science agents: evolved regressors are both more accurate and more 'simulatable' by LLMs, and they raise agentic data science (ADS) performance on the BLADE benchmark (Copilot CLI, Claude Code, Codex) by as much as 73%. The gains are demonstrated on tabular tasks using an LLM-graded interpretability metric and may not directly translate to human interpretability or broader real-world workflows.
Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric measures a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.
Summary
Main Finding
AGENTIC-IMODELS is an agentic autoresearch system that uses coding LLMs to evolve scikit-learn–compatible regressors optimized jointly for predictive performance and a new LLM-based “agent interpretability” metric. The evolved models expand the Pareto frontier: they are simultaneously more accurate and more interpretable to LLM agents than many standard baselines, generalize to held-out tests, and materially improve end-to-end agentic data-science performance (BLADE benchmark improvements of ~8%–73% across agents).
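To make the idea of a regressor with an agent-readable string output concrete, here is a minimal sketch. It is not one of the evolved models: `StumpRegressor` is a hypothetical one-split model written in plain Python that only mimics the scikit-learn `fit`/`predict` convention. A genuinely scikit-learn-compatible version would typically subclass `sklearn.base.BaseEstimator` and `RegressorMixin`. The point is that `str(model)` alone is enough to reproduce any prediction.

```python
from statistics import mean


class StumpRegressor:
    """Hypothetical one-split regressor with a plain-text description."""

    def fit(self, X, y):
        best = None  # (sse, feature index, threshold, left mean, right mean)
        for j in range(len(X[0])):
            values = sorted({row[j] for row in X})
            # Candidate thresholds are midpoints between adjacent unique values.
            for lo, hi in zip(values, values[1:]):
                thr = (lo + hi) / 2
                left = [t for row, t in zip(X, y) if row[j] <= thr]
                right = [t for row, t in zip(X, y) if row[j] > thr]
                lm, rm = mean(left), mean(right)
                sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
                if best is None or sse < best[0]:
                    best = (sse, j, thr, lm, rm)
        if best is None:
            # Every feature is constant: fall back to predicting the mean of y.
            m = mean(y)
            best = (0.0, 0, float("inf"), m, m)
        _, self.feature_, self.threshold_, self.left_value_, self.right_value_ = best
        return self

    def predict(self, X):
        return [
            self.left_value_ if row[self.feature_] <= self.threshold_ else self.right_value_
            for row in X
        ]

    def __str__(self):
        # The entire fitted model, written so a reader can simulate it by hand.
        return (
            f"if x{self.feature_} <= {self.threshold_:.3f}: predict {self.left_value_:.3f}\n"
            f"else: predict {self.right_value_:.3f}"
        )


if __name__ == "__main__":
    X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
    y = [1.0, 1.0, 3.0, 3.0]
    print(StumpRegressor().fit(X, y))
```

An LLM shown only the printed rule can answer point-prediction or counterfactual questions about this model, which is the property the interpretability metric is described as probing.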
Key Points
- New interpretability metric: Agent interpretability score = pass rate on LLM-graded “simulatability” tests. Tests ask an LLM to answer quantitative questions (predictions, feature attributions, counterfactuals, etc.) using only the model’s str output.
- Test suite: 200 synthetic LLM-graded tests grouped into six categories (feature attribution, point simulation, sensitivity, counterfactuals, structural, complex-function simulation). Split into 43 dev tests and 157 held-out tests.
- Autoresearch loop: a coding agent (Claude Code or Codex) iteratively edits a single Python model class (interpretable_regressor.py), runs predictive and interpretability evaluations, and refines code. Prompts encourage creativity and continuation without human intervention.
- Evaluation metrics: predictive performance measured by average normalized rank of RMSE across 65 development tabular regression datasets (OpenML/PMLB); agent interpretability measured by pass rate on held-out LLM tests (GPT-4o used as the grader).
- Results:
  - Evolved models occupy previously empty regions of lower error and higher agent-interpretability relative to 16 baselines (linear, tree, additive, rule, black-box families).
  - Example evolved models: HingeEBM (rank ~0.19, interpretability ~0.71) and TeacherStudentRuleSpline (rank ~0.36, interpretability ~0.80).
  - Downstream: integrating a curated package of 10 evolved regressors into ADS agents improved BLADE end-to-end scores by 8%–73% (across Copilot CLI, Claude Code, Codex) versus standard interpretability tools.
- Practicalities & reproducibility: code and the final library are released on GitHub; experiments used ~70M generated tokens across runs (the authors note that the cost of replication is modest with current subscriptions).
- Limitations and risks noted by authors: current instantiation focuses on tabular regression; relies on LLM evaluators (possible gaming / overfitting to grader); evolution constrained to textual str representations; potential for learned representations to optimize for the evaluator rather than robust agent reasoning.
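The agent interpretability score described in the key points (a pass rate over LLM-graded tests, each checked against ground truth within a numerical tolerance) can be sketched as follows. Everything here is an illustrative assumption rather than the authors' code: the `SimulatabilityTest` container, the `ask_llm` callable standing in for the grader model, and the 10% relative tolerance are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SimulatabilityTest:
    """Hypothetical test: one quantitative question with a known answer."""
    question: str
    truth: float
    rel_tol: float = 0.10  # assumed tolerance; the source does not state a value


def is_pass(answer: float, truth: float, rel_tol: float, abs_tol: float = 1e-6) -> bool:
    # Accept answers within a relative band around the truth (absolute near zero).
    return abs(answer - truth) <= max(rel_tol * abs(truth), abs_tol)


def agent_interpretability_score(
    model_str: str,
    tests: Sequence[SimulatabilityTest],
    ask_llm: Callable[[str, str], str],
) -> float:
    """Fraction of tests the LLM answers correctly from the model string alone."""
    if not tests:
        raise ValueError("at least one test is required")
    passed = 0
    for test in tests:
        reply = ask_llm(model_str, test.question)  # the LLM sees only the string
        try:
            answer = float(reply)
        except (TypeError, ValueError):
            continue  # an unparseable reply counts as a failed test
        if is_pass(answer, test.truth, test.rel_tol):
            passed += 1
    return passed / len(tests)
```

In use, `ask_llm` would wrap a call to whichever model serves as the reader, and `model_str` would be `str(fitted_model)`. Swapping in several different readers is one way to check the grader-dependence that the authors list as a risk.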
Data & Methods
- Datasets:
  - Development: 65 regression datasets (all OpenML TabArena regression datasets + PMLB regression datasets, excluding duplicates). Preprocessed: 80/20 train/test, max 1,000 samples and 50 features, normalized outcomes, ordinal-encode categoricals, median imputation.
  - Held-out: 16 OpenML regression datasets not overlapping with development set.
- Predictive evaluation:
  - Fit each model per dataset (with possible CV hyperparameter selection), compute test RMSE, rank models per dataset, average ranks across datasets and normalize to [0,1] (lower = better).
- Agent interpretability evaluation:
  - 200 LLM-graded tests; each test generates synthetic data from known functions, fits the model, exposes only the model’s str to the LLM, queries quantitative properties, then grades responses with numerical tolerance against ground truth.
  - Categories & counts: Feature attribution (32), Point simulation (43), Sensitivity (32), Counterfactuals (28), Structural (28), Complex function simulation (37).
  - Grader: GPT-4o used for most evaluations.
- Autoresearch configuration:
  - Coding agents: Claude Code (Opus variants) and Codex (GPT-5.3), at multiple reasoning-effort settings; runs produced dozens–100+ working model variants.
  - Constraints: modify a single file, encouraged to build or substantially alter models (not merely tune known packages).
  - Logging: metrics and metadata persisted to CSV; baseline models evaluated and recorded before evolution starts.
- Baselines compared: OLS, Ridge, Lasso, Decision Trees (various sizes), HSTree, PyGAM, EBM, FIGS, RuleFit, RandomForest, GBM, MLP, TabPFN, etc.
- Downstream ADS evaluation:
  - Agents: GitHub Copilot CLI, Claude Code, Codex on the BLADE benchmark (13 datasets/questions with gold analyses).
  - Conditions: (1) standard tools, (2) AGENTIC-IMODELS library available, (3–4) control pointers to imodels/interpretML packages.
  - Scoring: analyses scored against gold-standard on correctness, completeness, clarity (1–10) by GPT-4o; repeated runs/judgments produce robust comparisons.
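The predictive metric above (rank models by test RMSE within each dataset, average the ranks across datasets, normalize to [0,1]) can be illustrated with a short sketch. The source does not give the exact normalization or tie handling, so two assumptions are made here: ties receive the average of the ranks they span, and a mean rank r over n models is mapped to (r - 1) / (n - 1). The function names and the input layout are hypothetical.

```python
from typing import Dict, Mapping


def ranks_with_ties(rmse: Mapping[str, float]) -> Dict[str, float]:
    """Rank models by RMSE (1 = best); tied models share the average rank."""
    ordered = sorted(rmse, key=rmse.get)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the whole run of models tied with ordered[i].
        while j + 1 < len(ordered) and rmse[ordered[j + 1]] == rmse[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def average_normalized_rank(
    rmse_by_dataset: Mapping[str, Mapping[str, float]]
) -> Dict[str, float]:
    """Map each model to its mean rank across datasets, scaled to [0, 1]."""
    models = list(next(iter(rmse_by_dataset.values())))
    if len(models) < 2:
        raise ValueError("need at least two models to rank")
    totals = {m: 0.0 for m in models}
    for scores in rmse_by_dataset.values():
        per_dataset = ranks_with_ties(scores)
        for m in models:
            totals[m] += per_dataset[m]
    n_datasets = len(rmse_by_dataset)
    # Assumed normalization: 0 = always best, 1 = always worst.
    return {m: (totals[m] / n_datasets - 1) / (len(models) - 1) for m in models}
```

Because ranks are computed per dataset, the score is insensitive to the scale of each dataset's RMSE, so a single dataset with large errors cannot dominate the average.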
Implications for AI Economics
- Productivity and labor substitution:
  - Tooling optimized for agents can substantially raise autonomous ADS performance. If agent-interpretable tools proliferate, routine data-science tasks may be increasingly automated, shifting demand away from low-to-medium-skill data-workers toward roles in oversight, domain specification, and higher-level modeling.
  - Gains in agent productivity are asymmetric: agents that can access specialized agent-friendly tools capture outsized efficiency gains relative to agents relying on human-oriented tools; this can accelerate substitution where firms adopt agentic pipelines.
- Market for agentic tools and differentiation:
  - Agent-interpretability becomes a product feature and potential market dimension. Vendors may compete on quantifiable agent-interpretability scores (analogous to latency/accuracy metrics). A new niche (libraries, model classes, and formats designed to be simulatable by LLMs) may emerge.
  - Lock-in risks: firms providing ecosystems (models + agent integrations) that standardize on certain agent-interpretable formats could create switching costs and platform dependence.
- Capital and compute dynamics:
  - The autoresearch approach uses LLM compute to design tools that then reduce downstream agent compute or human labor. Up-front compute cost (tokens, LLM calls) is modest relative to long-term gains, which suggests a favorable ROI for R&D investments that build reusable agent skills/libraries.
  - Concentration risks: organizations with access to powerful coding LLMs and compute can create superior agentic toolsets, reinforcing incumbent advantage.
- Measurement, standards, and auditing:
  - The LLM-graded simulatability tests offer a practical, automated way to benchmark agent-facing interpretability. Economists and policymakers could adopt such metrics to audit agent pipelines, certify tools for regulatory compliance, or as procurement criteria.
  - However, reliance on LLM-based graders introduces the possibility of “metric gaming” (models optimized to appear simulatable to the grader but fail in other agent contexts). Standards bodies or third-party auditors will be important to validate robustness across graders and agent architectures.
- Policy and governance:
  - As agents take over more scientific and oversight tasks, ensuring truthful, auditable decision-making is critical. Agent-interpretable models could improve transparency for automated systems, aiding regulatory review, but regulators must be aware of the grader-dependence and potential for deceptive representations.
  - Labor-market policy: displacement risks for data-analytic roles suggest investment in reskilling toward agent oversight, prompt engineering, and domain-focused interpretation.
- Research and investment directions:
  - Investment in standardized agent-interpretability benchmarks (diverse graders and agents), transferability across tasks (classification, time-series, causal inference), and defenses against grader-overfitting.
  - Commercial opportunity in curated libraries of agentic skills/models, tooling to translate human-oriented interpretability outputs into agent-friendly representations, and auditing services that validate agent-facing interpretability claims.
Overall, AGENTIC-IMODELS demonstrates that LLM-driven autoresearch can yield models and representations that materially improve autonomous data-science agents. For AI economics, this implies accelerating productivity gains in analytic work, the emergence of new product dimensions (agent-interpretability), potential market concentration around agentic tooling, and the need for measurement standards and governance to manage economic and social impacts.
Assessment
Claims (7)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| We introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. | Other | positive | high | existence and implementation of Agentic-imodels (a system-level contribution) | 0.18 |
| Agentic-imodels develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. | Developer Productivity | positive | high | availability and optimization of a library of regressors (predictive performance and LLM-based interpretability metric) | 0.18 |
| We introduce a novel LLM-based interpretability metric that measures a suite of LLM-graded tests probing whether a fitted model's string representation is 'simulatable' by an LLM (i.e., whether the LLM can answer questions about the model's behavior by reading its string output alone). | AI Safety and Ethics | positive | high | agent-facing interpretability as measured by LLM-graded simulatable tests | 0.18 |
| The evolved models jointly improve predictive performance. | Output Quality | positive | high | predictive performance (e.g., prediction accuracy or other predictive metrics) | 0.18 |
| The evolved models jointly improve agent-facing interpretability (as measured by the LLM-based metric) and generalize to new interpretability tests. | AI Safety and Ethics | positive | high | agent-facing interpretability (LLM-graded simulatable test performance) | 0.18 |
| The evolved models generalize to new datasets. | Output Quality | positive | high | generalization of model performance to new datasets | 0.18 |
| These evolved models improve downstream end-to-end agentic data-science (ADS) performance, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%. | Developer Productivity | positive | high | downstream ADS performance on the BLADE benchmark (measured as benchmark performance increase) | up to 73% increase; 0.18 |