Optimizing models for agent-facing interpretability substantially boosts autonomous data‑science agents: evolved regressors both improve prediction quality and make models more 'simulatable' by LLMs, increasing ADS performance on the BLADE benchmark (Copilot CLI, Claude Code, Codex) by as much as 73%. Gains are demonstrated on tabular tasks using an LLM-graded interpretability metric and may not directly translate to human interpretability or broader real-world workflows.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

Chandan Singh, Yan Shuo Tan, Weijia Xu, Zelalem Gero, Weiwei Yang, Michel Galley, Jianfeng Gao · May 05, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Evolving scikit-learn-compatible regressors for agent-facing interpretability yields models that improve both predictive performance and an LLM-based 'simulatability' metric, and raises end-to-end ADS benchmark performance by up to 73% across several LLM toolchains.

Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable by humans, rather than interpretable by agents. To address this, we introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents. Specifically, it develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric. The metric measures a suite of LLM-graded tests that probe whether a fitted model's string representation is "simulatable" by an LLM, i.e. whether the LLM can answer questions about the model's behavior by reading its string output alone. We find that the evolved models jointly improve predictive performance and agent-facing interpretability, generalizing to new datasets and new interpretability tests. Furthermore, these evolved models improve downstream end-to-end ADS, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.

Summary

Main Finding

AGENTIC-IMODELS is an agentic autoresearch system that uses coding LLMs to evolve scikit-learn–compatible regressors optimized jointly for predictive performance and a new LLM-based “agent interpretability” metric. The evolved models expand the Pareto frontier: they are simultaneously more accurate and more interpretable to LLM agents than many standard baselines, generalize to held-out tests, and materially improve end-to-end agentic data-science performance (BLADE benchmark improvements of ~8%–73% across agents).
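The notion of "expanding the Pareto frontier" can be illustrated with a small sketch. It assumes each model is summarized by two numbers, a normalized RMSE rank (lower is better) and an agent-interpretability pass rate (higher is better). The two evolved-model entries use the approximate values quoted in this summary; the two baselines are hypothetical and invented only for illustration.

```python
from typing import Dict, List, Tuple

# (normalized RMSE rank, agent-interpretability pass rate)
# Lower rank is better; higher interpretability is better.
Scores = Dict[str, Tuple[float, float]]


def pareto_front(models: Scores) -> List[str]:
    """Return the names of models that no other model dominates."""
    front = []
    for name, (rank, interp) in models.items():
        dominated = any(
            other_rank <= rank
            and other_interp >= interp
            and (other_rank < rank or other_interp > interp)
            for other, (other_rank, other_interp) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


models: Scores = {
    # Approximate values quoted in this summary.
    "HingeEBM": (0.19, 0.71),
    "TeacherStudentRuleSpline": (0.36, 0.80),
    # Hypothetical baselines, invented purely for illustration.
    "baseline_a": (0.45, 0.60),
    "baseline_b": (0.15, 0.30),
}

print(pareto_front(models))
```

A model is on the frontier when no other model is at least as good on both axes and strictly better on one. In this toy input, `baseline_a` is dominated by `HingeEBM`, while `baseline_b` stays on the frontier because it has the lowest rank.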

Key Points

New interpretability metric: Agent interpretability score = pass rate on LLM-graded “simulatability” tests. Tests ask an LLM to answer quantitative questions (predictions, feature attributions, counterfactuals, etc.) using only the model’s str output.
Test suite: 200 synthetic LLM-graded tests grouped into six categories (feature attribution, point simulation, sensitivity, counterfactuals, structural, complex-function simulation). Split into 43 dev tests and 157 held-out tests.
Autoresearch loop: a coding agent (Claude Code or Codex) iteratively edits a single Python model class (interpretable_regressor.py), runs predictive and interpretability evaluations, and refines code. Prompts encourage creativity and continuation without human intervention.
Evaluation metrics: predictive performance measured by average normalized rank of RMSE across 65 development tabular regression datasets (OpenML/PMLB); agent interpretability measured by pass rate on held-out LLM tests (GPT-4o used as the grader).
Results:
- Evolved models occupy previously empty regions of lower error and higher agent-interpretability relative to 16 baselines (linear, tree, additive, rule, black-box families).
- Example evolved models: HingeEBM (rank ~0.19, interpretability ~0.71) and TeacherStudentRuleSpline (rank ~0.36, interpretability ~0.80).
- Downstream: integrating a curated package of 10 evolved regressors into ADS agents improved BLADE end-to-end scores by 8%–73% (across Copilot CLI, Claude Code, Codex) versus standard interpretability tools.
Practicalities & reproducibility: code and final library released on GitHub; experiments used ~70M generated tokens across runs (the authors describe the cost of replication as modest with current subscriptions).
Limitations and risks noted by authors: current instantiation focuses on tabular regression; relies on LLM evaluators (possible gaming / overfitting to grader); evolution constrained to textual str representations; potential for learned representations to optimize for the evaluator rather than robust agent reasoning.
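To make the "model's str output" idea from the key points concrete, here is a minimal sketch of a scikit-learn-compatible regressor whose string form spells out the fitted equation. The class name `ReadableLinearRegressor` and its output format are hypothetical; this is not one of the evolved models, only an illustration of the interface an LLM would read.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.linear_model import LinearRegression


class ReadableLinearRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical example: OLS whose str() spells out the fitted equation."""

    def __init__(self, decimals=3):
        self.decimals = decimals

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        self.model_ = LinearRegression().fit(X, y)
        return self

    def predict(self, X):
        return self.model_.predict(np.asarray(X, dtype=float))

    def __str__(self):
        # The string is the only thing the LLM sees in a simulatability test.
        if not hasattr(self, "model_"):
            return "ReadableLinearRegressor(unfitted)"
        terms = [f"{self.model_.intercept_:.{self.decimals}f}"]
        for j, coef in enumerate(self.model_.coef_):
            terms.append(f"{coef:+.{self.decimals}f} * x{j}")
        return "y = " + " ".join(terms)


# Noiseless data from a known function, so the fit recovers it exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
model = ReadableLinearRegressor().fit(X, y)
print(model)
```

Because the printed equation contains every fitted parameter, a reader (human or LLM) can reproduce any prediction by hand, which is the property the simulatability tests probe.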

Data & Methods

Datasets:
- Development: 65 regression datasets (all OpenML TabArena regression datasets + PMLB regression datasets, excluding duplicates). Preprocessed: 80/20 train/test, max 1,000 samples and 50 features, normalized outcomes, ordinal-encode categoricals, median imputation.
- Held-out: 16 OpenML regression datasets not overlapping with development set.
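The preprocessing steps listed for the development datasets could be sketched as follows with pandas and scikit-learn. The function name `preprocess`, the order of the steps, and the choice to fit the imputer and outcome scaling on the training split only are assumptions; the source does not specify these details.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder


def preprocess(df, target, max_samples=1000, max_features=50, seed=0):
    """Assumed ordering of steps; the paper's exact pipeline may differ."""
    # Cap the number of rows and feature columns.
    if len(df) > max_samples:
        df = df.sample(n=max_samples, random_state=seed)
    X = df.drop(columns=[target]).iloc[:, :max_features].copy()
    y = df[target].to_numpy(dtype=float)

    # Ordinal-encode categorical (non-numeric) columns.
    cat_cols = list(X.select_dtypes(exclude="number").columns)
    if cat_cols:
        X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])

    # 80/20 train/test split.
    X_train, X_test, y_train, y_test = train_test_split(
        X.to_numpy(dtype=float), y, test_size=0.2, random_state=seed
    )

    # Median imputation, fitted on the training split only.
    imputer = SimpleImputer(strategy="median")
    X_train = imputer.fit_transform(X_train)
    X_test = imputer.transform(X_test)

    # Normalize the outcome using training statistics.
    mu, sigma = y_train.mean(), y_train.std()
    sigma = sigma if sigma > 0 else 1.0
    return X_train, X_test, (y_train - mu) / sigma, (y_test - mu) / sigma
```

Fitting the imputer and the outcome scaling on the training split avoids leaking test-set statistics into the model; whether the authors did the same is not stated.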
Predictive evaluation:
- Fit each model per dataset (with possible CV hyperparameter selection), compute test RMSE, rank models per dataset, average ranks across datasets and normalize to [0,1] (lower = better).
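The rank-based predictive metric can be sketched in plain Python. The exact normalization is not given in this summary, so the version below assumes ranks run from 0 (best) to n_models - 1 (worst) on each dataset and are divided by n_models - 1, with ties sharing the average rank. The RMSE values in the example are hypothetical.

```python
from typing import Dict, List


def average_normalized_rank(rmse: Dict[str, List[float]]) -> Dict[str, float]:
    """rmse[model][d] is the test RMSE of `model` on dataset d.

    Each per-dataset rank lies in [0, 1] (0 = best); the result is the
    mean over datasets. Ties receive the average of the ranks they span.
    """
    names = list(rmse)
    n_models = len(names)
    n_datasets = len(next(iter(rmse.values())))
    totals = {name: 0.0 for name in names}
    for d in range(n_datasets):
        errors = {name: rmse[name][d] for name in names}
        for name in names:
            better = sum(e < errors[name] for e in errors.values())
            tied = sum(e == errors[name] for e in errors.values()) - 1
            rank = better + tied / 2.0
            totals[name] += rank / max(n_models - 1, 1)
    return {name: totals[name] / n_datasets for name in names}


# Hypothetical test RMSEs on three datasets, for illustration only.
scores = average_normalized_rank({
    "ols": [0.50, 0.40, 0.90],
    "tree": [0.60, 0.35, 0.95],
    "gbm": [0.45, 0.30, 0.80],
})
print(scores)
```

Ranking within each dataset before averaging keeps datasets with large error scales from dominating the aggregate, which a plain mean of RMSE would not.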
Agent interpretability evaluation:
- 200 LLM-graded tests; each test generates synthetic data from known functions, fits the model, exposes only the model’s str to the LLM, queries quantitative properties, then grades responses with numerical tolerance against ground truth.
- Categories & counts: Feature attribution (32), Point simulation (43), Sensitivity (32), Counterfactuals (28), Structural (28), Complex function simulation (37).
- Grader: GPT-4o used for most evaluations.
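A single "point simulation" test in the spirit of the description above might look like the sketch below. Everything here is a stand-in: `fit_line` replaces a real regressor, `stub_llm` replaces the LLM call, the 5% relative tolerance is an arbitrary choice, and the ground truth is assumed to be the fitted model's own prediction. None of these details come from the source.

```python
import random
import re
from typing import Callable


def fit_line(xs, ys):
    """Closed-form 1-D least squares; stands in for the fitted model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def grade(answer: str, truth: float, rel_tol: float = 0.05) -> bool:
    """Pass if the first number in the answer is within rel_tol of truth."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return False
    return abs(float(match.group()) - truth) <= rel_tol * max(abs(truth), 1e-8)


def point_simulation_test(ask_llm: Callable[[str], str], seed: int = 0) -> bool:
    rng = random.Random(seed)
    # 1. Synthetic data from a known function: y = 1 + 2 * x0 plus small noise.
    xs = [rng.uniform(-3, 3) for _ in range(100)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 0.01) for x in xs]
    # 2. Fit the model and expose only its string form.
    slope, intercept = fit_line(xs, ys)
    model_str = f"y = {intercept:.3f} + {slope:.3f} * x0"
    # 3. Ask a quantitative question about the model's behavior.
    prompt = (
        f"Model:\n{model_str}\n\n"
        "What does the model predict for x0 = 3.0? Answer with a number."
    )
    answer = ask_llm(prompt)
    # 4. Grade with numerical tolerance against the model's own prediction.
    return grade(answer, intercept + slope * 3.0)


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call: parses the equation and evaluates it."""
    eq = re.search(r"y = (\S+) \+ (\S+) \* x0", prompt)
    intercept, slope = float(eq.group(1)), float(eq.group(2))
    x = float(re.search(r"x0 = (\S+)\?", prompt).group(1))
    return str(intercept + slope * x)


# The interpretability score is the pass rate over a suite of such tests.
pass_rate = sum(point_simulation_test(stub_llm, seed=s) for s in range(5)) / 5
print(pass_rate)
```

In a real run, `ask_llm` would call the grader model, and the suite would cover the other five test categories as well.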
Autoresearch configuration:
- Coding agents: Claude Code (Opus variants) and Codex (GPT-5.3), at multiple reasoning-effort settings; runs produced dozens–100+ working model variants.
- Constraints: modify a single file, encouraged to build or substantially alter models (not merely tune known packages).
- Logging: metrics and metadata persisted to CSV; baseline models evaluated and recorded before evolution starts.
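The logging step in this configuration can be sketched as an append-only CSV written from inside the loop. The column names, the `propose` and `evaluate` callables, and the loop structure are all hypothetical; in the actual system a coding agent edits the model file rather than a function returning a name.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical column names; the source only says metrics and metadata
# are persisted to CSV.
FIELDS = ["iteration", "model_name", "rmse_rank", "interp_pass_rate"]


def log_result(path: Path, row: Dict[str, object]) -> None:
    """Append one evaluation record, writing the header on first use."""
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def run_loop(
    path: Path,
    baselines: Iterable[str],
    propose: Callable[[int], str],
    evaluate: Callable[[str], Tuple[float, float]],
    n_iters: int,
) -> None:
    """Record baselines first, then one row per evolved candidate."""

    def record(iteration: int, name: str) -> None:
        rank, interp = evaluate(name)
        log_result(path, {
            "iteration": iteration,
            "model_name": name,
            "rmse_rank": rank,
            "interp_pass_rate": interp,
        })

    # Baselines are evaluated and recorded before evolution starts.
    for name in baselines:
        record(0, name)
    for i in range(1, n_iters + 1):
        # `propose` stands in for the coding agent's edit of the model file.
        record(i, propose(i))
```

Appending a row after every evaluation means a crashed or interrupted run still leaves a usable record of everything tried so far.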
Baselines compared: OLS, Ridge, Lasso, Decision Trees (various sizes), HSTree, PyGAM, EBM, FIGS, RuleFit, RandomForest, GBM, MLP, TabPFN, etc.
Downstream ADS evaluation:
- Agents: GitHub Copilot CLI, Claude Code, Codex on the BLADE benchmark (13 datasets/questions with gold analyses).
- Conditions: (1) standard tools, (2) AGENTIC-IMODELS library available, (3–4) control pointers to imodels/interpretML packages.
- Scoring: GPT-4o scores each analysis against the gold standard on correctness, completeness, and clarity (1–10); runs and judgments are repeated to make comparisons more robust.
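The percentage gains reported for BLADE can be read as a relative change in mean judge score between two conditions. This summary does not say whether the figures are relative or absolute, so the sketch below assumes relative improvement; the scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def relative_improvement(baseline: Sequence[float], treatment: Sequence[float]) -> float:
    """Percent change in mean score from the baseline to the treatment condition."""
    base = mean(baseline)
    return 100.0 * (mean(treatment) - base) / base


# Hypothetical 1-10 judge scores for one agent under two conditions.
standard_tools = [4, 5, 3]
with_library = [6, 7, 5]
print(relative_improvement(standard_tools, with_library))
```

Under this reading, a low baseline mean produces a large percentage for the same absolute gain, which is worth keeping in mind when comparing the 8% and 73% ends of the reported range.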

Implications for AI Economics

Productivity and labor substitution:
- Tooling optimized for agents can substantially raise autonomous ADS performance. If agent-interpretable tools proliferate, routine data-science tasks may be increasingly automated, shifting demand away from low-to-medium-skill data-workers toward roles in oversight, domain specification, and higher-level modeling.
- Gains in agent productivity are asymmetric: agents that can access specialized agent-friendly tools capture outsized efficiency gains relative to agents relying on human-oriented tools—this can accelerate substitution where firms adopt agentic pipelines.
Market for agentic tools and differentiation:
- Agent-interpretability becomes a product feature and potential market dimension. Vendors may compete on quantifiable agent-interpretability scores (analogous to latency/accuracy metrics). A new niche—libraries, model classes, and formats designed to be simulatable by LLMs—may emerge.
- Lock-in risks: firms providing ecosystems (models + agent integrations) that standardize on certain agent-interpretable formats could create switching costs and platform dependence.
Capital and compute dynamics:
- The autoresearch approach uses LLM compute to design tools that then reduce downstream agent compute or human labor. Up-front compute cost (tokens, LLM calls) is modest relative to long-term gains—this suggests a favorable ROI for R&D investments that build reusable agent skills/libraries.
- Concentration risks: organizations with access to powerful coding LLMs and compute can create superior agentic toolsets, reinforcing incumbent advantage.
Measurement, standards, and auditing:
- The LLM-graded simulatability tests offer a practical, automated way to benchmark agent-facing interpretability. Economists and policymakers could adopt such metrics to audit agent pipelines, certify tools for regulatory compliance, or as procurement criteria.
- However, reliance on LLM-based graders introduces the possibility of “metric gaming” (models optimized to appear simulatable to the grader but fail in other agent contexts). Standards bodies or third-party auditors will be important to validate robustness across graders and agent architectures.
Policy and governance:
- As agents take over more scientific and oversight tasks, ensuring truthful, auditable decision-making is critical. Agent-interpretable models could improve transparency for automated systems, aiding regulatory review—but regulators must be aware of the grader-dependence and potential for deceptive representations.
- Labor-market policy: displacement risks for data-analytic roles suggest investment in reskilling toward agent oversight, prompt engineering, and domain-focused interpretation.
Research and investment directions:
- Investment in standardized agent-interpretability benchmarks (diverse graders and agents), transferability across tasks (classification, time-series, causal inference), and defenses against grader-overfitting.
- Commercial opportunity in curated libraries of agentic skills/models, tooling to translate human-oriented interpretability outputs into agent-friendly representations, and auditing services that validate agent-facing interpretability claims.

Overall, AGENTIC-IMODELS demonstrates that LLM-driven autoresearch can yield models and representations that materially improve autonomous data-science agents. For AI economics, this implies accelerating productivity gains in analytic work, the emergence of new product dimensions (agent-interpretability), potential market concentration around agentic tooling, and the need for measurement standards and governance to manage economic and social impacts.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides empirical, benchmark-based evidence that evolved models improve predictive accuracy and an LLM-derived interpretability metric and that these improvements translate to large gains on an end-to-end ADS benchmark (BLADE) with multiple LLM backends. However, claims about broader productivity or labor effects are not directly measured; the interpretability metric is LLM-based (not human-grounded) and may be dataset-, prompt-, or model-dependent, leaving external validity and real-world impact uncertain.
Methods Rigor: medium. The work appears to use a systematic evolutionary optimization loop, multiple datasets, and several downstream LLM toolchains for evaluation, which indicates careful empirical work. Key limitations include reliance on an LLM-graded interpretability metric that could be sensitive to prompts/LLM choice, limited detail about the diversity and number of datasets, potential for overfitting to the interpretability tests, and no human-subject validation of interpretability or deployment studies.
Sample: A library of scikit-learn-compatible regressors for tabular data was evolved and evaluated across multiple tabular datasets (unnamed in the summary), with generalization tests to held-out datasets and new interpretability tests; downstream end-to-end evaluation used the BLADE benchmark with three LLM-based ADS toolchains (Copilot CLI, Claude Code, OpenAI Codex). No human subject data are reported.
Themes: productivity, human_ai_collab
Generalizability:
- Results limited to tabular regression tasks and scikit-learn-compatible models; unclear extension to unstructured data (text, images).
- Interpretability metric depends on specific LLM graders, prompts, and tests and may not generalize across LLM families or future models.
- Downstream benchmarks tested only a few ADS toolchains (Copilot CLI, Claude Code, Codex); enterprise or human-in-the-loop settings not evaluated.
- Potential overfitting to the chosen interpretability tests and BLADE benchmark; robustness to adversarial prompts or real-world pipelines not shown.
- No evidence on human interpretability, developer productivity, or labor-market impacts, limiting socioeconomic generalizability.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce Agentic-imodels, an agentic autoresearch loop that evolves data-science tools designed to be interpretable by agents.	Other	positive	high	existence and implementation of Agentic-imodels (a system-level contribution)	0.18
Agentic-imodels develops a library of scikit-learn-compatible regressors for tabular data that are optimized for both predictive performance and a novel LLM-based interpretability metric.	Developer Productivity	positive	high	availability and optimization of a library of regressors (predictive performance and LLM-based interpretability metric)	0.18
We introduce a novel LLM-based interpretability metric that measures a suite of LLM-graded tests probing whether a fitted model's string representation is 'simulatable' by an LLM (i.e., whether the LLM can answer questions about the model's behavior by reading its string output alone).	AI Safety and Ethics	positive	high	agent-facing interpretability as measured by LLM-graded simulatable tests	0.18
The evolved models jointly improve predictive performance.	Output Quality	positive	high	predictive performance (e.g., prediction accuracy or other predictive metrics)	0.18
The evolved models jointly improve agent-facing interpretability (as measured by the LLM-based metric) and generalize to new interpretability tests.	AI Safety and Ethics	positive	high	agent-facing interpretability (LLM-graded simulatable test performance)	0.18
The evolved models generalize to new datasets.	Output Quality	positive	high	generalization of model performance to new datasets	0.18
These evolved models improve downstream end-to-end agentic data-science (ADS) performance, increasing performance for Copilot CLI, Claude Code, and Codex on the BLADE benchmark by up to 73%.	Developer Productivity	positive	high	downstream ADS performance on the BLADE benchmark (measured as benchmark performance increase)	up to 73% increase 0.18