AI agents fall short on domain-specific data science: in a 17-task benchmark across six industries, AI-only systems performed near or below the median human competitors, while human–AI teams produced the strongest solutions, underscoring the continued importance of human expertise.

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding · March 19, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
In the AgentDS benchmark and open competition, current AI agents perform at or below the median human competitor on domain-specific data-science tasks, and the best results come from human-AI collaborative teams.

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflows. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. The AgentDS website is at https://agentds.org/ and the open-source datasets are at https://huggingface.co/datasets/lainmn/AgentDS.

Summary

Main Finding

AgentDS shows that current agentic AI cannot yet replace human expertise in domain-specific data science. AI-only systems (both direct prompting and autonomous agents) underperform relative to the best human teams, while hybrid human–AI collaboration produces the strongest results. Domain knowledge and multimodal reasoning remain decisive for high-quality outcomes.

Key Points

  • Scope and outcome
    • AgentDS: 17 challenges across 6 domains (commerce, food production, healthcare, insurance, manufacturing, retail banking).
    • Competition: 29 teams, 80 participants, 10-day event. Participants could use any AI tools.
  • Performance summary
    • GPT-4o direct prompting baseline: overall quantile score 0.143 (rank 17/29), below the participant median (0.156).
    • Claude Code agentic baseline: overall quantile score 0.458 (rank 10/29), better than many humans but below top teams.
    • Top-performing solutions came from human–AI collaborative workflows; several teams abandoned fully autonomous agents during the contest.
  • Failure modes of AI agents
    • Poor integration of multimodal signals (images, PDFs, JSON/text) when domain-specific visual/textual cues matter.
    • Over-reliance on generic pipelines (default preprocessing + off-the-shelf models).
    • Limited domain reasoning, difficulty diagnosing modeling failures, and limited feature engineering creativity.
  • Design choices that stressed domain expertise
    • Synthetic but realistic datasets designed so naive pipelines underperform and domain-informed feature extraction yields substantial gains.
    • Multimodal inputs intentionally embed crucial latent signals outside the primary tabular data.
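
The design idea in the last two bullets can be illustrated with a toy example. The sketch below is not the AgentDS data-generating process; it is a minimal illustration that assumes a hypothetical dataset in which the target depends on a visible tabular column (`segment`) and on a flag that exists only in a JSON sidecar (`inspection.flagged`). All field names, effect sizes, and the group-mean "model" are invented for illustration.

```python
import json
import random
from collections import defaultdict
from statistics import mean

rng = random.Random(0)  # fixed seed so the toy data is reproducible


def make_row():
    """One synthetic record: a tabular field plus a JSON sidecar holding the latent signal."""
    segment = rng.choice(["a", "b", "c"])   # visible tabular column (hypothetical)
    flagged = rng.random() < 0.5            # hidden driver, present only in the sidecar
    sidecar = json.dumps({"inspection": {"flagged": flagged}})
    base = {"a": 10.0, "b": 12.0, "c": 15.0}[segment]
    target = base + (5.0 if flagged else 0.0) + rng.gauss(0.0, 1.0)
    return {"segment": segment, "sidecar": sidecar, "target": target}


rows = [make_row() for _ in range(400)]
train, test = rows[:300], rows[300:]


def fit_group_means(data, key_fn):
    """Predict the training mean of each group; fall back to the global mean for unseen groups."""
    groups = defaultdict(list)
    for row in data:
        groups[key_fn(row)].append(row["target"])
    fallback = mean(row["target"] for row in data)
    table = {key: mean(values) for key, values in groups.items()}
    return lambda row: table.get(key_fn(row), fallback)


def rmse(model, data):
    """Root mean squared error of a prediction function on held-out rows."""
    return mean((model(row) - row["target"]) ** 2 for row in data) ** 0.5


def naive_key(row):
    # Generic pipeline: uses the tabular column only.
    return row["segment"]


def informed_key(row):
    # Domain-informed pipeline: also parses the sidecar and extracts the flag.
    flagged = json.loads(row["sidecar"])["inspection"]["flagged"]
    return (row["segment"], flagged)


naive_rmse = rmse(fit_group_means(train, naive_key), test)
informed_rmse = rmse(fit_group_means(train, informed_key), test)
print(f"tabular only: {naive_rmse:.2f}   with sidecar feature: {informed_rmse:.2f}")
```

Because the flag shifts the target by much more than the noise level, the tabular-only predictor cannot close the gap no matter how it is tuned, which is the kind of gap the benchmark's construction is described as targeting.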

Data & Methods

  • Benchmark construction
    • 17 challenges across 6 industries; tasks include classification, regression, ranking with domain-appropriate metrics (Macro-F1, RMSE, NDCG@10, normalized Gini, MSE, MAE).
    • Data curation pipeline: domain research → synthetic data generation embedding latent signals in additional modalities → theoretical performance bounds (via known DGP) → documentation and expert validation.
    • Datasets include primary tabular tables plus additional modalities (images, text, PDFs, JSON).
  • Evaluation framework
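    • Challenge-specific metrics mapped to a common [0,1] quantile score per challenge: q_i = (n − r_i) / (n − 1), where r_i is the entrant's rank on the challenge (1 = best) and n is the number of ranked entrants, so the top rank scores 1 and the last rank scores 0.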
    • Domain score = mean quantile across challenges in domain; overall score = mean across six domains.
    • Tie-breakers based on efficiency (number of submissions, earlier submission time).
  • Competition protocol
    • Teams could freely use AI tools; up to 100 submissions per challenge.
    • Code and reports collected post-competition for reproducibility and analysis.
  • AI-only baselines
    • Direct prompting: GPT-4o given data and description.md; single-shot code generation executed to produce submission.
    • Agentic baseline: Claude Code (autonomous, iterative code execution) with a 10-minute time budget per challenge.
    • Baselines inserted into participant pool and ranked by the same quantile procedure for comparability.
  • Empirical analysis
    • Quantitative comparison of baseline and human team scores at the overall, domain, and challenge levels.
    • Qualitative code inspection to identify reasoning/engineering patterns and failure modes.
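
The scoring procedure in the evaluation framework can be sketched in a few lines. This is a minimal sketch of the stated formula and aggregation, not the organizers' code: the function names, team names, and metric values are hypothetical, and ties are resolved here by input order rather than by the efficiency tie-breakers the benchmark uses.

```python
from statistics import mean


def quantile_scores(metric_by_entrant, higher_is_better=True):
    """Map raw metric values to quantile scores q = (n - r) / (n - 1), with rank 1 = best."""
    n = len(metric_by_entrant)
    if n < 2:
        raise ValueError("need at least two entrants to compute a quantile score")
    ordered = sorted(
        metric_by_entrant, key=metric_by_entrant.get, reverse=higher_is_better
    )
    # Assumption: ties keep input order; the benchmark breaks ties by efficiency instead.
    return {name: (n - rank) / (n - 1) for rank, name in enumerate(ordered, start=1)}


def overall_score(quantiles_by_domain):
    """Average challenge scores within each domain, then average the domain means."""
    domain_means = {d: mean(qs) for d, qs in quantiles_by_domain.items()}
    return domain_means, mean(list(domain_means.values()))


# Hypothetical RMSE values for one challenge (lower is better). The AI baseline is
# placed in the same pool as the teams, as described for the AI-only baselines.
rmse = {"team_a": 0.91, "team_b": 1.20, "agent_baseline": 1.05, "team_c": 1.48}
q = quantile_scores(rmse, higher_is_better=False)

# Hypothetical per-challenge quantile scores for one entrant in two domains.
domain_means, overall = overall_score({"healthcare": [1.0, 0.5], "insurance": [0.0]})
print(q, domain_means, overall)
```

Averaging by domain first means a domain with few challenges carries the same weight as one with many: in the toy input the overall score is 0.375, whereas a flat mean over the three challenges would be 0.5.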

Implications for AI Economics

  • Labor substitution vs. augmentation
    • Evidence favors augmentation: AI accelerates implementation but does not substitute for expert domain reasoning. Human–AI collaboration outperforms the AI-only baselines, implying complementarity between AI tools and skilled workers.
    • Short-run: demand for skilled data scientists who can steer AI, perform domain feature engineering, and diagnose complex failures is likely to remain strong.
    • Long-run: routine parts of data science (code scaffolding, basic modeling) are already automatable; wage pressure may arise for purely routine roles, while premiums grow for domain expertise and hybrid skill sets.
  • Productivity and firm strategy
    • Firms adopting AI-agent tools can raise productivity by reducing implementation time, but gains depend on human expertise to realize domain-specific performance improvements.
    • Returns to AI investment will be heterogeneous: firms with stronger domain experts (or better human–AI integration processes) capture outsized benefits.
  • Skill-biased technical change and inequality
    • The results point to a skill-biased effect: complementary tasks (domain reasoning, multimodal intuition, strategy) command higher value, suggesting increased wage dispersion across data workers based on domain capital and AI-collaboration skills.
    • Smaller firms lacking domain experts or the ability to integrate AI effectively may be slower to realize gains, potentially widening firm-level gaps.
  • Market for AI tools and services
    • Demand shifts toward AI that supports interactive, human-guided workflows and improved multimodal integration rather than fully autonomous agents for complex domain tasks.
    • Opportunity for specialized toolchains (domain adapters, multimodal feature extractors, explainability/diagnostic modules) and services that help firms operationalize human–AI collaboration.
  • R&D and investment priorities
    • Prioritize research on: effective multimodal grounding, embedding domain knowledge (structured priors, domain-specific models), interactive debugging/interpretability, and systems that facilitate human control over agentic workflows.
    • Benchmarks like AgentDS are economically useful: they reveal where automation is feasible versus where human capital remains necessary, guiding R&D allocation and training investments.
  • Policy and workforce implications
    • Training programs should emphasize domain knowledge, model interpretability, and human–AI orchestration skills.
    • Policymakers should monitor labor-market transitions, support reskilling toward complementary roles, and consider targeted support for sectors where domain expertise remains central.
  • Empirical research directions for economists
    • Measure causal impacts of AI-assisted tools on productivity and wages across firm types and sectors.
    • Study task-level automation potential: decompose data science into routine/automatable vs. domain-specific tasks and quantify reallocation effects.
    • Analyze adoption heterogeneity: which firms capture value from AI adoption and which workers benefit versus face displacement.

Caveats to keep in mind when drawing economic conclusions

  • AgentDS uses synthetic datasets designed to reward domain reasoning; real-world datasets may vary in how much domain knowledge is required.
  • Competition setting (10 days, contest incentives) differs from long-term industry deployment and iterative model maintenance.
  • Baselines reflect particular agent implementations and time budgets; future agent improvements could shift the balance.

Overall, AgentDS supports a view of near-term AI as an augmenting technology for data science rather than a wholesale substitute for domain experts. Economic effects will therefore be driven by complementarities, changes in required skill bundles, and heterogeneity in firms’ ability to integrate human–AI collaboration.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Empirical competition with multiple teams and systematic AI-only baselines provides direct comparative evidence on agent vs human performance, but the design is not causal, participants are self-selected, task set is limited, and scoring/judging choices may influence outcomes.

Methods Rigor: medium.
  • Strengths: a curated benchmark spanning 17 tasks across six industries, open datasets, multiple human teams and AI baselines, and an organized competition.
  • Weaknesses: likely non-random, self-selected participants, unspecified or potentially subjective scoring/aggregation procedures, limited transparency on inter-rater reliability and model/version controls, and absence of pre-registered evaluation protocols.

Sample: AgentDS benchmark of 17 domain-specific data-science challenges across six industries (commerce, food production, healthcare, insurance, manufacturing, retail banking); open competition with 29 teams and 80 participants; comparisons against AI-only baselines (LLM/agent solutions); datasets released on Hugging Face.

Themes: human_ai_collab, productivity

Generalizability
  • Self-selected competition participants may not represent the broader population of data scientists or organizations.
  • Tasks are limited to 17 curated challenges and six industries, so findings may not hold for other domains or task types.
  • AI baselines reflect particular model versions, prompts, and agent implementations that will change rapidly.
  • Competition/lab setting may differ from real-world production environments (scale, integration, stakes).
  • Scoring and evaluation procedures (judging criteria, inter-rater consistency) may limit comparability to other assessments.

Claims (10)

| Claim | Direction | Confidence | Outcome | Details |
| --- | --- | --- | --- | --- |
| Data science plays a critical role in transforming complex data into actionable insights across numerous domains. [Organizational Efficiency] | positive | high | transforming complex data into actionable insights | 0.03 |
| Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow. [Organizational Efficiency] | positive | high | automation of data science workflow | 0.09 |
| We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. [Other] | positive | high | benchmark for evaluating AI agents and human-AI collaboration | 0.18 |
| AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. [Other] | null_result | high | number of challenges and industry coverage | n=17; 0.3 |
| We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. [Other] | null_result | high | competition participation enabling comparison | n=80; 0.18 |
| Our results show that current AI agents struggle with domain-specific reasoning. [Decision Quality] | negative | high | domain-specific reasoning performance | n=80; 0.18 |
| AI-only baselines perform near or below the median of competition participants. [Output Quality] | negative | high | relative performance rank of AI-only baselines vs participants | n=80; 0.18 |
| The strongest solutions arise from human-AI collaboration. [Team Performance] | positive | high | performance of human-AI collaborative solutions | n=80; 0.18 |
| These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science. [Organizational Efficiency] | mixed | high | implications for automation vs. human expertise | n=80; 0.18 |
| The AgentDS benchmark datasets are open-sourced and available at https://huggingface.co/datasets/lainmn/AgentDS. [Other] | positive | high | availability of datasets | 0.3 |
