The Commonplace

A systematic benchmark finds LLMs are most capable on mathematics and programming text tasks (SAFI ~72–73) and weakest on Active Listening and Reading Comprehension tasks (SAFI ~42–46), while real-world adoption data suggest most AI interactions augment rather than automate tasks. The authors also document a 'capability–demand inversion' where heavily AI-exposed occupations demand skills LLMs perform relatively poorly on.

The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
Rudra Jadhav, Janhavi Danve · April 08, 2026
arXiv · descriptive · low evidence · relevance 7/10
Using a 263-task benchmark mapped to O*NET, the paper presents SAFI showing LLMs score highest on mathematics and programming tasks and lowest on active listening and reading comprehension, finds a capability-demand inversion for AI-exposed jobs, and reports that most observed AI interactions are augmentative rather than fully automating tasks.

As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix -- an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a "capability-demand inversion" where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

Summary

Main Finding

Frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, Gemini 2.5 Flash) show systematically higher text-based automation feasibility for structured technical skills (Mathematics SAFI 73.2; Programming SAFI 71.8) and lower feasibility for nuanced content/social skills (Active Listening SAFI 42.2; Reading Comprehension SAFI 45.5). Real-world usage data indicate that AI interactions are predominantly augmentation (78.7%). Separately, the authors report a “capability–demand inversion”: the skills most concentrated in AI-exposed occupations are those LLMs perform worst at in the text-based benchmark, although the underlying correlation is not statistically significant at α=0.05. SAFI is a text-task index (0–100) and does not directly measure full occupational automation.

Key Points

  • SAFI: Skill Automation Feasibility Index computed from 263 purpose‑designed text tasks covering all 35 O*NET skills, averaged across 4 LLMs (1,052 responses). Score range 0–100.
  • Top and bottom skills (text benchmark):
    • Highest: Mathematics 73.2; Programming 71.8.
    • Lowest: Active Listening 42.2; Reading Comprehension 45.5; Speaking 48.5; Writing 51.0.
  • Augmentation versus automation (Anthropic Economic Index):
    • 78.7% of observed AI–task interactions are augmentation (feedback loops, learning); the remaining 21.3% are directive automation.
  • Cross‑model convergence:
    • Narrow spread in overall SAFI across models: 3.6 points separate the highest and lowest scores (Mistral Large 60.0; LLaMA 58.2; Qwen 56.7; Gemini 56.4).
  • Correlations:
    • Programming importance correlates positively with AI exposure (+0.455).
    • SAFI vs. real-world exposure correlation negative (Pearson r = −0.196, p = 0.26; Spearman ρ = −0.300, p = 0.08) — consistent negative direction but not statistically significant at α=0.05.
  • Scoring method: heuristic multi-signal rubric (Response Completeness 0–3, Depth 0–3, Reasoning 0–2, Difficulty bonus 0–2; total 0–10) to avoid LLM-as-judge bias.
  • All data, code, and model responses are open-sourced by the authors.
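
The significance statement in the Correlations bullet can be sanity-checked from the reported coefficients alone. The sketch below is not the authors' code: it assumes both correlations were computed across the 35 skills (n = 35), and it uses the standard t-approximation for the Spearman coefficient. The function names `t_pdf` and `two_sided_p` are illustrative.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r: float, n: int, steps: int = 2000) -> float:
    """Two-sided p-value for a correlation r over n paired observations.

    Converts r to a t statistic with n - 2 degrees of freedom, then
    integrates the t density from 0 to |t| with Simpson's rule
    (steps must be even).
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # The central area between -|t| and |t| is 2 * area; the tails are the rest.
    return 1 - 2 * area


N_SKILLS = 35  # assumption: one observation per O*NET skill
pearson_p = two_sided_p(-0.196, N_SKILLS)
spearman_p = two_sided_p(-0.300, N_SKILLS)
print(f"Pearson p ~ {pearson_p:.2f}, Spearman p ~ {spearman_p:.2f}")
```

Under these assumptions both p-values stay above 0.05, in line with the reported p = 0.26 and p = 0.08: with only 35 skills, a correlation of this size cannot be distinguished from zero.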

Data & Methods

  • Data sources:
    • O*NET v30.2 (35 skills across ~1,016 occupations).
    • Anthropic Economic Index releases (Feb 2025–Mar 2026): job exposure for 756 occupations, 17,998 tasks, 3,364 task interaction records classified by interaction mode.
    • Authors’ LLM benchmark: 263 text tasks (easy/medium/hard per skill), 4 models, 1,052 successful model calls (0% failure).
  • Models & setup:
    • LLaMA 3.3 70B (open-source), Mistral Large (closed), Qwen 2.5 72B (open-source), Gemini 2.5 Flash (closed).
    • Identical prompts, temperature 0.3.
  • Task design:
    • Tasks designed to reflect O*NET skill definitions in text form (e.g., coding and debugging for Programming; contradiction identification for Reading Comprehension). Tasks do not capture embodied/real‑time interaction.
  • SAFI computation:
    • Normalized average of heuristic scores across tasks and models per skill (see formula in paper). Reflects LLM performance on textual skill representations.
  • Limitations noted by authors:
    • SAFI measures text-based performance, not full occupational execution.
    • Scoring artifact: harder tasks elicited longer/more structured outputs and thus higher scores (reported as limitation).
    • Small N at the skill-aggregation level (n=35) reduces statistical power for correlations.
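
To make the scoring pipeline concrete, here is a minimal sketch of how rubric totals could roll up into a 0–100 skill score. It is an assumption-laden illustration, not the authors' formula (which is given in the paper): it treats SAFI as the mean rubric total for a skill, across tasks and models, rescaled from 0–10 to 0–100. The class `ScoredResponse`, the function `safi`, and the demo scores are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean

# Maximum points per rubric component, as listed in this summary.
RUBRIC_MAX = {"completeness": 3, "depth": 3, "reasoning": 2, "difficulty_bonus": 2}
MAX_TOTAL = sum(RUBRIC_MAX.values())  # 3 + 3 + 2 + 2 = 10


@dataclass
class ScoredResponse:
    """One model response to one task, scored on the heuristic rubric."""
    skill: str
    model: str
    completeness: int
    depth: int
    reasoning: int
    difficulty_bonus: int

    def total(self) -> int:
        """Sum the rubric components after checking each is within range."""
        parts = {name: getattr(self, name) for name in RUBRIC_MAX}
        for name, value in parts.items():
            if not 0 <= value <= RUBRIC_MAX[name]:
                raise ValueError(f"{name}={value} outside 0..{RUBRIC_MAX[name]}")
        return sum(parts.values())


def safi(responses: list[ScoredResponse], skill: str) -> float:
    """Assumed normalization: mean rubric total for the skill, scaled to 0-100."""
    totals = [r.total() for r in responses if r.skill == skill]
    return 100 * mean(totals) / MAX_TOTAL


# Hypothetical scores, invented for illustration only (not from the paper).
demo = [
    ScoredResponse("Mathematics", "model_a", 3, 2, 2, 1),  # total 8
    ScoredResponse("Mathematics", "model_b", 2, 2, 2, 1),  # total 7
]
demo_safi = safi(demo, "Mathematics")

# Every task is sent to every model, which gives the reported call count.
expected_calls = 263 * 4
print(demo_safi, expected_calls)
```

Because the difficulty bonus and the length-sensitive components feed the same total, a sketch like this also shows where the scoring artifact noted by the authors would enter: longer outputs on harder tasks raise the total directly.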

Implications for AI Economics

  • Augmentation-dominant early adoption: Empirical evidence (Anthropic data + SAFI patterns) implies current AI adoption tends to augment human labor—policy and firm responses should prioritize augmentation pathways and worker complementarities over blanket displacement narratives.
  • Targeted reskilling & curriculum shifts:
    • High SAFI + high exposure (e.g., Programming) identifies near‑term elevated displacement risk for routine, entry‑level tasks; recommended focus: move training from rote task execution to higher‑order system design, AI orchestration, evaluation/debugging of AI outputs.
    • Low SAFI + high exposure (Content/Social skills) suggests investment in human skills where LLMs are currently weaker (empathy, negotiation, contextual judgment) and in “AI‑management” skills (prompt engineering, evaluation protocols).
    • Upskilling Window (physical technical skills): integrate data literacy and AI‑assisted diagnostic tools in vocational/trade training to capture future complementarities before exposure rises.
  • Measurement & monitoring:
    • SAFI provides a replicable, open benchmark for text‑based skill feasibility that policymakers and researchers can use to monitor shifts as models evolve; cross‑model convergence suggests skill‑level assessments may be relatively stable across current frontier systems, but this requires continual updating.
  • Labor market policy:
    • Emphasize policies that subsidize employer‑led on‑the‑job retraining in AI‑assisted workflows, portability of credentials for hybrid technical/communication competencies, and targeted support for workers in occupations with high concentrations of automatable routine tasks.
  • Caution for inference:
    • SAFI is a useful signal but not a direct displacement forecast. Full occupational outcomes will depend on task bundling, labor demand, complementary capital, regulatory decisions, and model deployment design.


Assessment

Paper Type: descriptive

Evidence Strength: low. The paper provides a systematic benchmark of LLM performance on text-based proxies for occupational skills and correlates those scores with an adoption dataset, but it does not identify causal effects on employment, wages, or firm productivity and relies on constructed text tasks rather than observed workplace task execution, limiting claims about real-world automation or displacement.

Methods Rigor: medium. The study is transparent and systematic: four state-of-the-art LLMs, 263 text-based tasks mapped to all 35 O*NET skills, 1,052 model calls, and a large occupation-level adoption dataset (Anthropic Economic Index: 756 occupations, 17,998 tasks); results are open-sourced. However, rigor is limited by reliance on text-only task formulations, potential subjectivity in task and prompt design and scoring, snapshot evaluation of specific model versions, and the indirect mapping from benchmark tasks to actual workplace task performance.

Sample: Four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, Gemini 2.5 Flash) evaluated on 263 text-based tasks covering all 35 O*NET skills (1,052 model calls, 0% failure rate). Cross-referenced with the Anthropic Economic Index covering 756 occupations and 17,998 observed AI-related tasks; mapping links model task performance (SAFI) to occupational skill demand.

Themes: labor_markets, skills_training, adoption, human_ai_collab

Generalizability:
  • Text-only tasks: excludes multimodal, sensorimotor, and embodied work that matter for many occupations.
  • Benchmark tasks are proxies and may not capture contextual, team-based, or dynamic aspects of real workplace task execution.
  • Snapshot of specific model versions; results may change as models update or via fine-tuning/tooling.
  • Mapping between benchmark tasks and O*NET job tasks involves subjective choices and potential measurement error.
  • Likely English-language and U.S.-centric O*NET taxonomy limits transferability to other languages and labor markets.
  • Focus on frontier LLMs excludes other forms of automation (specialized software, robotics, domain-specific models) that affect occupations.

Claims (10)

  • Claim: We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). [Other]
    • Direction: positive
    • Confidence: high
    • Outcome: benchmark coverage and execution success (model calls and failure rate)
    • Details: n=1052; 0% failure rate; 0.18
  • Claim: The benchmark covers 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy. [Other]
    • Direction: positive
    • Confidence: high
    • Outcome: coverage of O*NET skills by benchmark tasks
    • Details: n=263; 263 tasks across 35 skills; 0.18
  • Claim: The study cross-references the SAFI benchmark with real-world AI adoption data from the Anthropic Economic Index covering 756 occupations and 17,998 tasks. [Adoption Rate]
    • Direction: positive
    • Confidence: high
    • Outcome: occupations and tasks coverage in cross-reference dataset
    • Details: n=17998; 756 occupations, 17,998 tasks; 0.18
  • Claim: We propose an AI Impact Matrix that positions skills into four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. [Other]
    • Direction: neutral
    • Confidence: high
    • Outcome: interpretive classification of skills into four impact quadrants
    • Details: 0.03
  • Claim: Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest. [Automation Exposure]
    • Direction: mixed
    • Confidence: high
    • Outcome: SAFI score by skill (automation feasibility)
    • Details: Mathematics: 73.2; Programming: 71.8; Active Listening: 42.2; Reading Comprehension: 45.5; 0.18
  • Claim: There is a 'capability-demand inversion' where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark. [Skill Obsolescence]
    • Direction: negative
    • Confidence: high
    • Outcome: relationship between skill demand in AI-exposed jobs and SAFI performance
    • Details: n=17998; 0.18
  • Claim: 78.7% of observed AI interactions are augmentation, not automation. [Automation Exposure]
    • Direction: positive
    • Confidence: high
    • Outcome: share of AI interactions classified as augmentation vs automation
    • Details: n=17998; 78.7% augmentation; 0.18
  • Claim: All four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. [Automation Exposure]
    • Direction: null_result
    • Confidence: high
    • Outcome: variation (spread) in SAFI skill profiles across models
    • Details: n=4; 3.6-point spread; 0.18
  • Claim: SAFI measures LLM performance on text-based representations of skills, not full occupational execution. [Other]
    • Direction: neutral
    • Confidence: high
    • Outcome: scope of SAFI measure (text-based representations vs full job execution)
    • Details: 0.03
  • Claim: All data, code, and model responses are open-sourced. [Other]
    • Direction: positive
    • Confidence: high
    • Outcome: availability of study materials (data, code, responses)
    • Details: 0.09
