The Commonplace

A new RL-based index finds that operational and control roles such as power plant operators, railroad conductors and cargo-handling supervisors are more learnable by reinforcement-learning systems than existing AI-exposure indices suggest, while creative and interpersonal professions score high on existing exposure indices but low on RL learnability. The divergence points to different policy priorities than current AI-exposure metrics indicate.

What Jobs Can AI Learn? Measuring Exposure by Reinforcement Learning
Philip Moreira Tomei, Bouke Klein Teeselink · May 04, 2026
arXiv · descriptive · medium evidence · 8/10 relevance
The authors construct an RL Feasibility Index by scoring all O*NET tasks for how learnable they are by reinforcement-learning–trained systems, showing systematic divergence from existing AI exposure metrics and identifying occupation groups (e.g., power plant operators, railroad conductors) with high RL learnability but low prior exposure scores.

Which jobs can AI learn to do? We examine this for every occupation in the US economy. Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large. Reinforcement learning in post-training, now the dominant paradigm at the frontier, is structured around task completion and maps more directly onto the task-based architecture of occupational classifications than prior approaches. Using LLM annotators guided by a rubric developed with RL experts and validated against confirmed deployment cases, we score all 17,951 O*NET tasks for training feasibility and aggregate to the occupation level, producing an RL Feasibility Index. The index diverges sharply from existing AI exposure measures for specific occupation groups: power plant operators, railroad conductors, and aircraft cargo handling supervisors score high on RL feasibility but low on general AI exposure, while creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse. These divergences carry direct implications for policy interventions.

Summary

Main Finding

The authors produce an RL Feasibility Index that assesses, for every U.S. occupation, how feasible it is to train AI to perform its tasks using reinforcement-learning-based post‑training (e.g., RLHF, RLAIF, RL with verifiable rewards). The index—built from O*NET task descriptions and LLM-based annotation—identifies a different pattern of near‑term automation risk than existing LLM‑exposure measures: many digitally verifiable monitoring and control roles are highly RL‑feasible despite low prior LLM exposure, while creative and high‑judgment roles can show high LLM exposure but low RL feasibility. A suggestive difference‑in‑differences test finds occupations with higher RL exposure have experienced a modest relative decline in job openings after ChatGPT (−2.9% per 1 SD, p ≈ 0.085).

Key Points

  • New forward‑looking index (RL Feasibility Index) targets the dominant frontier training paradigm (reinforcement learning) rather than task‑LLM overlap alone.
  • Coverage: 17,951 O*NET tasks → 894 SOC occupations; index and code are public (GitHub link in paper).
  • Construction:
    • Binary physical feasibility gate: tasks requiring substantial physical embodiment get index = 0 (focus is on software‑trainable tasks).
    • For digital tasks, eight dimensions scored 1–10 (D1–D8): verification method, environment simulability, observability, task variability/expertise breadth, decision depth, feedback density/decomposability, tool/interface accessibility, and output tangibility.
    • Task score = mean of eight dimension scores; occupation score = importance‑weighted mean across tasks (O*NET task importance); rescaled 0–100.
    • Annotation done with LLM evaluators (primary: Gemini 2.5 Flash) using a rubric and forced justifications to reduce arbitrary ratings.
  • Descriptive results:
    • Highest RL feasibility: clerical/data‑processing/information‑handling roles (e.g., data entry keyers, proofreaders, correspondence clerks).
    • Lowest: manually embodied jobs (dishwashers, stonemasons, carpentry helpers) score zero due to the physical gate. 40.7% of tasks receive zero.
    • By SOC major group: Office & Admin Support, Computer & Mathematical, and Business & Financial score highest; Construction, Farming/Fishing/Forestry, and Installation & Maintenance score lowest.
    • Within labor market data (Revelio Labs): RL feasibility is hump‑shaped across wages and seniority—highest for upper‑middle wage and mid‑career roles; lower for the very lowest and highest wage/seniority.
  • Labor market signals:
    • Difference‑in‑differences (DiD) on monthly job openings (Aug 2021–Nov 2025): one SD higher RL exposure → ≈ 2.9% fewer job openings after ChatGPT (marginally significant, p ≈ 0.085).
    • Event study suggests effects may emerge slowly, aligning with gradual RL advances and firm adoption.
  • Comparison to prior LLM exposure measures (e.g., Eloundou et al. 2024):
    • Overall correlation is high but important divergences occur:
      • High LLM exposure but low RL feasibility: creative, high‑judgment, non‑simulable tasks (musicians, CEOs, some scientists, physicians).
      • High RL feasibility but low LLM exposure: monitoring/control tasks with verifiable outcomes and simulable environments but minimal text (power/chemical plant operators, railroad conductors, cargo supervisors).
    • These divergences matter for policy targeting and retraining priorities.
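
The construction steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function names `task_rl_index` and `occupation_rl_index` are hypothetical, the example scores and weights are made up, and the sketch assumes that physically gated tasks enter the occupation average with a score of 0 rather than being dropped.

```python
from statistics import mean

N_DIMENSIONS = 8  # D1-D8 in the rubric


def task_rl_index(requires_embodiment, dimension_scores=()):
    """Return a task score on a 0-100 scale.

    Tasks that fail the binary physical gate get 0. Otherwise the eight
    1-10 dimension scores are averaged and rescaled so that all-1s maps
    to 0 and all-10s maps to 100.
    """
    if requires_embodiment:
        return 0.0
    if len(dimension_scores) != N_DIMENSIONS:
        raise ValueError("expected eight dimension scores")
    if not all(1 <= s <= 10 for s in dimension_scores):
        raise ValueError("dimension scores must lie in 1-10")
    return (mean(dimension_scores) - 1) / 9 * 100


def occupation_rl_index(tasks):
    """Importance-weighted mean of task scores.

    `tasks` is an iterable of (importance_weight, task_score) pairs.
    Assumption: gated tasks enter with a score of 0 rather than being dropped.
    """
    tasks = list(tasks)
    total_weight = sum(w for w, _ in tasks)
    if total_weight <= 0:
        raise ValueError("importance weights must sum to a positive number")
    return sum(w * s for w, s in tasks) / total_weight


# Made-up example: one digital task and one physically gated task.
digital = task_rl_index(False, [8, 7, 9, 6, 5, 7, 8, 6])
gated = task_rl_index(True)
occupation = occupation_rl_index([(4.0, digital), (2.0, gated)])
```

Under this reading, an occupation made up entirely of gated tasks scores exactly zero, which matches the reported zeros for jobs such as dishwashers and stonemasons.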

Data & Methods

  • Primary task universe: O*NET 30.0 (17,951 task statements across 894 occupations at 8‑digit SOC).
  • Annotation approach:
    • Binary physical gate applied first.
    • Eight‑dimension rubric scored 1–10; annotators must produce textual justification for each numeric score (reduces spurious answers).
    • LLM evaluators via OpenRouter API; primary model Gemini 2.5 Flash; alternative models used for robustness checks.
  • Aggregation:
    • Task → occupation using O*NET task importance weights: occupation score = weighted mean of task RLi scores.
    • Task index formula: RLi = ((mean dimension score) − 1) / 9 × 100 (maps all‑1s to 0 and all‑10s to 100).
  • Labor market linkage:
    • Revelio Labs position records (position counts, salaries, seniority) used to analyze wage/seniority correlations.
    • Monthly job openings from Revelio Labs (Aug 2021–Nov 2025) used in difference‑in‑differences and event‑study to examine early labor effects.
  • Validation and robustness:
    • Rubric developed with RL experts, validated against confirmed deployment cases (details in appendices).
    • PCA of the eight dimensions: the first principal component explains ~65% of variance; tool accessibility loads mainly on a separate, roughly orthogonal second component (PC2).
  • Limitations called out by authors:
    • Excludes robotics/physical embodiment; RL environments restricted to software/simulations.
    • Reliance on LLM annotators introduces potential annotation bias and model‑dependent judgments despite rubric and justifications.
    • Results are conditional on RL‑based training pathways (RLHF, RLAIF, RLVR); other algorithmic routes are out of scope.
    • Marginal statistical significance in job‑postings analysis; causal interpretation is tentative.
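
The job‑openings comparison can be illustrated with a small two‑way fixed‑effects difference‑in‑differences estimator. The summary does not give the authors' exact specification, so everything here is an assumption: the model form (log openings regressed on standardized exposure × post‑ChatGPT indicator, with occupation and month fixed effects), the choice of December 2022 as the first post month, the helper names, and the synthetic panel, whose effect size is set arbitrarily and is not the paper's data.

```python
import random


def demean_two_way(matrix):
    """Remove row and column means from a balanced panel (rows x cols)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    grand = sum(row_means) / n_rows
    return [[matrix[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)] for i in range(n_rows)]


def did_two_way_fe(log_openings, exposure, post):
    """Coefficient on exposure x post after absorbing both fixed effects."""
    interaction = [[e * p for p in post] for e in exposure]
    x = demean_two_way(interaction)
    y = demean_two_way(log_openings)
    num = sum(a * b for xr, yr in zip(x, y) for a, b in zip(xr, yr))
    den = sum(a * a for xr in x for a in xr)
    return num / den


def make_panel(exposure, post, beta, noise_sd, rng):
    """Synthetic log openings with occupation and month effects."""
    occ_fe = [rng.gauss(5.0, 1.0) for _ in exposure]
    month_fe = [rng.gauss(0.0, 0.1) for _ in post]
    return [[occ_fe[i] + month_fe[t] + beta * exposure[i] * post[t]
             + rng.gauss(0.0, noise_sd) for t in range(len(post))]
            for i in range(len(exposure))]


rng = random.Random(0)
n_occ, n_months = 50, 52                      # Aug 2021-Nov 2025 is 52 months
post = [1.0 if t >= 16 else 0.0 for t in range(n_months)]  # index 16 = Dec 2022
raw = [rng.gauss(0.0, 1.0) for _ in range(n_occ)]
mu = sum(raw) / n_occ
sd = (sum((r - mu) ** 2 for r in raw) / n_occ) ** 0.5
exposure = [(r - mu) / sd for r in raw]       # standardized to mean 0, SD 1

true_beta = -0.03                             # arbitrary synthetic effect
panel = make_panel(exposure, post, true_beta, 0.01, rng)
beta_hat = did_two_way_fe(panel, exposure, post)
```

Because exposure is standardized, the estimated coefficient reads as the change in log openings per one SD of exposure after the cutoff. The double‑demeaning shortcut is only valid for a balanced panel; real posting data with gaps would need a regression with explicit fixed effects.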

Implications for AI Economics

  • Measurement: Exposure indices must account for the training paradigm driving frontier progress. RL‑oriented measures can reveal different near‑term automation risks than LLM‑overlap metrics, changing which occupations policy should prioritize.
  • Policy targeting and retraining: Policymakers relying on prior LLM exposure indices may under‑target certain operational monitoring/control occupations (power/chemical plant operators, certain transportation supervisors) that are RL‑feasible but low on LLM metrics. Retraining and safety/regulation efforts should include these roles.
  • Labor market structure and inequality:
    • RL‑feasibility peaks in upper‑middle wage and mid‑seniority roles, suggesting a potential for further job polarization (pressure on middle‑skill roles) and implications for wage dynamics.
    • Employer adoption dynamics matter: RL gains are incremental and adoption lags; monitoring job‑posting trends and firm adoption is important for lead indicators.
  • Research and evaluation priorities:
    • Invest in validated, cross‑model annotation pipelines and ground truthing (deployments, firm adoption case studies) to reduce annotator/model bias.
    • Expand indices to incorporate hybrid pathways (e.g., combined LLM + programmatic toolchains, human‑in‑the‑loop systems) and to quantify complementarities/augmentation effects, not only substitutability.
    • Incorporate robotics/embodiment analyses in parallel to capture physical automation risk.
  • Regulatory and safety considerations:
    • High RL feasibility is often paired with verifiable outputs and simulable environments—opportunities for rigorous testing, sandboxing, and standards setting before wide deployment.
    • For occupations with low RL feasibility but high LLM exposure (creative, judgmental tasks), policy should focus on augmentation, quality control, misinformation risk, and liability frameworks rather than displacement mitigation alone.

Key resources: RL Feasibility Index and code are publicly available per the paper (GitHub link in text).

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper produces a comprehensive, task-level RL Feasibility Index covering all 17,951 O*NET tasks and validates the rubric against confirmed deployment cases, giving empirical substance to its claims; however, the index is based on LLM annotators and rubric judgments rather than independent experimental or real-world rollouts across occupations, so conclusions about real-world impacts and learnability beyond current model families remain provisional.
Methods Rigor: medium. Methods are systematic: a rubric developed with RL experts, LLM-based annotation at full task coverage, and validation against known deployments; nevertheless, reliance on LLM annotators introduces potential bias from current model capabilities, the validation set and procedures are not described as exhaustive, and subjective rubric interpretation and aggregation choices could materially affect results.
Sample: All 17,951 O*NET tasks (U.S. occupational taxonomy) were scored for RL training feasibility using LLM annotators guided by a rubric developed with reinforcement-learning experts; task scores were aggregated to the occupation level to produce an RL Feasibility Index and validated against a set of confirmed AI deployment cases.
Themes: adoption, labor_markets, productivity
Generalizability:
  • Based on U.S. O*NET tasks, so it may not map precisely to other countries' occupational structures or informal work arrangements.
  • Score depends on current LLM capabilities and the chosen rubric; future model architectures or training paradigms could change learnability substantially.
  • Focuses on RL-style task learnability and may underweight other AI approaches (e.g., supervised fine-tuning, symbolic systems, multimodal hybrids).
  • Validation against confirmed deployments is helpful but likely limited in scope and may not cover edge cases or long-run operational constraints (safety, regulation, cost).
  • Aggregating task-level feasibility to occupations can mask within-occupation heterogeneity and the importance of task complementarities, coordination, and managerial constraints.

Claims (7)

  • Claim 1: We examine this for every occupation in the US economy.
    • Category: Automation Exposure
    • Direction: positive
    • Confidence: high
    • Outcome: coverage of US occupations in the RL feasibility analysis
    • Details: 0.3
  • Claim 2: Existing indices measure the overlap between AI capabilities and occupational tasks rather than which tasks AI systems can learn to perform, and as a result misclassify occupations where the gap between present capability and learnability is large.
    • Category: Automation Exposure
    • Direction: negative
    • Confidence: high
    • Outcome: accuracy/misclassification of occupations by AI-exposure indices vs. learnability-based index
    • Details: 0.18
  • Claim 3: Reinforcement learning in post-training, now the dominant paradigm at the frontier, is structured around task completion and maps more directly onto the task-based architecture of occupational classifications than prior approaches.
    • Category: Task Allocation
    • Direction: positive
    • Confidence: medium
    • Outcome: suitability of RL (post-training) for modeling occupational tasks
    • Details: 0.11
  • Claim 4: Using LLM annotators guided by a rubric developed with RL experts and validated against confirmed deployment cases, we score all 17,951 O*NET tasks for training feasibility and aggregate to the occupation level, producing an RL Feasibility Index.
    • Category: Automation Exposure
    • Direction: positive
    • Confidence: high
    • Outcome: training feasibility of O*NET tasks; RL Feasibility Index at task and occupation levels
    • Details: n=17951; 0.3
  • Claim 5: The index diverges sharply from existing AI exposure measures for specific occupation groups: power plant operators, railroad conductors, and aircraft cargo handling supervisors score high on RL feasibility but low on general AI exposure.
    • Category: Automation Exposure
    • Direction: positive
    • Confidence: high
    • Outcome: relative RL feasibility vs. general AI exposure for named occupations
    • Details: 0.18
  • Claim 6: Creative and interpersonal roles (musicians, physicians, natural sciences managers) show the reverse (i.e., they score low on RL feasibility but high on general AI exposure).
    • Category: Automation Exposure
    • Direction: negative
    • Confidence: high
    • Outcome: relative RL feasibility vs. general AI exposure for named creative/interpersonal occupations
    • Details: 0.18
  • Claim 7: These divergences carry direct implications for policy interventions.
    • Category: Governance And Regulation
    • Direction: mixed
    • Confidence: high
    • Outcome: policy relevance of measurement divergences
    • Details: 0.03
