A granular LLM-driven decomposition of 1,961 O*NET activities uncovers a stable bipolar structure of automation exposure—Tool‑Mediated Physical and Planning & Design sit at opposite poles—but the pole labeled 'high risk' has inverted since the Frey‑Osborne era, implying exposure rankings depend on the prevailing AI capability frontier.
Empirical research on the labor-market impact of artificial intelligence has converged, since Frey and Osborne (2017), on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions. This continuity is rarely articulated as an assumption and has not been tested at the micro-action level where substitution actually occurs. We decompose 1,961 O*NET Detailed Work Activities into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert HITL calibration, then project the DWA-level Occupational Automation Index from our prior work onto a 7-macro semantic typology. The result is a bipolar structure. Tool-Mediated Physical (M2, mean OAI = 0.054) and Planning & Design (M7, mean OAI = 0.499) form two extremes separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e-34). The geometry is robust under three independent stress tests: resolution (K=7 to K=15, polar gap widens from 0.45 to 0.57), encoder swap to BGE (LLM-class OAI lead replicates at 3.37x), and Eloundou's GPT-4 task ratings (DWA-level rho = 0.635). The six middle macros form a low-contrast band between the poles (TOST at d=0.2 admits only 1/15 pairs as equivalent), not a flat plain. The geometry's stability does not, however, extend to its content. Across a decade, the polarity has inverted. Frey-Osborne (2013) placed Tool-Mediated Physical near the highest computerisation risk and Planning & Design near the lowest; our LLM-era OAI reverses that order, with macro-level FO-Eloundou Spearman rho = -0.750, p = 0.020, against the original Oxford Martin appendix. Which pole is high is therefore contingent on the era's dominant capability frontier, while the stable geometry itself is the structurally robust object.
Summary
Main Finding
Occupational AI-substitutability is not a smooth single-dimensional gradient but has a stable bipolar geometry at the micro-action level: two extreme clusters (Tool‑Mediated Physical vs Planning & Design) flank a low‑contrast middle band. That geometry is robust to embedding, clustering resolution, and alternative exposure indicators — but which pole is the high‑exposure pole depends on the era’s dominant AI capability frontier. Comparing 2013 (Frey & Osborne) to LLM‑era ratings shows a decade-scale polarity inversion: the same structural poles swap high/low exposure.
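The distinction between geometry and polarity can be illustrated with a minimal Python sketch. Only the M2 and M7 means under the LLM-era OAI (0.054 and 0.499) come from the summary; all other macro means, including every Frey-Osborne-era value, are hypothetical placeholders, and the helper names (`poles`, `polar_gap`, `polarity_inverted`) are illustrative rather than the authors' code.

```python
# Geometry vs polarity: a sketch with mostly hypothetical macro means.

def poles(macro_means):
    """Return (lowest-exposure macro, highest-exposure macro)."""
    low = min(macro_means, key=macro_means.get)
    high = max(macro_means, key=macro_means.get)
    return low, high

def polar_gap(macro_means):
    """Distance between the highest and lowest macro means."""
    return max(macro_means.values()) - min(macro_means.values())

def polarity_inverted(era_a, era_b):
    """True when both eras share the same pole pair but swap high and low."""
    low_a, high_a = poles(era_a)
    low_b, high_b = poles(era_b)
    return low_a == high_b and high_a == low_b

# M2 and M7 are the reported LLM-era means; the middle macros are hypothetical.
llm_era = {"M1": 0.30, "M2": 0.054, "M3": 0.25, "M4": 0.20,
           "M5": 0.35, "M6": 0.28, "M7": 0.499}
# Entirely hypothetical Frey-Osborne-era scores, for illustration only.
fo_era = {"M1": 0.40, "M2": 0.85, "M3": 0.55, "M4": 0.60,
          "M5": 0.45, "M6": 0.50, "M7": 0.20}

print(poles(llm_era), round(polar_gap(llm_era), 3))
print(poles(fo_era), round(polar_gap(fo_era), 3))
print(polarity_inverted(llm_era, fo_era))
```

In this sketch the pole pair (M2, M7) is the same in both eras, which is the geometry, while the assignment of "high" and "low" flips, which is the polarity. The LLM-era gap of 0.499 − 0.054 = 0.445 appears consistent with the 0.45 polar gap reported at K = 7.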
Key Points
- Data scope: 1,961 O*NET Detailed Work Activities (DWAs) decomposed into 15,817 micro‑actions using a multi‑agent LLM pipeline with 31‑expert human‑in‑the‑loop calibration.
- Clustering pipeline: Sentence‑BERT/MPNet embeddings → UMAP reduction → HDBSCAN → 35 micro‑clusters → hierarchical Ward linkage → 7 macro‑clusters.
- Bipolar result: Two extreme macros, M2 (Tool‑Mediated Physical Execution; mean OAI = 0.054) and M7 (Planning & Design; mean OAI = 0.499), are strongly separated (Cohen’s d = 2.41; H = 172.88, p = 6.21×10⁻³⁴).
- Middle band: Six remaining macros form a low‑contrast band rather than a flat plateau; equivalence tests (TOST at d = 0.2) accept only 1 of 15 pairwise equivalences among them.
- Robustness tests:
  - Resolution sweep (K = 7 → 15): polar gap widens from 0.45 to 0.57.
  - Encoder swap: re-embedding with BGE replicates the LLM-class OAI lead at 3.37x.
  - Alternative indicator: Eloundou et al.’s GPT-4 task ratings reproduce the bipolar pattern (DWA-level ρ = 0.635; Cliff’s δ = 0.902 for M2 vs M7).
- Temporal inversion: The macro ordering under Frey & Osborne (2013) is largely the reverse of the LLM-era ordering (macro-level FO-Eloundou Spearman ρ = −0.750, p = 0.020).
- Mechanism/theory: A four‑way intelligence‑type taxonomy (Linguistic / Multimodal Perception / Embodied / Human‑Bound) predicts OAI (H = 527.6). Which intelligence type dominates (i.e., the era’s capability frontier) determines which macro maps to high exposure — explaining polarity inversion across eras.
- Methodological contribution: a reproducible multi‑layer clustering + validation cascade (Hartigan dip test, KS tests, Bonferroni‑corrected nonparametric pairwise tests, two‑one‑sided equivalence tests, resolution sweep).
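The temporal-inversion check in the list above rests on a rank correlation between two macro-level orderings. A minimal, standard-library sketch of Spearman's ρ follows. The two score vectors are hypothetical (only 0.054 and 0.499 echo reported M2 and M7 means), so the resulting coefficient is not meant to reproduce the reported −0.750.

```python
# Spearman rank correlation between two macro-level exposure orderings.
# All input scores below are hypothetical illustrations.

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Macro order: M1..M7. Hypothetical 2013-era and LLM-era macro scores.
fo_era = [0.40, 0.85, 0.55, 0.60, 0.45, 0.50, 0.20]
llm_era = [0.30, 0.054, 0.25, 0.20, 0.35, 0.28, 0.499]

rho = spearman_rho(fo_era, llm_era)
print(f"macro-level Spearman rho = {rho:.3f}")
```

A strongly negative ρ means that macros ranked high in one era tend to rank low in the other, which is how the summary operationalises polarity inversion.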
Data & Methods
- Primary inputs:
  - 1,961 O*NET Detailed Work Activities (DWAs).
  - Prior DWA-level Occupational Automation Index (OAI) from Gao et al. (2026), used as the exposure measure projected onto micro-action clusters.
  - External indicators for validation: Frey & Osborne (2013) Oxford Martin appendix and Eloundou et al. (2024) GPT-4 task ratings.
- Micro-action decomposition:
  - Multi-agent LLM pipeline generated 15,817 micro-action descriptions from DWA text, calibrated with a 31-expert human panel.
- Representation and clustering:
  - Sentence embeddings (Sentence-BERT / MPNet family; encoder swap to BGE run as a robustness check).
  - Dimensionality reduction via UMAP.
  - Density clustering via HDBSCAN → 35 micro-clusters.
  - Aggregation into 7 macro-clusters using hierarchical Ward linkage.
- Statistical tests and validation:
  - Distributional shape tests (Hartigan dip).
  - Pairwise contrasts: Mann-Whitney / nonparametric testing with Bonferroni correction.
  - Effect sizes: Cohen’s d, Cliff’s δ.
  - Equivalence testing: two-one-sided tests (TOST) at d = 0.2.
  - Resolution sweep across K = 7–15 to check stability of macro geometry.
  - External alignment: correlate and compare macro orderings with Frey & Osborne (2013) and Eloundou et al. (2024).
- Notable numerical results: 15,817 micro-actions; Cohen’s d = 2.41 for M2 vs M7; H = 172.88, p = 6.21×10⁻³⁴; polar gap 0.45 → 0.57 across resolutions; DWA-level rho with GPT-4 = 0.635; macro Spearman rho with FO2013 = −0.750 (p = 0.020); intelligence-type H = 527.6 (means: LLM-class 0.427 vs others 0.127).
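The two effect sizes named above, Cohen's d and Cliff's δ, can be sketched with the Python standard library. The two samples are hypothetical OAI values chosen to resemble a low pole and a high pole; they are not the paper's data, and the pooled-variance form of Cohen's d is an assumption about which variant was used.

```python
# Effect sizes for a two-group contrast, on hypothetical OAI samples.
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / pooled_var ** 0.5

def cliffs_delta(a, b):
    """Share of (a, b) pairs with b > a minus share with b < a; range [-1, 1]."""
    greater = sum(1 for x in a for y in b if y > x)
    less = sum(1 for x in a for y in b if y < x)
    return (greater - less) / (len(a) * len(b))

# Hypothetical DWA-level OAI values for a low-exposure and a high-exposure macro.
low_pole = [0.02, 0.05, 0.04, 0.08, 0.06]
high_pole = [0.45, 0.55, 0.50, 0.40, 0.60]

d = cohens_d(low_pole, high_pole)
delta = cliffs_delta(low_pole, high_pole)
print(f"Cohen's d = {d:.2f}, Cliff's delta = {delta:.2f}")
```

Cohen's d depends on means and variances, whereas Cliff's δ depends only on pairwise orderings, so reporting both guards against a large d that is driven by a few extreme values.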
Implications for AI Economics
- Question the continuous‑gradient default. Summary scalar scores (occupation → single [0,1] exposure) can obscure a bipolar micro‑action structure; cutoffs and “moderate/high” labels risk misclassifying occupations whose constituent actions lie at different poles.
- Forecasts are era‑dependent. Single‑point exposure forecasts that apply contemporaneous capability frontiers to project displacement are vulnerable: the structural geometry can persist while content (which clusters are high‑risk) flips as dominant model capabilities change. Forecasters should separate (a) geometry (structure of clusters/poles) from (b) polarity (which pole maps to high exposure) and quantify uncertainty in both.
- Targeting policy and reskilling. Policy interventions (retraining, social insurance, sectoral transition plans) should work at the action/task level and explicitly consider which intelligence types (linguistic, perception, embodied, human‑bound) are exposed under present and plausible future capability frontiers. Different waves of capability will shift which types of work are most at risk.
- Methodological recommendations for empirical work:
- Use micro‑action decomposition and embedding‑based clustering to detect non‑continuous structure before imposing scalar aggregation.
- Validate across encoders, resolutions, and independent exposure indicators (human, LLM) to distinguish stable geometry from era‑contingent polarity.
- Report both structural geometry (cluster topology, polar gaps) and content assignment (which clusters are high exposure) separately.
- Theory refinement: Incorporate capability‑era dependence into models of task displacement. Models should allow the mapping from task types to substitutability to change with the evolving capability frontier (e.g., linguistic/LLM advances vs embodied/robotics advances), not treat the mapping as time‑invariant.
- Practical research agenda:
- Replicate across other national occupational taxonomies and languages.
- Track polarity shifts over shorter timescales as model families advance (LLMs, multimodal models, embodied agents).
- Integrate micro‑action exposure measures into macro projections (aggregate employment, wage distribution) while explicitly modeling polarity uncertainty.
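The first implication in the list above, that a single scalar score can hide a bipolar action mix, can be shown with a small Python sketch. Both occupations, their micro-action scores, and the two cutoffs (`low_cut`, `high_cut`) are hypothetical and chosen only to make the point visible.

```python
# Two hypothetical occupations with the same mean exposure but different
# micro-action composition.
from statistics import mean

def scalar_exposure(action_scores):
    """Occupation-level score as a plain average of micro-action scores."""
    return mean(action_scores)

def pole_shares(action_scores, low_cut, high_cut):
    """Fraction of micro-actions at or below low_cut, at or above high_cut,
    and in between."""
    n = len(action_scores)
    low = sum(s <= low_cut for s in action_scores) / n
    high = sum(s >= high_cut for s in action_scores) / n
    return {"low": low, "mid": 1 - low - high, "high": high}

# Hypothetical micro-action scores.
mixed_occupation = [0.05, 0.06, 0.04, 0.50, 0.52, 0.48]     # actions at both poles
uniform_occupation = [0.27, 0.28, 0.275, 0.27, 0.28, 0.275]  # all mid-range

for name, scores in [("mixed", mixed_occupation), ("uniform", uniform_occupation)]:
    shares = pole_shares(scores, low_cut=0.10, high_cut=0.45)
    print(name, round(scalar_exposure(scores), 3), shares)
```

Both occupations average to roughly the same scalar exposure, yet one consists entirely of pole actions and the other entirely of mid-band actions, so a single "moderate" label would describe them identically.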
Limitations (brief): results depend on the LLM decomposition pipeline and labelling choices, and on OAI construction; cross‑country generalisability and how rapidly polarity can change with future capability shifts remain open empirical questions.
Assessment
Claims (13)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Empirical research since Frey and Osborne (2017) has converged on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions. [Other] | null_result | high | use of continuous-gradient occupational exposure scores (OAI-style representation) | 0.09 |
| We decomposed 1,961 O*NET Detailed Work Activities (DWAs) into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert human-in-the-loop (HITL) calibration. [Task Allocation] | null_result | high | task decomposition (DWAs to micro-actions) | n=1961; 15,817 micro-actions; 0.3 |
| Projecting the DWA-level Occupational Automation Index (OAI) onto a 7-macro semantic typology produces a bipolar structure (two poles separated by a low-contrast middle band). [Automation Exposure] | mixed | high | structure of macro-level OAI distribution (bipolarity between macros) | n=7; 0.3 |
| Tool-Mediated Physical (macro M2) has mean OAI = 0.054. [Automation Exposure] | negative | high | mean Occupational Automation Index (OAI) for macro M2 | mean OAI = 0.054; 0.18 |
| Planning & Design (macro M7) has mean OAI = 0.499. [Automation Exposure] | positive | high | mean Occupational Automation Index (OAI) for macro M7 | mean OAI = 0.499; 0.18 |
| Tool-Mediated Physical (M2) and Planning & Design (M7) are separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e-34). [Automation Exposure] | mixed | high | effect size (standardized mean difference) between macro M2 and M7 OAI distributions | Cohen's d = 2.41; 0.3 |
| The inferred geometry is robust under a resolution stress test: when K (number of clusters) is varied from 7 to 15 the polar gap widens from 0.45 to 0.57. [Automation Exposure] | positive | high | polar gap (distance between poles) as clustering resolution varies | polar gap widens from 0.45 to 0.57; 0.18 |
| The geometry replicates under an encoder swap to BGE: 'LLM-class OAI lead' replicates at 3.37x. [Automation Exposure] | positive | medium | replication of LLM-derived OAI lead when using alternate embedding encoder (BGE) | replicates at 3.37x; 0.11 |
| The geometry replicates against Eloundou et al.'s GPT-4 task ratings: DWA-level correlation rho = 0.635. [Automation Exposure] | positive | high | correlation between DWA-level OAI and GPT-4 task ratings | n=1961; rho = 0.635; 0.3 |
| The six middle macros form a low-contrast band between the poles; equivalence testing (TOST at d = 0.2) admits only 1 out of 15 macro-pair comparisons as equivalent. [Automation Exposure] | null_result | high | pairwise equivalence among middle macros (TOST results) | TOST at d=0.2 admits only 1/15 pairs as equivalent; 0.18 |
| Although the geometry (bipolar structure) is stable, its content is not: across a decade the polarity has inverted relative to Frey and Osborne (2013). [Automation Exposure] | mixed | high | change in directionality of macro-level automation risk (polarity) over time | 0.18 |
| Macro-level correlation between Frey-Osborne (2013) and Eloundou-era rankings is Spearman rho = -0.750, p = 0.020 (against the original Oxford Martin appendix), indicating inversion. [Automation Exposure] | negative | high | Spearman correlation between historical and current macro-level automation-risk rankings | n=7; Spearman rho = -0.750, p = 0.020; 0.18 |
| Which pole is higher in automation exposure is contingent on the era's dominant capability frontier, while the bipolar geometry itself is structurally robust. [Automation Exposure] | mixed | medium | dependence of pole ranking on era-specific capability frontiers; stability of geometry | 0.02 |