Stable Geometry, Reversing Poles: The Bipolar Structure of AI Occupational Substitutability and Its Decade-Scale Inversion

Empirical research on the labor-market impact of artificial intelligence has converged, since Frey and Osborne (2017), on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions. This continuity is rarely articulated as an assumption and has not been tested at the micro-action level where substitution actually occurs. We decompose 1,961 O*NET Detailed Work Activities into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert HITL calibration, then project the DWA-level Occupational Automation Index from our prior work onto a 7-macro semantic typology. The result is a bipolar structure. Tool-Mediated Physical (M2, mean OAI = 0.054) and Planning & Design (M7, mean OAI = 0.499) form two extremes separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e-34). The geometry is robust under three independent stress tests: resolution (K=7 to K=15, polar gap widens from 0.45 to 0.57), encoder swap to BGE (LLM-class OAI lead replicates at 3.37x), and Eloundou's GPT-4 task ratings (DWA-level rho = 0.635). The six middle macros form a low-contrast band between the poles (TOST at d=0.2 admits only 1/15 pairs as equivalent), not a flat plain. The geometry's stability does not, however, extend to its content. Across a decade, the polarity has inverted. Frey-Osborne (2013) placed Tool-Mediated Physical near the highest computerisation risk and Planning & Design near the lowest; our LLM-era OAI reverses that order, with macro-level FO-Eloundou Spearman rho = -0.750, p = 0.020, against the original Oxford Martin appendix. Which pole is high is therefore contingent on the era's dominant capability frontier, while the stable geometry itself is the structurally robust object.

Summary

Main Finding

Occupational AI-substitutability is not a smooth single-dimensional gradient but has a stable bipolar geometry at the micro-action level: two extreme clusters (Tool‑Mediated Physical vs Planning & Design) flank a low‑contrast middle band. That geometry is robust to embedding, clustering resolution, and alternative exposure indicators — but which pole is the high‑exposure pole depends on the era’s dominant AI capability frontier. Comparing 2013 (Frey & Osborne) to LLM‑era ratings shows a decade-scale polarity inversion: the same structural poles swap high/low exposure.

Key Points

Data scope: 1,961 O*NET Detailed Work Activities (DWAs) decomposed into 15,817 micro‑actions using a multi‑agent LLM pipeline with 31‑expert human‑in‑the‑loop calibration.
Clustering pipeline: Sentence‑BERT/MPNet embeddings → UMAP reduction → HDBSCAN → 35 micro‑clusters → hierarchical Ward linkage → 7 macro‑clusters.
Bipolar result: Two extreme macros, M2 (Tool‑Mediated Physical Execution; mean OAI = 0.054) and M7 (Planning & Design; mean OAI = 0.499), are strongly separated (Cohen’s d = 2.41; H = 172.88, p = 6.21×10^−34).
Middle band: Six remaining macros form a low‑contrast band rather than a flat plateau; equivalence tests (TOST at d = 0.2) accept only 1 of 15 pairwise equivalences among them.
Robustness tests:
- Resolution sweep (K = 7 → 15): polar gap widens from 0.45 to 0.57.
- Encoder swap: re‑embedding with BGE replicates the LLM‑class OAI lead at 3.37x.
- Alternative indicator: Eloundou et al.’s GPT‑4 task ratings reproduce the bipolar pattern (DWA‑level ρ = 0.635; Cliff’s δ = 0.902 for M2 vs M7).
Temporal inversion: The macro ordering under Frey & Osborne (2013) is largely the inverse of the LLM‑era OAI ordering (macro Spearman ρ = −0.750, p = 0.020).
Mechanism/theory: A four‑way intelligence‑type taxonomy (Linguistic / Multimodal Perception / Embodied / Human‑Bound) predicts OAI (H = 527.6). Which intelligence type dominates (i.e., the era’s capability frontier) determines which macro maps to high exposure — explaining polarity inversion across eras.
Methodological contribution: a reproducible multi‑layer clustering + validation cascade (Hartigan dip test, KS tests, Bonferroni‑corrected nonparametric pairwise tests, two‑one‑sided equivalence tests, resolution sweep).
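
The two effect sizes quoted in the key points above, Cohen's d and Cliff's δ, can be sketched in a few lines. This is a minimal illustration, assuming the pooled-standard-deviation form of Cohen's d and the all-pairs form of Cliff's δ; the digest does not say which variants the authors used, and the scores below are invented for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled sample SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var)


def cliffs_delta(a, b):
    """P(b > a) minus P(b < a) over all cross-group pairs; ranges from -1 to 1."""
    greater = sum(1 for x in a for y in b if y > x)
    less = sum(1 for x in a for y in b if y < x)
    return (greater - less) / (len(a) * len(b))


# Hypothetical exposure scores for a low pole and a high pole (not paper data).
low_pole = [0.01, 0.03, 0.05, 0.08, 0.10]
high_pole = [0.35, 0.45, 0.50, 0.55, 0.65]

print(cohens_d(low_pole, high_pole))
print(cliffs_delta(low_pole, high_pole))
```

Cohen's d depends on the means and variances, while Cliff's δ depends only on the ordering of values across groups, so reporting both guards against an effect size that is driven by a few extreme scores.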

Data & Methods

Primary inputs:
- 1,961 O*NET Detailed Work Activities (DWAs).
- DWA‑level Occupational Automation Index (OAI) from prior work (Gao et al., 2026), used as the exposure measure and projected onto the micro‑action clusters.
- External indicators for validation: Frey & Osborne (2013) Oxford Martin appendix and Eloundou et al. (2024) GPT‑4 task ratings.
Micro‑action decomposition:
- Multi‑agent LLM pipeline generated 15,817 micro‑action descriptions from DWA text, calibrated with a 31‑expert human panel.
Representation and clustering:
- Sentence embeddings (Sentence‑BERT / MPNet family; encoder swaps run with BGE as robustness check).
- Dimensionality reduction via UMAP.
- Density clustering via HDBSCAN → 35 micro‑clusters.
- Aggregation into 7 macro‑clusters using hierarchical Ward linkage.
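
The final aggregation step, Ward linkage over cluster centroids, can be sketched without external libraries. The function below is a minimal pure-Python illustration, not the authors' implementation: it assumes squared Euclidean distances between centroid vectors and uses the Lance-Williams update for Ward's criterion. The name `ward_clusters` and the toy coordinates are hypothetical; in the setting described above the input would be 35 micro-cluster centroids and `k` would be 7.

```python
from itertools import combinations


def ward_clusters(points, k):
    """Merge points bottom-up with Ward's criterion until k clusters remain.

    Returns one integer label per point. Cluster distances are updated with
    the Lance-Williams formula for Ward linkage on squared Euclidean distances.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    members = {i: [i] for i in range(len(points))}
    dist = {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(len(points)), 2)
    }
    next_id = len(points)
    while len(members) > k:
        # Pick the pair of clusters whose merge is cheapest.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        n_i, n_j = len(members[i]), len(members[j])
        updated = {}
        for m, group in members.items():
            if m in (i, j):
                continue
            n_m = len(group)
            d_im = dist[(min(i, m), max(i, m))]
            d_jm = dist[(min(j, m), max(j, m))]
            updated[(m, next_id)] = (
                (n_i + n_m) * d_im + (n_j + n_m) * d_jm - n_m * d_ij
            ) / (n_i + n_j + n_m)
        # Drop distances involving the merged clusters, then add the new ones.
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        dist.update(updated)
        members[next_id] = members.pop(i) + members.pop(j)
        next_id += 1
    labels = [0] * len(points)
    for label, group in enumerate(sorted(members.values(), key=min)):
        for index in group:
            labels[index] = label
    return labels


# Toy 2-D "centroids" forming three well-separated groups (invented data).
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
print(ward_clusters(toy, 3))
```

Ward's criterion merges the pair of clusters that least increases within-cluster variance, which tends to produce compact, similarly sized macro-clusters. In practice a library routine such as SciPy's hierarchical clustering would replace this cubic-time loop.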
Statistical tests and validation:
- Distributional shape tests (Hartigan dip).
- Pairwise contrasts: Mann‑Whitney / nonparametric testing with Bonferroni correction.
- Effect sizes: Cohen’s d, Cliff’s δ.
- Equivalence testing: two‑one‑sided tests (TOST) at d = 0.2.
- Resolution sweep across K = 7–15 to check stability of macro geometry.
- External alignment: correlate and compare macro orderings with Frey & Osborne (2013) and Eloundou et al. (2024).
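
The equivalence step in the list above can be sketched as follows. This is a large-sample approximation, assuming a normal sampling distribution for the standardized difference d. The authors' exact TOST procedure is not described in this summary, and the function name, sample sizes, and macro labels are hypothetical. Six middle macros yield C(6,2) = 15 pairs, matching the 15 comparisons reported.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def tost_equivalent(d, n1, n2, bound=0.2, alpha=0.05):
    """Two one-sided tests on a standardized mean difference d.

    Returns (p_value, is_equivalent). Equivalence is declared only if d is
    significantly above -bound AND significantly below +bound.
    The normal approximation is an assumption of this sketch.
    """
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    norm = NormalDist()
    p_lower = 1 - norm.cdf((d + bound) / se)  # tests H0: d <= -bound
    p_upper = norm.cdf((d - bound) / se)      # tests H0: d >= +bound
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Placeholder labels for six middle macros; the pairs are what gets tested.
middle_macros = ["A", "B", "C", "D", "E", "F"]
pairs = list(combinations(middle_macros, 2))
print(len(pairs))

# A zero observed difference is "equivalent" only with enough data.
print(tost_equivalent(0.0, 2000, 2000))
print(tost_equivalent(0.0, 20, 20))
```

The last two calls show why equivalence testing is stricter than a failed difference test: with small groups, even an observed difference of zero cannot be declared equivalent within d = 0.2.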
Notable numerical results: 15,817 micro‑actions; Cohen’s d = 2.41 for M2 vs M7; H = 172.88, p = 6.21×10^−34; polar gap 0.45 → 0.57 across resolutions; DWA‑level ρ with GPT‑4 task ratings = 0.635; macro Spearman ρ with Frey & Osborne (2013) = −0.750 (p = 0.020); intelligence‑type H = 527.6 (means: LLM‑class 0.427 vs others 0.127).
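
The macro-level comparison with Frey & Osborne rests on Spearman's rank correlation, which is the Pearson correlation of the two rank vectors. A minimal sketch with average ranks for ties is shown below. The helper names and the example scores are hypothetical and do not reproduce the paper's macro-level values.

```python
from math import sqrt


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average
        pos = end + 1
    return result


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented macro-level scores under two eras (illustration only).
earlier_era = [0.80, 0.70, 0.55, 0.40, 0.30]
later_era = [0.10, 0.25, 0.20, 0.45, 0.50]
print(spearman_rho(earlier_era, later_era))
```

A value near −1 means the two orderings are close to mirror images. Because only ranks enter the calculation, the statistic is insensitive to the different scales of the two exposure measures.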

Implications for AI Economics

Question the continuous‑gradient default. Summary scalar scores (occupation → single [0,1] exposure) can obscure a bipolar micro‑action structure; cutoffs and “moderate/high” labels risk misclassifying occupations whose constituent actions lie at different poles.
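
A toy calculation illustrates the misclassification risk described above. The cutoffs, function names, and scores are all hypothetical, chosen only to show that two occupations with the same scalar mean can have very different internal structure.

```python
from statistics import mean


def exposure_label(score, cutoffs=(0.2, 0.4)):
    """Map a scalar score to a coarse label; the cutoffs are hypothetical."""
    low_cut, high_cut = cutoffs
    if score < low_cut:
        return "low"
    if score < high_cut:
        return "moderate"
    return "high"


def summarize(action_scores):
    """Return the scalar mean, its label, and the within-occupation spread."""
    average = mean(action_scores)
    spread = max(action_scores) - min(action_scores)
    return average, exposure_label(average), spread


# Invented micro-action scores for two hypothetical occupations.
split_occupation = [0.05] * 5 + [0.50] * 5    # half the actions near each pole
uniform_occupation = [0.275] * 10             # every action mid-range

print(summarize(split_occupation))
print(summarize(uniform_occupation))
```

Both occupations receive the same "moderate" label, yet in the first one half of the actions sit near the high pole. Reporting the spread, or the share of actions at each pole, alongside the mean keeps that difference visible.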
Forecasts are era‑dependent. Single‑point exposure forecasts that apply contemporaneous capability frontiers to project displacement are vulnerable: the structural geometry can persist while content (which clusters are high‑risk) flips as dominant model capabilities change. Forecasters should separate (a) geometry (structure of clusters/poles) from (b) polarity (which pole maps to high exposure) and quantify uncertainty in both.
Targeting policy and reskilling. Policy interventions (retraining, social insurance, sectoral transition plans) should work at the action/task level and explicitly consider which intelligence types (linguistic, perception, embodied, human‑bound) are exposed under present and plausible future capability frontiers. Different waves of capability will shift which types of work are most at risk.
Methodological recommendations for empirical work:
- Use micro‑action decomposition and embedding‑based clustering to detect non‑continuous structure before imposing scalar aggregation.
- Validate across encoders, resolutions, and independent exposure indicators (human, LLM) to distinguish stable geometry from era‑contingent polarity.
- Report both structural geometry (cluster topology, polar gaps) and content assignment (which clusters are high exposure) separately.
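
One way to report geometry and polarity separately is to derive both from the same table of cluster means. The sketch below is illustrative only: the function name, cluster labels, and values are invented, and the "polar gap" is assumed here to be the difference between the highest and lowest cluster means.

```python
def geometry_and_polarity(cluster_means):
    """Split a {cluster: mean exposure} mapping into structure and content."""
    low = min(cluster_means, key=cluster_means.get)
    high = max(cluster_means, key=cluster_means.get)
    return {
        # Geometry: which clusters are extreme and how far apart they are.
        "poles": frozenset({low, high}),
        "polar_gap": cluster_means[high] - cluster_means[low],
        # Polarity: which of the two poles carries the high exposure.
        "high_pole": high,
    }


# Invented cluster means under two hypothetical capability eras.
era_one = {"physical": 0.70, "clerical": 0.45, "planning": 0.20}
era_two = {"physical": 0.10, "clerical": 0.30, "planning": 0.60}

first = geometry_and_polarity(era_one)
second = geometry_and_polarity(era_two)
print(first["poles"] == second["poles"])          # same geometry
print(first["high_pole"], second["high_pole"])    # different polarity
```

Keeping the two outputs apart makes it possible to state that the poles are unchanged between eras while the high-exposure pole has switched.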
Theory refinement: Incorporate capability‑era dependence into models of task displacement. Models should allow the mapping from task types to substitutability to change with the evolving capability frontier (e.g., linguistic/LLM advances vs embodied/robotics advances), not treat the mapping as time‑invariant.
Practical research agenda:
- Replicate across other national occupational taxonomies and languages.
- Track polarity shifts over shorter timescales as model families advance (LLMs, multimodal models, embodied agents).
- Integrate micro‑action exposure measures into macro projections (aggregate employment, wage distribution) while explicitly modeling polarity uncertainty.

Limitations (brief): results depend on the LLM decomposition pipeline and labelling choices, and on OAI construction; cross‑country generalisability and how rapidly polarity can change with future capability shifts remain open empirical questions.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides robust descriptive evidence about the semantic geometry of automation exposure using large-scale decomposition, multiple stress tests (resolution, encoder swap, external ratings), and statistical tests; however it does not identify causal effects on economic outcomes and relies on LLM-derived decompositions and calibrated human-in-the-loop judgments, which creates measurement and construct-validity limits.
Methods Rigor: medium. The authors apply a systematic multi-agent LLM pipeline, 31-expert HITL calibration, replication with a different encoder (BGE), and external comparison to GPT-4 ratings and historical Frey-Osborne scores, with appropriate statistical testing; nevertheless the approach depends on model-generated micro-actions, choice of semantic clustering (K), and mapping procedures that may introduce subjective choices and model-era biases.
Sample: 1,961 O*NET Detailed Work Activities (DWAs) were decomposed into 15,817 micro-actions using a multi-agent large language model pipeline with 31-expert human-in-the-loop calibration; the authors project a previously constructed DWA-level Occupational Automation Index (OAI) onto a 7-cluster (macro) semantic typology and evaluate robustness via alternative resolutions (K=7..15), encoder swap to BGE, and comparison to independent GPT-4 task ratings and the Frey-Osborne (Oxford Martin) appendix.
Themes: labor_markets, adoption
Generalizability:
- Relies on O*NET DWAs (U.S.-centric occupational taxonomy) which may not map cleanly to other countries' job descriptions
- Decomposition depends on contemporary LLM behavior and training data; results may change as LLMs and capability frontiers evolve
- Human-in-the-loop calibration introduces expert judgment that may reflect sample of 31 experts rather than broader consensus
- Macro-clustering choice (number of clusters and semantic labels) and mapping of OAI to macros are researcher choices that limit transferability
- Findings describe exposure geometry, not realized automation outcomes; generalization to actual firm-level adoption or wage effects is limited

Claims (13)

Claim	Category	Direction	Confidence	Outcome	Details
Empirical research since Frey and Osborne (2017) has converged on a continuous-gradient representation in which each occupation is assigned a real-valued exposure score on [0,1] obtained by linear aggregation across capability dimensions.	Other	null_result	high	use of continuous-gradient occupational exposure scores (OAI-style representation)	0.09
We decomposed 1,961 O*NET Detailed Work Activities (DWAs) into 15,817 micro-actions using a multi-agent LLM pipeline with 31-expert human-in-the-loop (HITL) calibration.	Task Allocation	null_result	high	task decomposition (DWAs to micro-actions)	n=1961 15,817 micro-actions 0.3
Projecting the DWA-level Occupational Automation Index (OAI) onto a 7-macro semantic typology produces a bipolar structure (two poles separated by a low-contrast middle band).	Automation Exposure	mixed	high	structure of macro-level OAI distribution (bipolarity between macros)	n=7 0.3
Tool-Mediated Physical (macro M2) has mean OAI = 0.054.	Automation Exposure	negative	high	mean Occupational Automation Index (OAI) for macro M2	mean OAI = 0.054 0.18
Planning & Design (macro M7) has mean OAI = 0.499.	Automation Exposure	positive	high	mean Occupational Automation Index (OAI) for macro M7	mean OAI = 0.499 0.18
Tool-Mediated Physical (M2) and Planning & Design (M7) are separated by Cohen's d = 2.41 (H = 172.88, p = 6.21e-34).	Automation Exposure	mixed	high	effect size (standardized mean difference) between macro M2 and M7 OAI distributions	Cohen's d = 2.41 0.3
The inferred geometry is robust under a resolution stress test: when K (number of clusters) is varied from 7 to 15 the polar gap widens from 0.45 to 0.57.	Automation Exposure	positive	high	polar gap (distance between poles) as clustering resolution varies	polar gap widens from 0.45 to 0.57 0.18
The geometry replicates under an encoder swap to BGE: 'LLM-class OAI lead' replicates at 3.37x.	Automation Exposure	positive	medium	replication of LLM-derived OAI lead when using alternate embedding encoder (BGE)	replicates at 3.37x 0.11
The geometry replicates against Eloundou et al.'s GPT-4 task ratings: DWA-level correlation rho = 0.635.	Automation Exposure	positive	high	correlation between DWA-level OAI and GPT-4 task ratings	n=1961 rho = 0.635 0.3
The six middle macros form a low-contrast band between the poles; equivalence testing (TOST at d = 0.2) admits only 1 out of 15 macro-pair comparisons as equivalent.	Automation Exposure	null_result	high	pairwise equivalence among middle macros (TOST results)	TOST at d=0.2 admits only 1/15 pairs as equivalent 0.18
Although the geometry (bipolar structure) is stable, its content is not: across a decade the polarity has inverted relative to Frey and Osborne (2013).	Automation Exposure	mixed	high	change in directionality of macro-level automation risk (polarity) over time	0.18
Macro-level correlation between Frey-Osborne (2013) and Eloundou-era rankings is Spearman rho = -0.750, p = 0.020 (against the original Oxford Martin appendix), indicating inversion.	Automation Exposure	negative	high	Spearman correlation between historical and current macro-level automation-risk rankings	n=7 Spearman rho = -0.750, p = 0.020 0.18
Which pole is higher in automation exposure is contingent on the era's dominant capability frontier, while the bipolar geometry itself is structurally robust.	Automation Exposure	mixed	medium	dependence of pole ranking on era-specific capability frontiers; stability of geometry	0.02