A granular LLM-and-expert assessment finds data-heavy cognitive occupations unusually exposed to automation while hands-on trades and high-stakes caregiving remain resilient; liability and compliance concerns create an emerging wage 'compliance premium' that could reshape who benefits from AI. The paper provides a diagnostic exposure index for 923 U.S. occupations but does not demonstrate realized employment or wage effects.

Bounded by Risk, Not Capability: Quantifying AI Occupational Substitution Rates via a Tech-Risk Dual-Factor Model

Shuyao Gao, Minghao Huang · April 06, 2026

arXiv · descriptive · low evidence · relevance 7/10

Using an LLM ensemble plus expert HITL validation on 2,087 DWAs across 923 occupations, the paper constructs Relative Occupational Automation Indices showing high exposure for non-routine symbolic cognitive roles (OAI ≈ 0.70) and low exposure for unstructured physical and high-stakes caretaking roles, highlighting an emergent 'Compliance Premium'.

The deployment of Large Language Models (LLMs) has ignited concerns about technological unemployment. Existing task-based evaluations predominantly measure theoretical "exposure" to AI capabilities, ignoring critical frictions of real-world commercial adoption: liability, compliance, and physical safety. We argue occupations are not eradicated instantaneously, but gradually encroached upon via atomic actions. We introduce a Tech-Risk Dual-Factor Model to re-evaluate this. By deconstructing 923 occupations into 2,087 Detailed Work Activities (DWAs), we utilize a multi-agent LLM ensemble to score both technical feasibility and business risk. Through variance-based Human-in-the-Loop (HITL) validation with an expert panel, we demonstrate a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability. Applying a strictly algorithmic baseline via mathematical bottleneck aggregation, we calculate Relative Occupational Automation Indices ($OAI$) for the U.S. labor market. Our findings challenge the traditional Routine-Biased Technological Change (RBTC) hypothesis. Non-routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure ($OAI \approx 0.70$). Conversely, unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience, quantifying a profound "Cognitive Risk Asymmetry." We hypothesize the emergent necessity of a "Compliance Premium," indicating wage resilience increasingly tied to risk-absorption capacity. We frame these findings as a cross-sectional diagnostic of systemic vulnerability, establishing a foundation for subsequent Computable General Equilibrium (CGE) econometric modeling involving dynamic wage elasticity and structural labor reallocation.

Summary

Main Finding

The paper develops a Tech‑Risk Dual‑Factor Model that measures occupational replaceability by separating purely technical capability from commercial deployment risk. Using 2,087 Detailed Work Activities (DWAs) derived from O*NET and a multi‑model LLM ensemble (validated with a 31‑expert human panel), the authors compute a risk‑adjusted Occupational Automation Index (OAI) for U.S. work. Key empirical findings: non‑routine cognitive roles that rely on symbolic manipulation (e.g., Data Scientists) exhibit high exposure (OAI ≈ 0.70), while unstructured physical trades and high‑stakes caretaking roles show strong resilience. A measurable “Cognitive Risk Asymmetry” (+0.35 shift in human risk ratings vs. algorithmic baseline) and an emergent “Compliance Premium” (wage resilience linked to risk‑absorption capacity) are identified. The paper argues that risk, not capability alone, governs real‑world substitution and that task encroachment is gradual and non‑linear.

Key Points

Conceptual advance: disentangles AI technical feasibility from institutional/commercial risk and operationalizes both at the atomic action (DWA) level.
Granularity: 923 occupations → 2,087 DWAs (atomic actions) used as the unit of analysis.
Dual scores per DWA:
- Tech Level (0–3): AI technical feasibility to execute autonomously.
- Risk Score (1–5): business/legal/safety consequence of AI failure.
Mass scoring: ensemble of four open‑source LLMs (Qwen2.5‑32B, Gemma‑2‑27b‑it, Llama‑3.1‑8B, Mistral‑Nemo‑Instruct‑2407) run locally (quantized) to produce baseline capability and risk probabilities.
Human‑in‑the‑Loop (HITL) validation:
- 31 cross‑disciplinary experts (Technology cohort n=11; Risk & Management cohort n=20) across US/China/Korea.
- Expert panel was epistemically pre‑qualified to rule out ignorance as the cause of elevated risk perceptions.
- Variance‑based stratified sampling (Consensus, Slight Friction, Severe Divergence) targeted DWAs with differing model agreement.
Cognitive Risk Asymmetry: human management experts systematically assign higher risk ratings than the algorithmic baseline in ambiguous cases; Ordered Logit: evaluator dummy β = 0.65 (p < 0.001). Wilcoxon matched pairs W = 130.5, p < 0.01.
Non‑linear institutional mapping: the model does not simply add human risk inflation to the AI baseline, which would amount to “double penalization” (inflating the input and also applying a severe degradation filter). Instead, it applies non‑linear caps to the action‑level Automation Index:
- If Risk = 4 → Automation Index capped at 0.3.
- If Risk = 5 → absolute veto (Automation Index = 0).
Empirical pattern: many DWAs cluster in medium‑to‑high risk zones (R ≥ 3), implying deployment friction despite technical feasibility.
Challenges RBTC: LLMs target non‑routine cognitive work; the distribution of exposure is shifting away from classic routine‑biased narratives.
Policy/market implication: emergence of a “Compliance Premium” — wages and employment outcomes increasingly tied to an occupation’s ability to absorb institutional/legal risk.
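The cap and veto rules listed under the non‑linear institutional mapping can be sketched in a few lines. This is only an illustration: the summary states the two caps (Risk = 4 and Risk = 5) but not the baseline that converts a Tech Level into an index, so the linear baseline `tech_level / 3` is an assumption, as is leaving Risk 1 to 3 unmodified. The function name `automation_index` is hypothetical.

```python
def automation_index(tech_level: int, risk_score: int) -> float:
    """Map a (Tech Level, Risk Score) pair to an action-level automation index.

    ASSUMPTION: the baseline is linear in Tech Level (tech_level / 3).
    Only the Risk = 4 cap and the Risk = 5 veto come from the summary.
    """
    if not 0 <= tech_level <= 3:
        raise ValueError("tech_level must be in 0..3")
    if not 1 <= risk_score <= 5:
        raise ValueError("risk_score must be in 1..5")

    baseline = tech_level / 3.0  # ASSUMPTION: not specified in the source

    if risk_score == 5:
        return 0.0                 # absolute veto
    if risk_score == 4:
        return min(baseline, 0.3)  # cap at 0.3
    return baseline                # ASSUMPTION: Risk 1..3 leaves baseline as is


if __name__ == "__main__":
    for tech, risk in [(3, 1), (3, 4), (3, 5), (1, 4)]:
        print(tech, risk, automation_index(tech, risk))
```

Under these assumptions a fully feasible action (Tech Level 3) drops from 1.0 to 0.3 at Risk 4 and to 0 at Risk 5, which is the sense in which risk rather than capability bounds the index.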

Data & Methods

Primary data: O*NET v30.2, decomposed into 2,087 Detailed Work Activities (DWAs) representing atomic actions.
LLM ensemble and execution:
- Models: Qwen2.5‑32B‑Instruct, Gemma‑2‑27b‑it, Meta‑Llama‑3.1‑8B‑Instruct, Mistral‑Nemo‑Instruct‑2407.
- Hardware: local deployment on dual NVIDIA RTX 3090; models quantized to Q4_K_M GGUF.
- Prompting: zero‑shot instructional framing to elicit Tech Level (0–3) and Risk Score (1–5).
- Ensemble outputs averaged and rounded to the nearest integer to produce T_i and R_i.
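The averaging and rounding step can be sketched as follows. The summary does not say how ties are broken when the four‑model mean ends in .5, so rounding half up is an assumption here (Python's built‑in `round` would round half to even instead). The function name `ensemble_score` and the clamping to the scale bounds are also hypothetical.

```python
import math
from statistics import mean


def ensemble_score(model_scores: list[int], lo: int, hi: int) -> int:
    """Average per-model scores and round to the nearest integer.

    ASSUMPTION: ties (x.5) round half up; the source only says
    "nearest integer". The result is clamped to the scale [lo, hi].
    """
    if not model_scores:
        raise ValueError("need at least one model score")
    avg = mean(model_scores)
    rounded = math.floor(avg + 0.5)  # half-up rounding (assumed tie rule)
    return max(lo, min(hi, rounded))


if __name__ == "__main__":
    # Hypothetical scores from four models for one DWA.
    tech = ensemble_score([3, 3, 2, 2], lo=0, hi=3)  # Tech Level scale 0..3
    risk = ensemble_score([1, 2, 2, 2], lo=1, hi=5)  # Risk Score scale 1..5
    print(tech, risk)
```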
HITL protocol:
- Variance‑based stratification: 100 sampled DWAs split into Consensus (σ² = 0, n = 49), Slight Friction (σ² = 0.25, n = 17), Severe Divergence (σ² ≥ 0.33, n = 34).
- Expert panel composition: Technology cohort (n = 11), Risk & Management cohort (n = 20); epistemic qualification provided to Risk cohort.
- Manipulation check: Tech Level Spearman’s ρ = 0.876 (p = 8.60e‑33), confirming expert comprehension of technical capabilities.
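The three strata can be reproduced with a small classifier over the per‑model scores. One observation supports the sketch: for four integer scores, the sample variance (n − 1 denominator) is exactly 0.25 when one model differs from the rest by one point and 1/3 when the models split two against two by one point, which matches the reported thresholds. Whether the authors used the sample variance, and whether it was computed on risk or technical scores, is not stated, so both are assumptions. The name `stratum` is hypothetical.

```python
from statistics import variance


def stratum(model_scores: list[int]) -> str:
    """Assign a DWA to a disagreement stratum from its per-model scores.

    ASSUMPTION: sample variance (n - 1 denominator) over the four models;
    any variance strictly between 0 and 0.33 counts as Slight Friction.
    """
    var = variance(model_scores)
    if var == 0:
        return "Consensus"
    if var < 0.33:
        return "Slight Friction"
    return "Severe Divergence"


if __name__ == "__main__":
    # Hypothetical score vectors from four models.
    for scores in ([2, 2, 2, 2], [2, 2, 2, 3], [2, 2, 3, 3], [1, 2, 3, 5]):
        print(scores, stratum(scores))
```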
Statistical tests:
- Spearman correlations on risk alignment: lower alignment for management experts (ρ ≈ 0.526) vs. tech experts (ρ ≈ 0.569) relative to AI baseline.
- Wilcoxon signed‑rank test confirmed significant divergence in paired risk ratings (W = 130.5, p < 0.01).
- Ordered Logit Model showed human evaluators more likely to assign higher ordinal risk (β = 0.65, p < 0.001).
Aggregation into Occupational Automation Index (OAI):
- Each DWA mapped through the Tech‑Risk dual‑factor mapping function (non‑linear caps/veto rules) to compute an action‑level automation probability.
- DWAs aggregated within occupations to yield occupational OAIs for the U.S. labor market (paper reports e.g., Data Scientists OAI ≈ 0.70).
Methodological choice: the authors intentionally kept raw LLM risk outputs separate from human institutional premiums to avoid conflating conceptually distinct sources of friction.
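The summary does not give the formula behind the aggregation of action‑level indices into an occupational OAI. As an illustration only, the sketch below contrasts two textbook aggregators: the arithmetic mean (no bottleneck) and the minimum (a strict, Leontief‑style bottleneck in which the least automatable action bounds the occupation). Neither is claimed to be the authors' method; the function name `aggregate_occupation`, the `mode` parameter, and the sample values are hypothetical.

```python
from statistics import mean


def aggregate_occupation(action_indices: list[float], mode: str = "mean") -> float:
    """Aggregate action-level automation indices into one occupational value.

    ASSUMPTION: illustrative aggregators only. "mean" treats actions as
    interchangeable; "min" treats the least automatable action as binding.
    """
    if not action_indices:
        raise ValueError("need at least one action-level index")
    if mode == "mean":
        return mean(action_indices)
    if mode == "min":
        return min(action_indices)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    # Hypothetical indices for one occupation's DWAs, one of them vetoed.
    indices = [1.0, 0.5, 0.0]
    print(aggregate_occupation(indices, "mean"))
    print(aggregate_occupation(indices, "min"))
```

The contrast shows why the aggregation rule matters: a single vetoed action leaves the mean largely intact but drives a strict bottleneck to zero.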

Implications for AI Economics

Forecasting: the paper's framing implies that exposure‑only forecasts overestimate near‑term substitution and that risk‑adjusted measures should better predict realistic adoption and short‑to‑medium‑term labor impacts; neither is tested against realized outcomes.
Labor reallocation and inequality:
- Non‑routine cognitive roles are particularly exposed to capability substitution, but realized substitution depends on risk regimes; some high‑exposure roles may still retain sizable tasks if risks are high.
- A Compliance Premium may preserve wages for roles that absorb legal/compliance risk (e.g., licensed professionals), potentially altering wage dynamics and inequality patterns.
Policy and regulation:
- Liability regimes, certification, and verification systems will materially affect adoption rates; reducing tail risks (through audits, warranties, explainability, better O&M) can lower the institutional premium and increase automation.
- Policies that alter the cost of risk (e.g., malpractice rules, insurance markets) will shift OAIs and hence labor market outcomes—worth modeling explicitly in CGE and structural labor models.
Firm behavior and investment:
- Firms will balance technical capability and expected legal/safety costs; investments in risk‑mitigation (human oversight, multi‑stage verification, insurance) may determine which tasks are automated.
- Adoption is likely heterogeneous across industries and countries depending on regulatory stringency and liability allocation.
Modeling economics of AI:
- Macro models should incorporate non‑linear, task‑level risk constraints (vetoes and caps), not only capability penetration rates.
- The paper provides inputs and a framework for Computable General Equilibrium (CGE) modeling of dynamic wage elasticity and structural reallocation that account for risk‑dependent adoption.
Research agenda:
- Quantify how changes in regulation, liability insurance markets, verification tools, and improvements in model reliability reduce the Cognitive Risk Asymmetry over time.
- Extend cross‑country analyses to capture how legal/institutional differences alter the Compliance Premium and substitution paths.
- Study firm‑level strategies (reskilling, hybrid workflows) that mediate task encroachment under different risk regimes.

Overall, the paper reframes occupational automation as a dual problem of capability plus institutional risk, provides a concrete, validated task‑level measurement approach, and highlights policy levers (liability, verification, insurance) that will shape the realized labor impact of generative AI.

Assessment

Paper Type: descriptive
Evidence Strength: low. The paper produces cross-sectional exposure indices derived from LLM ensemble scores and expert judgement rather than observed causal impacts on employment, wages, or productivity; results are informative about potential vulnerability but do not establish causal effects or realized economic outcomes.
Methods Rigor: medium. Strengths include a fine-grained decomposition (2,087 DWAs across 923 occupations), use of an LLM ensemble, and a variance-based human-in-the-loop validation; weaknesses include reliance on algorithmic scoring that depends on rapidly evolving models, opaque aggregation assumptions (mathematical bottleneck), limited details about the expert panel size/selection, and lack of validation against realized labor-market outcomes.
Sample: Occupational decomposition covering 923 U.S. occupations mapped into 2,087 Detailed Work Activities (DWAs); LLM ensemble used to score technical feasibility and business risk for each DWA; variance-based human-in-the-loop validation with an expert panel (panel size and composition not reported); indices aggregated into Relative Occupational Automation Indices (OAI) for the U.S. labor market.
Themes: labor_markets, adoption, governance, human_ai_collab, skills_training
Generalizability:
- Cross-sectional exposure measures do not map directly to realized employment or wage outcomes.
- Results are calibrated to current-generation LLMs and will change as models and safety/compliance tools evolve.
- Expert panel judgments may be context-, country-, and institution-specific; panel details not reported.
- Aggregation and aggregation-weight choices (bottleneck math) impose assumptions that may not hold across sectors.
- U.S.-focused occupational taxonomy may not generalize to other institutional or regulatory environments.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Existing task-based evaluations predominantly measure theoretical "exposure" to AI capabilities, ignoring critical frictions of real-world commercial adoption: liability, compliance, and physical safety.	Automation Exposure	negative	high	theoretical automation exposure measurement practices	0.18
Occupations are not eradicated instantaneously, but gradually encroached upon via atomic actions.	Job Displacement	negative	high	process of occupational change / displacement	0.03
We introduce a Tech-Risk Dual-Factor Model that jointly scores technical feasibility and business risk to re-evaluate occupational exposure to LLMs.	Automation Exposure	positive	high	joint technical feasibility and business risk scores	0.09
We deconstructed 923 occupations into 2,087 Detailed Work Activities (DWAs).	Automation Exposure	neutral	high	coverage of occupations and DWAs used for analysis	n=923 2,087 DWAs 0.3
We utilize a multi-agent LLM ensemble to score both technical feasibility and business risk for DWAs.	Automation Exposure	positive	high	LLM-derived technical feasibility and business risk scores	0.18
Variance-based Human-in-the-Loop (HITL) validation with an expert panel demonstrates a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability.	Automation Exposure	negative	medium	difference between algorithmic probabilities and expert-assessed risk (institutional premium)	0.11
Using a strictly algorithmic baseline (mathematical bottleneck aggregation), we calculate Relative Occupational Automation Indices (OAI) for the U.S. labor market based on the DWA-level scores.	Automation Exposure	neutral	high	Relative Occupational Automation Index (OAI)	n=923 0.18
Non-routine cognitive roles highly dependent on symbolic manipulation (e.g., Data Scientists) face unprecedented exposure, with OAI ≈ 0.70.	Automation Exposure	positive	high	Relative Occupational Automation Index (OAI) for Data Scientists	OAI ≈ 0.70 0.18
Unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience to LLM-driven automation (i.e., very low OAI), quantifying a 'Cognitive Risk Asymmetry.'	Automation Exposure	negative	medium	Relative Occupational Automation Index (OAI) for unstructured physical trades and caretaking roles	0.11
These findings challenge the traditional Routine-Biased Technological Change (RBTC) hypothesis by showing substantial exposure among non-routine cognitive occupations.	Automation Exposure	mixed	medium	pattern of occupational exposure relative to RBTC predictions	n=923 0.11
We hypothesize the emergent necessity of a 'Compliance Premium,' indicating wage resilience increasingly tied to risk-absorption capacity.	Wages	positive	high	wage resilience tied to compliance/risk-absorption capacity	0.03