NLP Occupational Emergence Analysis: How Occupations Form and Evolve in Real Time -- A Zero-Assumption Method Demonstrated on AI in the US Technology Workforce, 2022-2026

Occupations form and evolve faster than classification systems can track. We propose that a genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary. This co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population. Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations and reveals a striking asymmetry for AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered. The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation. AI appears to be a diffusing technology, not an emerging occupation. We discuss whether introducing an "AI Engineer" occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.

Summary

Main Finding

The paper introduces a zero-assumption NLP method that defines an occupation as a bipartite "co-attractor" of people and vocabulary, and operationalizes tests for its emergence in resume data. Applied to 8.2 million US resumes (Aug 2022–Jan 2026), the method finds that AI exhibits a striking asymmetry: a tightly cohesive AI tooling vocabulary crystallized rapidly in early 2024, but the practitioner population never formed a mutually sustaining occupational cohort. In short, AI behaves like a diffusing technology absorbed across existing careers, not as a newly emergent occupation — though creating an “AI Engineer” occupational category could, in principle, catalyze population cohesion around the already-formed vocabulary.

Key Points

Co-attractor concept: an occupation is a mutually sustaining pair — a cohesive vocabulary and a cohesive population — each necessary for the other.
Push vs. pull: compression or algorithmic artifacts (“push”, e.g., frequent tokens like “ai”) can produce apparent topics, so they must be separated from genuine semantic co-occurrence among terms and genuine population similarity (“pull”).
Dual, symmetric tests:
- Vocabulary cohesion: permutation-tested co-occurrence on XᵀX (term-term) to detect pull beyond frequency expectations.
- Population cohesion: permutation-tested co-occurrence on XXᵀ (document-document) to detect groups of practitioners who are similar beyond focal terms.
Ablation: removing the candidate vocabulary from resumes and retesting population cohesion determines whether vocabulary is necessary for the observed population cluster (tests mutual dependence).
Trifactor NMF (X ≈ F S Gᵀ): used to independently identify document groups F (populations), term groups G (vocabularies), and the coupling S; large diagonal S[k,k] indicates strong within-topic population–vocabulary coupling.
Empirical AI result: three independent timelines show (1) rapid vocabulary lock-in (early 2024), (2) dissolution/absence of a cohesive AI practitioner population, and (3) population signals driven mainly by generic signaling terms rather than tooling vocabulary.
Validation: method recovers established occupations in held-out tests (paper reports validating on two known occupations).
Interpretation: AI is observed as a diffusing general-purpose technology rather than a self-contained emerging occupation; prior AI practitioner communities fragmented as AI tools became mainstream.

Data & Methods

Data
- Source: BOLD platform ecosystem resumes (users of resume-builder tools).
- Sample: 8.2 million US resumes, August 2022 – January 2026.
- Temporal resolution: monthly windows (42 months).
Preprocessing / representations
- Document-term matrix X (rows = resumes, columns = terms/skills/vocabulary tokens).
- Cosine similarity used for pairwise co-occurrence measures.
Push vs. pull filtration (compressionless co-occurrence test)
- Null construction: independently permute each column of X 200 times to preserve term frequencies but destroy co-occurrence structure.
- For each term pair (or document pair), compare observed similarity to the empirical permutation distribution; mark edges significant after Benjamini–Hochberg correction.
- Result: Boolean masks M_voc and M_pop that keep only edges driven by pull (co-occurrence beyond frequency).
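The permutation null above can be sketched for the term-term (vocabulary) side as follows. This is a minimal illustration, not the paper's implementation: the function names (`cosine_sim`, `benjamini_hochberg`, `pull_mask`), the one-sided test, the `alpha` level, and the add-one p-value estimate are all assumptions made here. The document-document mask would follow by applying the same idea to the rows of X.

```python
import numpy as np

def cosine_sim(X):
    """Cosine similarity between the columns (terms) of X."""
    norms = np.linalg.norm(X, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty columns
    Xn = X / norms
    return Xn.T @ Xn

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean array marking p-values rejected at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject

def pull_mask(X, n_perm=200, alpha=0.05, seed=0):
    """Hypothetical helper: symmetric Boolean mask of term pairs whose
    similarity exceeds what term frequencies alone would produce."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_terms = X.shape[1]
    observed = cosine_sim(X)
    exceed = np.zeros_like(observed)
    for _ in range(n_perm):
        # Permute each column independently: frequencies are preserved,
        # co-occurrence structure is destroyed.
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(n_terms)])
        exceed += cosine_sim(Xp) >= observed
    pvals = (exceed + 1.0) / (n_perm + 1.0)  # add-one estimate (assumption)
    upper = np.triu_indices(n_terms, k=1)    # test each pair once
    mask = np.zeros((n_terms, n_terms), dtype=bool)
    mask[upper] = benjamini_hochberg(pvals[upper], alpha)
    return mask | mask.T
```

With 200 permutations the smallest attainable p-value is 1/201, so the number of pairs that can survive correction is bounded by the permutation count; a larger `n_perm` would be needed for vocabularies with many candidate pairs.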
Validation of groups (hypergeometric density test)
- Given the pull-filtered graph, compute whether a candidate vocabulary or population group has more significant internal edges than expected by chance using a hypergeometric/urn model; report density ratio (>1 indicates over-connection).
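One way to read the urn model above: treat every node pair as a ball, the significant (pull) edges as marked balls, and the pairs inside the candidate group as the draw. The sketch below follows that reading using only the standard library; `hypergeom_sf` and `group_density` are hypothetical names, and the exact parameterization the paper uses is not stated in this summary. For graphs of realistic size, a library routine such as `scipy.stats.hypergeom` would be the practical choice.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n pairs are drawn without replacement from N pairs,
    K of which are significant."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def group_density(mask, group):
    """Density ratio and p-value for a candidate group in a pull-filtered graph.

    mask  : symmetric Boolean adjacency (nested lists or a NumPy array)
    group : iterable of node indices forming the candidate group
    """
    n_nodes = len(mask)
    members = sorted(set(group))
    all_pairs = n_nodes * (n_nodes - 1) // 2
    sig_total = sum(1 for i in range(n_nodes)
                    for j in range(i + 1, n_nodes) if mask[i][j])
    group_pairs = len(members) * (len(members) - 1) // 2
    sig_inside = sum(1 for a, i in enumerate(members)
                     for j in members[a + 1:] if mask[i][j])
    expected = group_pairs * sig_total / all_pairs
    ratio = sig_inside / expected if expected > 0 else float("nan")
    p_value = hypergeom_sf(sig_inside, all_pairs, sig_total, group_pairs)
    return ratio, p_value
```

A ratio above 1 means the group holds more significant internal edges than its share of pairs would predict; the p-value says how surprising that surplus is under random placement of the same number of edges.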
Topic/co-cluster discovery
- Trifactor NMF (X ≈ F S Gᵀ) to identify document clusters (F), vocabulary clusters (G), and coupling strengths (S).
- Use S to measure within-topic coupling; use ablation (zero-out vocabulary terms in X and retest population cohesion on XXᵀ) to test necessity of vocabulary for population coherence.
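The factorization X ≈ F S Gᵀ can be illustrated with plain multiplicative updates that minimize squared reconstruction error under non-negativity. This is a generic sketch: the summary does not say which tri-factor NMF variant, constraints, or initialization the paper uses, so the update rules, the square k × k shape of S, and the function name `trifactor_nmf` are assumptions.

```python
import numpy as np

def trifactor_nmf(X, k, n_iter=300, seed=0, eps=1e-9):
    """Approximate X (docs x terms) as F @ S @ G.T with non-negative factors.

    F : docs x k   (document/population memberships)
    S : k x k      (population-vocabulary coupling)
    G : terms x k  (term/vocabulary memberships)
    Returns the factors and the reconstruction error before each iteration
    plus the final error.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_docs, n_terms = X.shape
    F = rng.random((n_docs, k)) + eps
    S = rng.random((k, k)) + eps
    G = rng.random((n_terms, k)) + eps
    errors = []
    for _ in range(n_iter):
        errors.append(np.linalg.norm(X - F @ S @ G.T))
        # Each update multiplies by (negative gradient part) / (positive part),
        # which keeps every entry non-negative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors
```

Because the factors are only identified up to scaling and permutation, comparing entries of S across topics is meaningful only after normalizing the columns of F and G in some consistent way; that normalization is left out here.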
Multiple complementary diagnostics: concentration of F[:,k], cohesion on raw XXᵀ, significant vocabulary density on XᵀX, and ablation-driven changes.
Statistical thresholds and corrections reported (permutation repeats = 200; multiple testing control via Benjamini–Hochberg).
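The ablation step can be illustrated on a toy matrix: zero out the candidate vocabulary columns, then measure whether the group's documents are still similar to one another. In this sketch, mean pairwise cosine similarity stands in for the permutation-filtered population cohesion test described above; that substitution, and the helper names `ablate` and `mean_pairwise_cosine`, are assumptions for illustration only.

```python
import numpy as np

def ablate(X, vocab_cols):
    """Return a copy of X with the candidate vocabulary columns zeroed out."""
    Xa = np.array(X, dtype=float, copy=True)
    Xa[:, vocab_cols] = 0.0
    return Xa

def mean_pairwise_cosine(X, rows):
    """Average cosine similarity over all pairs of the selected documents."""
    sub = np.asarray(X, dtype=float)[rows]
    norms = np.linalg.norm(sub, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # documents left empty by ablation count as similarity 0
    unit = sub / norms
    sims = unit @ unit.T
    upper = np.triu_indices(len(rows), k=1)
    return float(sims[upper].mean())

# Toy data: documents 0-2 share terms 0 and 1 (the candidate vocabulary)
# and otherwise each use one unrelated term; document 3 is an outsider.
X = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
])
group = [0, 1, 2]
before = mean_pairwise_cosine(X, group)
after = mean_pairwise_cosine(ablate(X, [0, 1]), group)
print(f"cohesion before ablation: {before:.3f}, after: {after:.3f}")
```

In this toy case cohesion collapses once the vocabulary is removed, which is the pattern expected when the vocabulary is what binds the population. A group that stays cohesive after ablation is held together by something other than the candidate terms.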

Implications for AI Economics

Measurement and monitoring
- Existing taxonomies (SOC, O*NET) and five-year surveys lag true occupational dynamics; the paper demonstrates a scalable, high-frequency resume-based approach to detect nascent occupations or diffusion in near real time.
- The co-attractor tests separate buzzword-driven signals from genuine occupational formation — important for accurate labor market measurement and forecasting.
Labor market interpretation
- If AI is primarily diffusing across occupations rather than forming its own occupation, policies focusing on retraining should emphasize upskilling within existing career ladders (role-specific tool adoption, integration into domain tasks) rather than creating separate AI-only career tracks.
- Employer classification and hiring taxonomies that treat AI expertise as a cross-cutting skill (rather than a discrete occupational label) may better reflect current labor-market structure — unless active steps are taken to institutionalize a new occupation.
Policy and credentialing
- Introducing a formal occupational category (e.g., “AI Engineer”) — via standard occupational classification, credentialing programs, or industry hiring taxonomies — could act as a coordination mechanism and potentially catalyze population cohesion around the already-formed vocabulary (i.e., complete the co-attractor). Whether this is desirable depends on trade-offs (e.g., labor market segmentation vs. clearer training pipelines).
Workforce forecasting and education
- Forecasts that assume rapid emergence of a unified AI occupation may overstate labor reallocation; instead expect AI skill diffusion to alter task content and productivity within many occupations.
- Educational and training programs should prioritize embedding AI tool fluency into domain curricula and continuing education for incumbent workers rather than primarily funding standalone “AI degrees.”
Research and policy caveats
- The method relies on resume text (supply-side signal); employer-side uptake and job-posting dynamics matter for wages and demand. A combined analysis (resumes + job postings + employer surveys) would give a fuller picture.
- Data-source considerations: BOLD’s resume sample is described as broader than LinkedIn’s, but all resume platforms have selection biases that should be acknowledged when generalizing to the whole labor force.
- Ablation establishes necessity of vocabulary for population cohesion in the data, not causal mechanisms in the broader labor market.
Practical recommendation
- Statistical occupational monitoring systems (national statistical agencies, workforce boards, and research centers) should adopt dual-side co-occurrence tests (vocabulary and population) and ablation diagnostics to distinguish occupation formation from technology diffusion. For AI-specific policy, prioritize cross-occupation upskilling pathways while tracking whether institutional actions (classification, credentials) are producing emergent co-attractor dynamics.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Large-scale, time-resolved resume data (8.2M records) and an explicit testable mechanism (vocabulary vs population cohesion with ablation) provide substantive empirical support for the descriptive claim; however, the conclusions are based on proxies (resume text) and observational patterns that are open to alternative explanations (selection into resume datasets, resume-writing practices, platform coverage, offline/industry community formation), limiting causal certainty and external validity.
Methods Rigor: medium. The method is innovative and appears to include independent tests and ablation to probe mechanism, and it is applied to a large dataset; but rigor depends on choices not reported here (NLP preprocessing, cohesion metrics, thresholds, robustness checks, validation against ground truth), potential measurement error, and sample selection biases that are not fully addressed in the summary.
Sample: 8.2 million U.S. resumes collected 2022–2026 (textual resume fields used to extract professional vocabulary and infer practitioner identity); likely drawn from resume databases/job platforms or aggregated sources; includes temporal stamps enabling early-2024 vocabulary emergence detection.
Themes: labor_markets, adoption
Identification: Detect occupational 'co-attractors' by independently measuring (1) vocabulary cohesion, the degree to which a shared, distinctive professional vocabulary appears among resume texts, and (2) population cohesion, the extent to which a stable, bounded practitioner population co-occurs with that vocabulary; temporal tracking (2022–2026) identifies emergence, and ablation tests remove vocabulary signals to assess whether vocabulary is the mechanism binding the population.
Generalizability:
- Non-representative sample: resumes self-select and over/under-represent particular industries, job-seekers, geographies and demographics.
- U.S.-only data: findings may not hold in other labor markets or regulatory contexts.
- Platform/source bias: results depend on where resumes were collected (job boards, recruiters) and omit practitioners active off-platform (e.g., researchers on GitHub, internal teams).
- Resume signal limitation: occupational cohesion that forms offline or in non-resume artifacts (conferences, Slack, GitHub) may be missed.
- Temporal specificity: 2022–2026 captures a particular AI diffusion phase that may not generalize to other periods.
- Language and sector biases: method likely focused on English-language resumes and may miss domain-specific vocabularies.
- Measurement/design choices: cohesion metrics, NLP models and ablation implementations could affect results.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Occupations form and evolve faster than classification systems can track.	Adoption Rate	positive	high	speed of occupation formation / evolution relative to classification updates	n=8200000 0.18
A genuine occupation is a self-reinforcing structure (a bipartite co-attractor) in which a shared professional vocabulary makes practitioners cohesive as a group, and the cohesive group sustains the vocabulary.	Other	positive	high	conceptual definition of occupation formation (vocabulary ↔ population cohesion)	0.03
The co-attractor concept enables a zero-assumption method for detecting occupational emergence from resume data, requiring no predefined taxonomy or job titles: we test vocabulary cohesion and population cohesion independently, with ablation to test whether the vocabulary is the mechanism binding the population.	Adoption Rate	positive	high	ability to detect occupational emergence (via vocabulary cohesion and population cohesion metrics)	n=8200000 0.18
Applied to 8.2 million US resumes (2022-2026), the method correctly identifies established occupations.	Adoption Rate	positive	high	accuracy / correctness of detected occupations (established occupations identified)	n=8200000 0.18
For AI: a cohesive professional vocabulary formed rapidly in early 2024, but the practitioner population never cohered.	Adoption Rate	mixed	high	vocabulary cohesion (rapid formation) and population cohesion (absence of cohesion)	n=8200000 0.18
The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation.	Employment	negative	medium	population cohesion / absorption into existing careers (dissolution of standalone AI community)	n=8200000 0.05
AI appears to be a diffusing technology, not an emerging occupation.	Adoption Rate	negative	high	status of AI as technology diffusion versus occupation formation	n=8200000 0.18
Introducing an 'AI Engineer' occupational category could catalyze population cohesion around the already-formed vocabulary, completing the co-attractor.	Governance And Regulation	positive	high	potential for creating population cohesion (policy intervention effect)	0.03

A shared AI professional vocabulary crystallized quickly in early 2024, but practitioners never coalesced into a distinct occupation; instead, AI skills diffused into existing jobs rather than creating a new 'AI Engineer' class.