A transformer that treats patent classification codes as vocabulary tokens can forecast novel technological pairings years, sometimes decades, ahead by detecting collective shifts in how technologies are described, and it also improves representations for common patent analysis tasks.

Anticipating Innovation Using Large Language Models

Enrico Maria Fenoaltea, Filippo Santoro, Giordano De Marzo, Segun Taofeek Aroyehun, Andrea Tacchella · May 06, 2026

Source: arXiv · Paper type: correlational · Evidence strength: medium · Relevance: 7/10

TechToken, a transformer that treats IPC codes as tokens, detects collective shifts in patent language that predict first-time technological combinations years to decades in advance and yields better patent representations than existing models.

Forecasting innovation, intended as the emergence of new technological combinations, is a fundamental challenge for science and policy. We show that forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance. We show that this signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents. To this end, we introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification codes, as words in its vocabulary, learning the language of technologies by embedding these codes during fine-tuning. We define context similarity between code embeddings as a measure of linguistic convergence and show that it accurately predicts first technological combinations. TechToken also improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.

Summary

Main Finding

The authors show that impending technological combinations leave an early, detectable trace in the collective language of patents. By embedding IPC technology codes directly into a transformer model’s vocabulary (TechToken) and measuring “context similarity” (CS) between code embeddings, they can predict first-time co-occurrences of technologies (i.e., novel recombinations) years—often decades—before those combinations appear in patents. The signal is corpus-level and emergent (not attributable to individual documents or inventors). TechToken also yields state-of-the-art patent embeddings that improve performance on standard patent tasks.

Key Points

Innovation as recombination: treating innovation as a novel combination of existing technologies (IPC codes) turns forecasting the unprecedented into a structured link-prediction problem over the adjacent possible.
Linguistic convergence precedes technological convergence: the textual contexts of two technologies begin to resemble each other well before they co-occur in a patent.
TechToken method:
- Expands a language model’s vocabulary with tokenized IPC codes (technological tokens).
- Fine-tunes the LM so attention operates between patent words and IPC tokens, producing per-patent IPC-code embeddings that capture polysemy/context-specific meanings.
Context Similarity (CS):
- CS between two IPC codes is measured as the cosine similarity between their embeddings.
- For TechToken, CS is computed as the average similarity of the top 1% most similar embedding pairs (captures closest senses while reducing noise).
- CS is computed over 1-year windows (plots often use a 3-year sliding average).
Empirical performance:
- TechToken substantially outperforms baseline embedding strategies and other LMs (including much larger LLaMA 3.1 8B) at predicting first co-occurrences.
- Example AUC-ROC (class-imbalance 0.005%): TechToken BERT (IPC) ≈ 0.936; LLaMA Patents ≈ 0.856; BERT4Patents ≈ 0.725; fine-tuned BERT4Patents ≈ 0.765.
- On average, predictive signals are observable up to roughly 20 years before first co-occurrence.
Robustness checks:
- Use of a Chung–Lu random graph null model to compute expected co-occurrence counts and z-scores; high-z pairs are treated as realized innovations to focus on statistically significant novel combinations.
- Multiple training/test splits and a stricter out-of-sample test (using 2019–2023 to predict first co-occurrences in 2024) to avoid leakage.
Broader utility: TechToken embeddings improve downstream patent tasks (IPC classification, citation prediction, title–abstract matching), suggesting better semantic grounding from including structured labels in attention.

Data & Methods

Data:
- Patent corpus with IPC (International Patent Classification) labels and textual fields (abstracts, titles); temporal windows spanning at least 2006–2024 for various experiments.
- Focus on code pairs that had never co-occurred up to the training window, then checking first co-occurrence in future windows.
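
The candidate-pair selection described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the record layout `(year, codes)`, the function names, and the toy IPC codes and years are all hypothetical.

```python
from itertools import combinations

def first_cooccurrence_years(patents):
    """Map each unordered IPC pair to the earliest year it appears on one patent."""
    first = {}
    for year, codes in patents:
        for pair in combinations(sorted(set(codes)), 2):
            if pair not in first or year < first[pair]:
                first[pair] = year
    return first

def candidate_pairs(patents, train_end):
    """Pairs of codes seen by train_end that had not co-occurred by train_end."""
    first = first_cooccurrence_years(patents)
    seen = sorted({c for year, codes in patents if year <= train_end for c in codes})
    # A pair with no recorded co-occurrence is treated as "later than train_end".
    return [p for p in combinations(seen, 2) if first.get(p, train_end + 1) > train_end]

# Toy records: (filing year, IPC codes on the patent). Values are illustrative only.
patents = [
    (2006, ["A61K", "C07D"]),
    (2007, ["G06F", "H04L"]),
    (2010, ["A61K", "G06F"]),
    (2012, ["C07D", "H04L"]),
]
first = first_cooccurrence_years(patents)
candidates = candidate_pairs(patents, train_end=2008)
for pair in candidates:
    print(pair, "first co-occurs in", first.get(pair, "never"))
```

Pairs that already co-occurred inside the training window are dropped, so every remaining candidate is a possible first-time combination for the later test window.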
Embedding strategies:
- Average Embedding: compute patent text embeddings with a standard LM, then average embeddings of patents associated with an IPC code → single static embedding per IPC.
- TechToken: add IPC codes as tokens in LM vocabulary; during fine-tuning, IPC tokens appear alongside patent text, producing per-patent IPC token embeddings (captures context-dependent meaning).
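
To make the contrast between the two strategies concrete, here is a small sketch. It assumes the vectors have already been produced by some language model; the dictionaries, function names, and numbers are hypothetical stand-ins, and no actual model or fine-tuning is involved.

```python
from collections import defaultdict

def average_embedding(patent_vectors, patent_codes):
    """Static baseline: one vector per IPC code, the mean of its patents' text vectors."""
    sums, counts = {}, defaultdict(int)
    for pid, vec in patent_vectors.items():
        for code in patent_codes[pid]:
            if code not in sums:
                sums[code] = [0.0] * len(vec)
            sums[code] = [s + v for s, v in zip(sums[code], vec)]
            counts[code] += 1
    return {code: [s / counts[code] for s in total] for code, total in sums.items()}

def per_patent_embeddings(token_vectors):
    """TechToken-style view: keep every (patent, code) vector instead of averaging."""
    by_code = defaultdict(list)
    for (pid, code), vec in token_vectors.items():
        by_code[code].append(vec)
    return dict(by_code)

# Hypothetical patent text vectors and the IPC codes assigned to each patent.
patent_vectors = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}
patent_codes = {"p1": ["G06F"], "p2": ["G06F", "H04L"]}
static = average_embedding(patent_vectors, patent_codes)

# Hypothetical per-patent IPC token vectors, as a fine-tuned model might emit them.
token_vectors = {
    ("p1", "G06F"): [0.9, 0.1],
    ("p2", "G06F"): [0.2, 0.8],
    ("p2", "H04L"): [0.1, 0.9],
}
contextual = per_patent_embeddings(token_vectors)
print(static["G06F"], len(contextual["G06F"]))
```

The static baseline collapses `G06F` into a single point, while the per-patent view keeps one vector per context, which is what allows different senses of the same code to be compared separately.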
Measuring CS:
- For two IPC codes, collect all per-patent token embeddings from the relevant time window, compute pairwise cosine similarities, then average the top 1% highest similarities to define CS for that year.
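
The CS computation described above can be sketched in a few lines. This is an illustrative reading of the summary, not the authors' code: the function names are hypothetical, and keeping at least one pair when 1% of the pairs rounds to zero is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def context_similarity(embs_a, embs_b, top_fraction=0.01):
    """Mean of the top `top_fraction` pairwise cosine similarities between two codes."""
    sims = sorted((cosine(u, v) for u in embs_a for v in embs_b), reverse=True)
    # Assumption: always keep at least one pair, even for very small samples.
    k = max(1, math.ceil(top_fraction * len(sims)))
    return sum(sims[:k]) / k

# Toy per-patent embeddings for two codes within one time window.
embs_a = [[1.0, 0.0], [0.0, 1.0]]
embs_b = [[1.0, 0.0], [-1.0, 0.0]]
cs_top = context_similarity(embs_a, embs_b)                     # only the closest pair
cs_half = context_similarity(embs_a, embs_b, top_fraction=0.5)  # two closest pairs
print(cs_top, cs_half)
```

Averaging only the closest pairs means that two codes count as similar when at least some of their contexts overlap, even if most of their patents remain unrelated.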
Forecasting evaluation:
- Construct candidate IPC pairs that never co-occurred before the training window.
- Use CS from training window as classifier score to predict which pairs will become “realized innovations” in a later test window.
- Define “realized innovation” by ranking code-pair z-scores (observed minus expected co-occurrences under the Chung–Lu null, divided by the square root of the variance) and labeling the top-K pairs as positive (this controls class imbalance).
- Report AUC-ROC across time-shifted 5-year test windows and at different class-imbalance thresholds (e.g., 0.005%).
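
The evaluation loop above can be sketched as follows, under stated assumptions. The expected count uses the standard Chung-Lu form k_i k_j / 2m on a simple weighted co-occurrence graph, and the variance is taken to equal the mean (a Poisson-like simplification), since the summary does not specify the variance. All names, counts, and CS scores are hypothetical.

```python
import math
from itertools import combinations

def chung_lu_z_scores(observed, pairs):
    """z-score of each pair's co-occurrence count against a Chung-Lu expectation."""
    strength = {}
    for (a, b), n in observed.items():
        strength[a] = strength.get(a, 0) + n
        strength[b] = strength.get(b, 0) + n
    two_m = sum(strength.values())
    z = {}
    for i, j in pairs:
        expected = strength[i] * strength[j] / two_m
        # Assumption: variance = mean; the source only says "sqrt variance".
        z[(i, j)] = (observed.get((i, j), 0) - expected) / math.sqrt(expected)
    return z

def auc_roc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy test-window co-occurrence counts and training-window CS scores.
test_counts = {("A", "B"): 9, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 9}
candidates = list(combinations(["A", "B", "C", "D"], 2))
cs = {("A", "B"): 0.9, ("A", "C"): 0.4, ("A", "D"): 0.2,
      ("B", "C"): 0.95, ("B", "D"): 0.3, ("C", "D"): 0.7}

z = chung_lu_z_scores(test_counts, candidates)
top_k = 2  # number of pairs labeled as realized innovations
positives = set(sorted(candidates, key=lambda p: z[p], reverse=True)[:top_k])
labels = [1 if p in positives else 0 for p in candidates]
auc = auc_roc([cs[p] for p in candidates], labels)
print(labels, auc)
```

Choosing `top_k` fixes the share of positives directly, which is how the class-imbalance level can be set independently of how many pairs happen to co-occur.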
Baselines:
- BERT4Patents (base), BERT4Patents further fine-tuned, LLaMA 3.1 8B fine-tuned on patents (LLaMA Patents) using alternative embedding methods.
Additional tasks:
- Evaluate TechToken embeddings on IPC classification, citation prediction, and title–abstract matching vs. prior state-of-the-art.

Implications for AI Economics

Forecasting the adjacent possible: provides a practical, data-driven method to detect which technological recombinations are becoming feasible before they materialize—valuable for policymakers, industrial strategy, and R&D planning.
Early signals for policy and investment:
- Governments and firms could use CS-based monitoring to identify emerging cross-domain synergies and target early-stage support or investment.
- Helps prioritize infrastructure, standards, or regulation in domains where linguistic convergence signals rapid convergence risk/opportunity.
R&D portfolio and firm strategy:
- Firms can detect nascent complementarities and guide acquisitions, alliances, hiring, or internal R&D allocation toward technologies whose language contexts are converging.
Mapping technology trajectories and diffusion:
- CS trajectories provide quantitative measures of conceptual convergence and can be incorporated into models of technology diffusion, capability accumulation, and sectoral evolution.
Complement to network-based and bibliometric methods:
- TechToken’s language-grounded signal captures emergent semantic change not visible from co-occurrence networks alone; combining signals (text + network metrics + citations) should improve forecast robustness.
Methodological transfer:
- The approach (embedding symbolic classification codes as tokens) generalizes beyond patents—e.g., scientific taxonomies, medical codes, industry/product classifications—allowing forecasting in other knowledge domains.
Cautions and limitations:
- Patent coverage bias: not all innovations are patented; sectoral differences in patenting behavior can bias signals.
- IPC granularity: IPC codes are coarse; important micro-innovations inside codes may be missed; conversely, IPC assignment practices may introduce artifacts.
- Language-driven confounders: increases in CS may reflect discourse trends, regulatory events, or increased attention rather than underlying technological feasibility—causality is not guaranteed.
- Model and data vintage: embeddings depend on the training/fine-tuning data and cutoff dates; real-time deployment requires continual updates.
- Statistical thresholds and choice of top-k/top-% affect sensitivity/specificity; operational use requires calibration to policy or investment risk appetite.
Research opportunities:
- Integrate CS signals into economic models of innovation, endogenous growth, and industrial policy evaluation.
- Combine with firm-level, inventor-network, and market data to link early linguistic convergence to commercialization outcomes.
- Explore multi-source fusion (patents + publications + grants + news) to triangulate signals and reduce false positives.
- Use TechToken-style tokenization for domain-specific ontologies to forecast recombinations in science, medicine, and digital platforms.

Summary takeaway: embedding structured technology labels into transformer attention (TechToken) reveals a robust, corpus-level linguistic precursor to technological recombination that can meaningfully improve forecasting of novel innovations and has actionable implications for economic policy, corporate strategy, and further research in the economics of innovation.

Assessment

Paper Type: correlational
Evidence Strength: medium. The paper provides strong predictive/associational evidence that embeddings of IPC codes contain early signals of forthcoming technology combinations, validated with out-of-sample forecasting and downstream task improvements; however, it does not establish causal mechanisms (e.g., why language shifts precede combinations), and results may be sensitive to patenting practices, classification changes, and other confounders.
Methods Rigor: medium. The authors introduce a plausible, state-of-the-art modeling approach (transformer-based embeddings of IPC codes), benchmark against SOTA on multiple patent tasks, and perform checks to rule out single-inventor effects, which indicates solid methodological work; but the rigor will depend on details not provided here (extent of out-of-time validation, robustness to IPC taxonomy drift, baseline selection, hyperparameter tuning, and controls for institutional/patenting confounds).
Sample: Large corpus of patents spanning multiple decades, each carrying International Patent Classification (IPC) codes and text fields; TechToken is fine-tuned on patent text with IPC codes added as vocabulary tokens; the predictive target is the first co-occurrence (novel combination) of IPC codes, with forecasting horizons up to decades and evaluation also performed on several patent-related downstream tasks.
Themes: innovation adoption
Generalizability:
- Restricted to innovations disclosed via patents; non-patented innovation (open-source, tacit knowledge, academic-only advances) is excluded.
- Potential jurisdictional bias if the corpus is limited to USPTO/EPO/JPO patents (practices and topics vary by office and country).
- The International Patent Classification (IPC) taxonomy evolves over time; the model may pick up classification changes rather than genuine technological convergence.
- Fields with low patenting intensity (e.g., some services, social innovations) are underrepresented, biasing results toward patent-intensive sectors.
- The signal may reflect shifts in descriptive language or patenting strategy (e.g., strategic claiming) rather than true underlying technological recombination.
- The aggregate, corpus-level signal may not generalize to firm- or inventor-level forecasting without additional validation.

Claims (5)

Claim	Category	Direction	Confidence	Outcome	Details
Forthcoming combinations leave an early trace in the collective language of patents, with predictive signals detectable even decades in advance.	Innovation Output	positive	high	prediction of first occurrence of new technological combinations	0.3
The predictive signal is not attributable to any single inventor, but emerges as a collective shift in how technologies are described across thousands of patents.	Innovation Output	negative	medium	attribution of predictive linguistic signal to individual inventors versus collective patent-language change	0.18
We introduce TechToken, a transformer-based model that treats technologies, classified by International Patent Classification (IPC) codes, as words in its vocabulary, learning the language of technologies by embedding these codes during fine-tuning.	Other	positive	high	ability to represent IPC-coded technologies as embeddings (model representation quality)	0.3
Context similarity between code embeddings, defined as a measure of linguistic convergence, accurately predicts first technological combinations.	Innovation Output	positive	high	accuracy of predicting first joint occurrence (combination) of IPC codes / technologies	0.3
TechToken improves general representation quality, outperforming state-of-the-art models across different patent-related tasks.	Other	positive	high	downstream task performance / representation quality on patent-related tasks	0.3