Translators have been converted into AI's unpaid data capital: their aligned translations fuel modern machine translation and LLM training, yet contract terms and copyright interpretations treat these outputs as mined data rather than creative labour, leaving translators unrecognised and uncompensated. The paper maps the data-supply chain across translators, LSPs, platforms and model developers and proposes legal and design directions to redistribute value.

Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data

Masaru Yamada · May 24, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

The paper argues that translators' renditions—captured in translation memories and parallel corpora—have been transformed into high-value supervised training data for machine translation and LLMs while legal and contractual regimes strip translators of recognition and remuneration, a process the author terms 'appropriation without consumption' and 'invisible teacherisation.'

This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law -- losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features -- a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.

Summary

Main Finding

Professional translators’ work, especially translation memories and parallel corpora, has been systematically converted into high-value supervised training data for machine translation and multilingual LLMs. This conversion operates through layered commercial and legal processes that strip translators of moral, economic, and pedagogical recognition. The paper names and analyses two central dynamics: (1) appropriation without consumption (statistical mining of works, legitimated by copyright limitations that treat training as a “non-enjoyment” use), and (2) the invisible teacherisation of translators (translators functioning as unrecognised teachers of AI through TM construction, post-editing, and quality assessment).

Key Points

Translation memory (TM) is dual-purpose: a productivity tool for translators and a machine-readable database of input–output pairs ideally suited to supervised learning. That institutional duality is the origin of much current translator anxiety.
Historical/technical role: parallel corpora and translation tasks were central to SMT and NMT, and the Transformer architecture was developed and benchmarked in machine-translation contexts. Translation data remain important for multilingual capability and instruction tuning in LLMs.
Data-supply chain (four tiers):
- Translators → Language Service Providers (LSPs): contractual transfers of reuse rights; fuzzy-match discounts suppress pay.
- LSPs → Platforms / in-house engines: LSPs may train proprietary MT on accumulated TM or sell data.
- Open web → Model developers: crawled parallel corpora (OPUS, ParaCrawl, CCMatrix, etc.) used in pre-training without contractual linkage to individual translators.
- Post-editing: ongoing human correction provides high-quality paired data (machine output → human correction), but this pedagogical labour is unremunerated and unrecognised.
Copyright mismatch: copyright protects concrete renderings but not stylistic patterns, decision tendencies, or accumulated judgement—the very features AI learns. This creates an extraction zone poorly addressed by existing IP law.
Comparative legal regimes:
- Japan (Article 30-4): permits “information analysis” (non-enjoyment) uses for AI training without a statutory opt-out or transparency obligations; a regime labelled “especially extractive” in practice.
- European Union: a TDM exception that rights holders can opt out of, plus emerging transparency and training-data summary obligations under the AI Act, which give rights holders more institutional leverage.
- United States: relies on fair use and case law (e.g., Authors Guild v. Google; ongoing litigation such as Andersen v. Stability AI); outcomes remain uncertain and depend on litigation.
Conceptual contributions:
- Appropriation without consumption: extraction of statistical features from works that are never “consumed” (read, viewed, or listened to), thereby bypassing classical infringement logics.
- Invisible teacherisation: translators serve as de facto teachers of AI via TM creation, post-editing, and quality assessment but receive little recognition or compensation for that role.
Market dynamics: human-generated, high-quality parallel data command a premium, especially in the era of “model collapse”, when models trained on synthetic output risk degrading. This creates new rent-seeking and monetisation opportunities for data controllers, but typically not for the original labour providers.
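
The dual role of translation memory described in the key points can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than something specified in the paper: the `TMUnit` record, the use of Python's `difflib.SequenceMatcher` as a stand-in for a CAT tool's fuzzy-match scoring, and the discount bands in `RATE_BANDS`, which are hypothetical figures and not industry data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class TMUnit:
    """One translation-memory segment: an aligned source/target pair."""
    source: str
    target: str
    translator_id: str  # hypothetical attribution field, assumed for illustration


# Hypothetical discount bands: (minimum match score, share of full rate paid).
RATE_BANDS = [(1.00, 0.10), (0.85, 0.50), (0.75, 0.70), (0.00, 1.00)]


def fuzzy_score(new_source: str, tm_source: str) -> float:
    """Similarity in [0, 1]; a stand-in for a CAT tool's match percentage."""
    return SequenceMatcher(None, new_source, tm_source).ratio()


def best_match(new_source: str, memory: list[TMUnit]) -> float:
    """Highest similarity between a new segment and any stored source."""
    return max((fuzzy_score(new_source, u.source) for u in memory), default=0.0)


def rate_share(score: float) -> float:
    """Share of the full per-word rate paid for a given match score."""
    for threshold, share in RATE_BANDS:
        if score >= threshold:
            return share
    return 1.0


def to_training_pairs(memory: list[TMUnit]) -> list[dict[str, str]]:
    """The same TM viewed as supervised data; attribution is dropped here."""
    return [{"input": u.source, "output": u.target} for u in memory]


memory = [
    TMUnit("Press the power button.", "電源ボタンを押してください。", "t-001"),
    TMUnit("Remove the battery cover.", "バッテリーカバーを外してください。", "t-001"),
]

exact = best_match("Press the power button.", memory)
unseen = best_match("Quarterly revenue rose sharply.", memory)
pairs = to_training_pairs(memory)
```

In this sketch the same stored segments do two jobs: `rate_share` lowers the fee for a segment that already matches the memory, while `to_training_pairs` exports those segments as input/output examples with the translator identifier removed.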

Data & Methods

Methods are qualitative, legal-analytical, and historical-technological:
- Technical history reconstruction: tracing SMT → NMT → Transformer → LLM developments with literature references (e.g., Bahdanau et al. 2015; Wu et al. 2016; Vaswani et al. 2017; Radford et al. 2018; Zhu et al. 2025).
- Institutional/data-supply-chain analysis: decomposition into four tiers from translators to model developers, highlighting contractual and market practices (LSP contracts, fuzzy-match economics, platform scraping).
- Comparative legal analysis: close reading of Japan’s Article 30-4, EU DSM Directive and AI Act obligations, and the U.S. fair-use framework; consideration of jurisprudence and policy statements (e.g., Agency for Cultural Affairs (Japan), Society of Authors, FIT).
- Use of secondary empirical signals: professional association surveys (e.g., Society of Authors 2024; FIT 2023) and ongoing litigation examples (Andersen v. Stability AI; NYT v. OpenAI; Getty Images v. Stability AI) to illustrate labour-market and legal stakes.
No new quantitative dataset is produced; the argument synthesises prior empirical reports, legal texts, and technical literature into a political-economy account.

Implications for AI Economics

Data-as-capital and labour extraction: Translation labour has been capitalised into durable data assets. This converts one-off labour payments into ongoing asset value for LSPs and model developers, raising returns to capital and reducing the labour share in an important segment of language services.
Market power & rent capture: Entities that aggregate TMs (LSPs, platforms) capture premium rents from proprietary engines or data sales. Because translators often contract away reuse rights, bargaining power is weak, allowing downstream actors to internalise gains from the same labour input.
Contract design & pricing: Existing industry pricing (fuzzy-match discounts, one-time delivery fees) undervalues the long-lived training value of translations. Pricing and contracting need redesign (e.g., licensing terms that account for training value, royalties, or data-sharing payments).
Regulatory design matters: Jurisdictional differences (Japan permissive; EU opt-out/transparency; U.S. litigation-driven) shape the economics of data acquisition. Stronger opt-out, transparency, and provenance rules increase transaction costs for developers and create leverage for data providers, potentially shifting rents back toward labour.
Labour policy and collective action: Trade-union organising or sectoral bargaining among translators can target the LSP tier to secure better terms (contractual clauses forbidding repurposing, post-editing premiums, data-rights retention). Collective bargaining is a practical lever because contractual transfers of reuse rights typically occur at this tier.
Market for high-quality human data: The premium on human-curated parallel data—especially post-edited pairs—creates opportunities for new market structures (paid licensing platforms, certification of human-generated training corpora). Without redistributive mechanisms, however, incumbents capturing aggregated corpora will reap most gains.
Redistribution & governance solutions suggested by the paper (economics-relevant):
- Contractual innovations: clauses that retain translators’ rights for ML use or specify remuneration when translations are used for training.
- Transparency/provenance systems: mandatory training-data manifests (as in parts of EU policy) to enable claims and bargaining.
- Licensing frameworks: standardized licensing for human-generated language data with revenue-sharing or micropayment architectures.
- Recognition and remuneration for post-editing: treating post-editing as paid pedagogical labour with explicit compensation and attribution.
- Broader IP and data-governance reform: rethinking how copyright limitations and exceptions treat “non-enjoyment” analysis uses, to rebalance incentives between labour and capital.
Broader economic effect: Without reforms, AI development will continue to internalise human linguistic labour as a low-cost input, reinforcing concentration among downstream AI firms and LSP aggregators while shrinking incomes and career viability for translators, and turning a cultural and creative profession into a source of extracted data inputs.
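
The transparency and revenue-sharing directions listed above can be illustrated with a minimal sketch of how a training-data manifest might support pro-rata payments. The manifest fields (`segment_id`, `translator_id`, `licence`), the `revenue_shares` helper, and the pool of 1000 units are all hypothetical; the paper does not prescribe a schema or an allocation rule, and per-segment counting is only one of many possible bases.

```python
from collections import Counter
from fractions import Fraction


def revenue_shares(manifest: list[dict], pool: Fraction) -> dict[str, Fraction]:
    """Split `pool` across translators in proportion to segments contributed."""
    counts = Counter(entry["translator_id"] for entry in manifest)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Fractions keep the split exact, so the shares always sum to the pool.
    return {tid: pool * Fraction(n, total) for tid, n in counts.items()}


# Hypothetical manifest: one record per segment used in training.
manifest = [
    {"segment_id": "s1", "translator_id": "t-001", "licence": "training-permitted"},
    {"segment_id": "s2", "translator_id": "t-001", "licence": "training-permitted"},
    {"segment_id": "s3", "translator_id": "t-001", "licence": "training-permitted"},
    {"segment_id": "s4", "translator_id": "t-002", "licence": "training-permitted"},
]

shares = revenue_shares(manifest, Fraction(1000))
```

The allocation depends entirely on the manifest recording who produced each segment, which is why provenance comes before any revenue-sharing or micropayment scheme.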

Assessment

Paper Type: descriptive
Evidence Strength: n/a. The paper is a qualitative, conceptual and legal analysis rather than an empirical study testing causal hypotheses; it does not present causal identification or quantitative estimation.
Methods Rigor: medium. Employs careful comparative legal reading, historical tracing of machine translation technologies, and a detailed data-supply-chain description and conceptual innovation (two new concepts). However, it lacks systematic empirical data (e.g., representative interviews, transaction-level datasets, or quantitative measurement of value flows) that would strengthen claims about scale and economic impact.
Sample: Qualitative materials: comparative statutory and case-law texts (Japanese, EU, US copyright frameworks including Article 30-4 of the Japanese Copyright Act), literature on translation memories, SMT/NMT/Transformer and multilingual LLMs, industry practices across translators, language service providers (LSPs), platforms, and model developers; illustrative examples/case studies rather than a structured dataset or representative survey.
Themes: labor_markets, governance, inequality, innovation, human_ai_collab
Generalizability:
- Legal and institutional findings are jurisdiction-specific (focus on Japan, Europe, United States) and may not hold in other legal systems.
- Focuses narrowly on translators and translation memories; implications for other creative professions or data types are suggestive but not empirically established.
- Industry heterogeneity (size of LSPs, market segment, language pairs, public vs proprietary models) limits uniform application of conclusions.
- Rapid evolution of models, market practices, and licensing regimes may change relevance over time.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation.	Research Productivity	positive	high	value of translation data as supervised training inputs for MT	0.3
The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of translation data (TM/parallel corpora).	Innovation Output	positive	high	dependence of major MT/LLM advances on accumulated translation data	0.18
Translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as 'information analysis' data under copyright law, resulting in the loss of moral, creative, and economic attribution to the translators who produced them.	Labor Share	negative	high	loss of attribution and economic recognition for translators	0.18
Article 30-4 of the Japanese Copyright Act legitimates a mode of use the paper terms 'appropriation without consumption', i.e., mining works for statistical features rather than reading or experiencing them.	Governance And Regulation	positive	high	legal legitimation of non-experiential mining of copyrighted works	0.3
Translators have functioned as 'invisible teachers' of AI (through the construction of translation memories, post-editing, and quality assessment) without recognition as teachers of models.	Labor Share	negative	high	lack of recognition/attribution for contributors who effectively trained AI	0.18
There exists a data supply chain that runs from individual translators through language service providers (LSPs) and platforms to model developers.	Market Structure	positive	high	structure and flow of translation data across actors	0.3
Human-generated translation data has acquired a premium status in the era of model collapse, increasing its value to model developers.	Firm Revenue	positive	medium	market valuation/premium of human-generated data for models	0.02
Comparative analysis of Japanese, European, and United States legal frameworks shows differing treatments of translation data and points toward the need for redistributive design to remedy unequal attribution and capture.	Governance And Regulation	mixed	high	policy/regulatory implications and proposals for redistributive design	0.18