A clear, technical boundary—open pre-training artifacts versus proprietary post-training weights—could break the logjam in industry–academia ML deals; adopting the PBOS template would keep scientists at the negotiating table and streamline collaborations.

Position: The Pre/Post-Training Boundary Should Govern IP in Industry-Academia ML Collaborations

Dirk Bergemann, Soheil Ghili, Nitzan Mekel-Bobrov · May 21, 2026

arXiv · Type: commentary · Evidence: n/a · Relevance: 7/10

Proposes PBOS, a standard contract template that treats pre-training artifacts as open science and post-training weights as proprietary business IP, to resolve incentive conflicts that stall industry–academia ML collaborations.

Industry-academia ML collaborations routinely fail to launch -- not for scientific reasons, but because academics must publish while companies must protect models trained on proprietary data, and no standard contract framework resolves this tension. Because contracts are negotiated by legal departments alone, many apparent legal disputes are incentive misalignment problems that only scientists at the table can correctly diagnose. We propose PBOS (Protect-the-Business / Open-Source-the-Science), a community-adoptable contract template anchored to a single technically-grounded boundary: pre-training artifacts (architectures, training code, benchmarks, untrained weights) are open science; post-training artifacts (weights trained on proprietary data) are business IP. This boundary is technically meaningful, legally clean, and auditable -- and could not have been drawn correctly without scientists at the negotiating table. We argue the ML community should adopt PBOS as its default contract for such collaborations.

Summary

Main Finding

The authors propose PBOS (Protect-the-Business / Open-Source-the-Science): a contract template for industry–academia ML collaborations that uses a single, technically grounded rule—the pre/post-training boundary—to allocate intellectual property. Pre-training artifacts (architectures, training code, benchmarks, untrained weights, papers) are explicitly open science; post-training artifacts (weights trained on proprietary company data) are company IP, with universities granted a narrow research license for evaluation and publication. PBOS is argued to be technically meaningful, legally clean, auditable, and to materially reduce failed collaborations caused by incentive misalignment.

Key Points

Problem diagnosed: systematic incentive misalignment. Academia must publish; companies must protect assets derived from proprietary data. Without a standard boundary, contracts stall or default to broad, ambiguous ownership claims.
The pre/post-training boundary:
- Pre-training artifacts do not contain signals from proprietary data and thus can be released.
- Post-training artifacts encode patterns from proprietary data, which exposes them to membership-inference and reconstruction attacks, so they should remain under company control.
- The criterion is exposure to proprietary data, not artifact form or author.
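The exposure criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the PBOS template's own wording: the `Artifact` class, its fields, and the artifact names are hypothetical, and the transitive rule (an artifact derived from an exposed artifact is itself treated as exposed) is a conservative assumption about edge cases that the paper says scientists must adjudicate.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Artifact:
    """Hypothetical record of one collaboration artifact."""
    name: str
    # True if producing this artifact directly used proprietary company data.
    trained_on_proprietary: bool = False
    # Upstream artifacts this one was derived from (e.g. a base checkpoint).
    parents: list[Artifact] = field(default_factory=list)


def is_post_training(artifact: Artifact) -> bool:
    """Exposed to proprietary data directly or through any ancestor (assumed rule)."""
    if artifact.trained_on_proprietary:
        return True
    return any(is_post_training(parent) for parent in artifact.parents)


def classify(artifact: Artifact) -> str:
    """Apply the boundary: exposure decides, not artifact form or author."""
    return "business IP" if is_post_training(artifact) else "open science"


architecture = Artifact("architecture")
untrained = Artifact("untrained_weights", parents=[architecture])
public_ckpt = Artifact("public_pretrained_checkpoint", parents=[untrained])
finetuned = Artifact(
    "finetuned_on_company_logs",
    trained_on_proprietary=True,
    parents=[public_ckpt],
)
# Edge case: never saw company data directly, but inherits exposure.
distilled = Artifact("student_distilled_from_finetuned", parents=[finetuned])

for item in (architecture, untrained, public_ckpt, finetuned, distilled):
    print(f"{item.name}: {classify(item)}")
```

The two weight files `public_pretrained_checkpoint` and `finetuned_on_company_logs` have the same form but land on opposite sides of the boundary, which is the point of keying the rule to data exposure.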
PBOS operationalized via three pillars:
- Explicitly define “The Science” (enumerate artifact classes) with scientists involved in drafting.
- Protect the business: models trained on company data are company property; the university gets a narrow research license (evaluate, produce results, publish) but no distribution or commercialization rights.
- Open-source the science: contractually commit to releasing pre-training artifacts under a permissive license.
Practical mechanics:
- Training on company infrastructure; trained weights do not leave except under a controlled research license.
- Short company review window (e.g., 30 days, with limited patent-delay option) to check for inadvertent proprietary leakage before publication.
- Short, simple contracts reduce ambiguity and negotiation friction.
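The review-window mechanic above amounts to simple date arithmetic, sketched here in Python. The 30-day figure is the example given in the text; the function name, the `patent_delay_days` parameter, and the 60-day delay used in the example are hypothetical, since the summary does not state how long the patent-delay option may run.

```python
from datetime import date, timedelta

# Example review window from the text; a real contract would set its own value.
REVIEW_WINDOW_DAYS = 30


def earliest_publication_date(submitted: date, patent_delay_days: int = 0) -> date:
    """Earliest date a manuscript may be published after company review.

    `patent_delay_days` models the optional, limited patent delay; its
    length is an assumption here, not a figure from the paper.
    """
    if patent_delay_days < 0:
        raise ValueError("patent_delay_days must be non-negative")
    return submitted + timedelta(days=REVIEW_WINDOW_DAYS + patent_delay_days)


submitted = date(2026, 1, 15)
print(earliest_publication_date(submitted))                        # 2026-02-14
print(earliest_publication_date(submitted, patent_delay_days=60))  # 2026-04-15
```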
Necessity of researcher participation: legal teams cannot reliably identify which artifacts encode proprietary information; scientists are needed to determine where the boundary falls (including fine-tuning, transfer learning, multi-stage pipelines).
PBOS codifies a norm already used intra-firm (companies publish architectures/methods but keep trained weights), and makes it portable across institutions.
Expected benefits: fewer stalled collaborations, broader access to problems requiring proprietary behavioral data, reduced transaction costs, and a partial correction of a scientific “market failure” where valuable domains go unexplored by academia.

Data & Methods

Paper type: position / policy proposal (working paper). No new empirical dataset or econometric analysis presented.
Evidence and argumentation:
- Motivating case: a 2024 collaboration delayed for months by legal negotiation over publishability versus IP (used as an illustrative example).
- Conceptual and technical reasoning about information content of artifacts (trained weights vs untrained artifacts), auditability (transfer logs, repository records), and risks (membership inference, reconstruction).
- Institutional examples: intra-firm practice cited (AlphaGo, PaLM, GPT-3) to show the boundary already works within firms.
- Implementation materials: PBOS contract template, clause pack, and guidance are made available on GitHub (link in paper).
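The auditability argument in the list above (transfer logs, repository records) can be made concrete with a small check. This is a sketch under stated assumptions: the `Transfer` record, its fields, and the sample log are invented for illustration and do not come from the PBOS clause pack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    """Hypothetical transfer-log entry for one artifact movement."""
    artifact: str
    post_training: bool      # was the artifact exposed to proprietary data?
    destination: str         # "company" or "university" in this sketch
    research_license: bool   # was the move covered by the research license?


def find_violations(log: list[Transfer]) -> list[Transfer]:
    """Flag post-training artifacts that left the company without a license."""
    return [
        entry
        for entry in log
        if entry.post_training
        and entry.destination != "company"
        and not entry.research_license
    ]


log = [
    Transfer("training_code", False, "university", False),
    Transfer("trained_weights_v1", True, "university", True),
    Transfer("trained_weights_v2", True, "university", False),
]

for entry in find_violations(log):
    print(f"violation: {entry.artifact} -> {entry.destination}")
```

Only `trained_weights_v2` is flagged: pre-training artifacts move freely, and post-training weights may leave when the research license covers the transfer.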
Methodological limitations:
- No large-sample empirical evaluation of PBOS adoption or its effects on collaboration rates or welfare.
- Policy/legal analysis is descriptive and prescriptive rather than statutory—interaction with jurisdictional IP law, trade-secrets doctrine, and employment agreements needs operational legal testing.
- Some edge cases (e.g., how much fine-tuning leaks proprietary distributional information) require technical adjudication and may be contested.

Implications for AI Economics

Incentives and market failure:
- PBOS aims to lower bargaining/transaction costs that deter socially valuable collaborations—especially in domains requiring proprietary behavioral or production logs—potentially correcting an underinvestment in socially valuable research.
- By partitioning what must remain private (trained models) vs what must be public (methods, code, benchmarks), PBOS shifts the distribution of rents: firms keep commercial advantage from data-trained models, while the broader scientific community benefits from methodological diffusion.
Knowledge diffusion and competition:
- Greater release of architectures and training pipelines should accelerate methodological progress and replication across academia and industry.
- Retention of trained weights by firms preserves firm-specific competitive moats tied to proprietary data and compute; effects on market concentration depend on how much downstream competition requires access to trained weights versus architectures and training recipes.
Labor and talent flows:
- Easier collaboration and clearer IP rules may increase cross-sector mobility and co-authorship, affecting compensation and signals for academic and industry researchers.
Empirical research agenda for economists:
- Measure whether PBOS-style defaults reduce negotiation time, increase collaboration incidence, and expand academic work in data-dependent domains.
- Quantify welfare gains from additional academic access to problems involving proprietary behavioral data.
- Analyze dynamic effects: does a clearer boundary preserve firm incentives to invest in data collection, or do increased methodological spillovers reduce investment? How does the answer depend on data exclusivity versus algorithmic innovation?
- Study heterogeneous firm responses: do smaller firms accept PBOS less readily than large incumbents? Are there industry sectors where PBOS is more valuable?
Potential risks and trade-offs:
- Narrow research licenses may still limit academic evaluation and external replication if access to trained models is too constrained—careful design of the license scope matters.
- Legal and regulatory frictions across jurisdictions (trade secrets, data-protection law) could complicate PBOS implementation.
- Adversarial or membership-inference attacks could still leak private information indirectly; technical safeguards and audit protocols will be necessary.
Policy recommendation for practitioners and economists:
- Institutions should pilot PBOS-style templates, collect data on negotiation times and collaboration outcomes, and enable researchers to participate in drafting ownership definitions.
- Economists should evaluate pilots to estimate net welfare effects, distributional impacts, and long-run innovation dynamics.

References and implementation materials are provided in the working paper (including a GitHub repo with clause packs and guidance).

Assessment

Paper Type: commentary
Evidence Strength: n/a. The manuscript is a conceptual policy/proposal without empirical tests or causal identification; claims are plausibility arguments and anecdotes rather than measured evidence.
Methods Rigor: n/a. No empirical or formal methods are deployed; this is a normative contract proposal and argumentation rather than an analysis requiring methodological rigor.
Sample: No empirical sample or dataset; the paper presents a conceptual contract template (PBOS) and argues from technical, legal, and incentive considerations, apparently drawing on authors' experience and illustrative examples rather than systematic data.
Themes: governance, adoption, innovation
Generalizability:
- Varies by legal jurisdiction: IP law and contract enforcement differ across countries and may limit applicability.
- Different industry business models: firms with different incentives (e.g., pure-play research labs, startups, regulated industries) may not fit the PBOS split.
- Model and data diversity: distinctions between pre-/post-training artifacts may blur for certain architectures (fine-tuning, continual learning, transfer learning) or for models trained on mixed proprietary/public data.
- Licensing and open-source norms: existing license ecosystems and community norms may conflict with PBOS provisions.
- Security and safety considerations: safety-sensitive models (e.g., dual-use risks) may require stricter access controls than PBOS envisions.
- Organizational practices: legal teams, institutional review boards, and research offices may resist or reinterpret the template.
- Lack of empirical validation: effectiveness in reducing negotiation time and increasing collaborations is untested and may vary across contexts.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Industry-academia ML collaborations routinely fail to launch.	Adoption Rate	negative	high	success rate of launching industry-academia ML collaborations	0.03
These failures are not for scientific reasons, but because academics must publish while companies must protect models trained on proprietary data, and no standard contract framework resolves this tension.	Organizational Efficiency	negative	high	incentive alignment between academic publication requirements and company IP protection	0.01
Because contracts are negotiated by legal departments alone, many apparent legal disputes are incentive misalignment problems that only scientists at the table can correctly diagnose.	Organizational Efficiency	negative	high	quality of contract negotiations / correct diagnosis of incentives in disputes	0.01
PBOS: pre-training artifacts (architectures, training code, benchmarks, untrained weights) are open science; post-training artifacts (weights trained on proprietary data) are business IP.	Governance And Regulation	positive	high	classification of artifact ownership under collaborative contracts	0.01
This boundary (pre-training open / post-training proprietary) is technically meaningful, legally clean, and auditable.	Governance And Regulation	positive	high	technical and legal clarity/auditability of artifact boundary	0.01
The boundary could not have been drawn correctly without scientists at the negotiating table.	Governance And Regulation	positive	high	appropriateness/correctness of contractual artifact boundaries	0.01
The ML community should adopt PBOS as its default contract for such collaborations.	Governance And Regulation	positive	high	community adoption of PBOS as default contracting practice	0.01