A decentralised, participatory approach to AI—combining many small contributor models—outperforms larger monolithic LLMs by up to 15.4% on reasoning and factuality benchmarks. Gains scale with contributor diversity and produce emergent capabilities that solve problems individual models cannot.

Scaling Participation in Modular AI Systems

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov · June 05, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Composing many small, contributor-trained models into participatory modular systems yields up to 15.4% better performance than monolithic LLMs across 15 benchmarks, with diversity among contributors driving improvements and enabling emergent problem-solving abilities.

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

Summary

Main Finding

Composing many specialized, independently contributed language models into modular, participatory AI systems (“scaling participation”) yields substantial performance gains over monolithic LLMs. With 61 academically contributed models (default pools of up to 32), diverse model-collaboration algorithms produce systems that outperform large monolithic baselines by up to ~15% (and by ~17% vs. a 405B dense model), with additional benefits in representing diverse human values and emergent problem solving.

Key Points

New paradigm: “scaling participation” — build modular AI systems bottom-up from many small models contributed by diverse stakeholders, rather than a single centralized monolith.
Pool and scale: solicited 61 academic language models (top 32 used for most experiments). Systems were built with 2, 4, 8, 16, and 32 models to test scaling effects.
Collaboration methods: 14 algorithms across three levels:
- API-level (routing, trained router, graph routing, switch generation)
- Text-level (multi-agent refine, multi-agent finetune, LLM blender, AggLM, Sparta alignment, heterogeneous swarms)
- Weight-level (model merging/soups, extrapolation, model swarms, Dare-Ties), with distillation to unify architecture when required.
Evaluation: 5 domains / 15 tasks (general QA, reasoning, knowledge, safety, instruction following) with held-out/dev/test splits and macro-average reporting.
Main quantitative outcomes:
- Performance generally increases with the number of contributing models (2 → 32 gave ~28.8% average improvement across collaboration systems).
- Best participatory systems outperformed monolithic baselines (including models much larger than the sum of contributed models) by up to 15.4% overall and ~17.15% vs Llama 3.1 405B on average.
- Domain-specific lifts vary: instruction following (+37.8%), safety (+15.9%), knowledge (+6.3%), reasoning (+5.2%), general QA (+3.5%).
- Diversity matters: holding pool size fixed at 8 models while increasing institutional, international, and model-distinctness diversity improved performance by ~13.8% on average (up to ~18–19% in some settings).
- Deeper, training-based integration methods (model swarms, Sparta, multi-agent finetuning) tended to yield the best results; shallow API-level routing degrades as pool size grows.
Additional benefits: better representation of cultural/values diversity (larger gains vs monolith on those tasks), emergent capability where the composite system solves >15% of problems on which all individual models fail, improved updateability, transparency and provenance.
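
The emergent-capability figure above can be read as a conditional success rate: among problems that every individual model gets wrong, what share does the composed system solve? The sketch below illustrates that reading on toy data. The function name `emergent_solve_rate`, the boolean per-problem correctness format, and the choice of denominator are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Dict, List


def emergent_solve_rate(individual: Dict[str, List[bool]], system: List[bool]) -> float:
    """Share of problems failed by every individual model that the system solves.

    `individual` maps a (hypothetical) model name to per-problem correctness;
    `system` holds per-problem correctness of the composed system.
    """
    n = len(system)
    if any(len(results) != n for results in individual.values()):
        raise ValueError("all result lists must cover the same problems")

    # Indices of problems that no individual model answers correctly.
    all_fail = [i for i in range(n)
                if not any(results[i] for results in individual.values())]
    if not all_fail:
        return 0.0  # nothing to recover, so the rate is defined as zero here

    solved = sum(1 for i in all_fail if system[i])
    return solved / len(all_fail)


# Toy data: problems 1 and 3 are failed by both models; the system solves problem 1.
individual = {
    "model_a": [True, False, False, False, True],
    "model_b": [False, False, True, False, False],
}
system = [True, True, True, False, True]
rate = emergent_solve_rate(individual, system)
print(f"emergent solve rate: {rate:.2f}")
```

Under this reading, a reported value above 15% would mean the system recovers more than 15 of every 100 problems that defeat all contributors individually.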

Data & Methods

Model sourcing: outreach to 236 researchers, curated 61 academic LMs (diverse architectures, domains, priorities); default experiments used the 32 top-performing models.
Distillation: where weight-level (parameter) merging required homogeneous architecture, each contributed model was distilled into a common student (Qwen-2.5-7B) to permit weight-level operations.
Collaboration protocols: implemented 14 methods spanning selection (routers), text-exchange (debate/refinement/aggregation), and weight-space blending/search (souping, extrapolation, swarm optimization). Used MoCo library to orchestrate experiments.
Tasks & datasets: AGIEval, ARC-challenge, MMLU-redux (QA); BigBench-Hard, GSM8k, MATH, Sciriff (reasoning); WikiDYK, PopQA, BLEND (knowledge); TruthfulQA, CocoNot (safety); AlpacaEval, Wildchat, Human Interest (instruction following). Macro-averaged domain scores reported.
Baselines: dense LLMs (Llama 3.1, Qwen 2.5 at multiple sizes), MoE models (Mixtral, Deepseek-MoE), participatory-but-merged system Marin; best single participating model per task was also compared.
Experimental design: model selection on held-out splits; dev/test splits for evaluation; scaling experiments across model counts and diversity dimensions; analysis of method rankings and domain-specific strengths.
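
To make the weight-level idea concrete: once contributed models share one architecture (here, after distillation into a common student), the simplest form of souping is an element-wise average of their parameters. The following is a minimal sketch of that basic technique in plain Python, assuming each model is represented as a dictionary from parameter name to a flat list of floats. The names `soup` and `StateDict` are hypothetical, and this is not the MoCo library's API or the paper's implementation.

```python
from typing import Dict, List, Optional, Sequence

# Assumed representation: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def soup(models: Sequence[StateDict],
         weights: Optional[Sequence[float]] = None) -> StateDict:
    """Average parameters element-wise across models with identical shapes."""
    if not models:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0] * len(models)  # uniform soup by default
    if len(weights) != len(models):
        raise ValueError("one weight per model is required")
    total = sum(weights)

    names = models[0].keys()
    for m in models:
        if m.keys() != names:
            raise ValueError("models must share the same parameter names")

    merged: StateDict = {}
    for name in names:
        size = len(models[0][name])
        if any(len(m[name]) != size for m in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Weighted mean of each parameter entry across all models.
        merged[name] = [
            sum(w * m[name][j] for w, m in zip(weights, models)) / total
            for j in range(size)
        ]
    return merged


# Toy example with two tiny "models".
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
uniform = soup([model_a, model_b])
weighted = soup([model_a, model_b], weights=[3.0, 1.0])
print(uniform)
print(weighted)
```

The shape checks show why architecture unification is needed before weight-level operations: averaging is only defined when every model exposes the same parameters with the same sizes.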

Implications for AI Economics

Market structure and incumbency
- Reduced scale economies of model size: performance gains via modular composition of many smaller, specialized models suggest that capability need not solely accrue to very large, centralized models. This weakens a pure “bigger-is-better” concentration dynamic and can lower barriers to competitive entry.
- New multi-sided platforms: orchestration layers (routers, marketplaces, aggregation services) become valuable intermediaries that can capture rent even if model development is decentralized. Platform economics (network effects, two-sided markets) will likely shape future concentration patterns differently — value may accrue to orchestration and trust-provisioning entities rather than model builders alone.
Specialization, comparative advantage, and complementarities
- Markets for niche models: contributors can specialize on narrow domains, values, or languages; modular composition exploits complementarities, enabling efficient division of labor and richer product-market fit.
- Non-linear returns via composition: emergent capabilities appear when specialized components interact; this creates complementarities that can generate outsized returns to coordination and interoperability standards.
Cost structure and production
- Cost-efficient capability growth: composed collections of smaller models can outperform much larger monoliths, implying potentially lower aggregate compute/monetary costs to reach competitive performance for many tasks.
- Updateability and incremental investment: modular systems allow targeted upgrades to specific components (lower switching/update costs), improving responsiveness to new information and reducing costs of continual retraining.
Incentives, compensation, and governance
- Need for incentive mechanisms: contributors require reputational or monetary incentives (micro-payments, revenue shares, bounties) to participate honestly. Market design must address free-riding, contribution quality, and provisioning of public-good components.
- Reputation, provenance, and accountability: modular systems facilitate traceability (which component produced what), enabling better auditing and liability assignment — but also requiring standardized provenance metadata and certification markets.
- Risk of platform rent extraction: orchestration and aggregation layers could extract disproportionate rent unless governance, open standards, or competition constrain them.
Externalities, robustness, and security
- Heterogeneous contributions increase robustness to biased or narrow training data, improving representation and pluralism — a social welfare gain.
- However, the open participatory model introduces risks: malicious or low-quality contributors, coordination failures, collusion, and supply-side fragmentation. Economic design (verification, vetting, reputational systems, insurance) is required to mitigate these negative externalities.
Intellectual property and licensing
- Interoperability requires clear licensing regimes and model metadata. Intellectual property disputes and incompatible licenses could stymie modular markets unless standardized licensing frameworks (or clearinghouses) emerge.
- Distillation and weight-merging raise legal/contractual questions over downstream use, derivative works, and revenue sharing.
Policy and regulatory implications
- Antitrust and competition: regulators should note that decentralization can reduce concentration but new gatekeepers (platforms, orchestrators) may arise; policy should target competition along orchestration layers and open-interoperability requirements.
- Public funding and commons infrastructures: public investment in shared orchestration infrastructure (benchmarks, routers, verification tools, provenance registries) could increase social returns and reduce capture.
- Safety and audits: modularity can improve auditability, but regulation should require provenance, standardized documentation (model cards), and checks against harmful coordinated behavior.
Labor and innovation
- Demand for specialized model-building talent increases (niche domain expertise, multilingual/data-specific modeling), potentially broadening participation geographically.
- Innovation may accelerate through modular reuse and composition, lowering marginal cost of experimentation and enabling many small teams to contribute incremental improvements.

Practical market-design recommendations (concise)
- Standardize interfaces, metadata (model cards, performance on uniform benchmarks), and provenance to enable safe, verifiable composition.
- Build marketplaces with reputation, micro-payments, and dispute-resolution to align incentives.
- Encourage open orchestration tooling and public goods (benchmark suites, verification frameworks) to reduce centralized gatekeeping.
- Implement certification/auditing regimes and liability rules so responsibility for outputs is clear across composed systems.
- Monitor platform concentration at orchestration layers and consider policies (interoperability mandates, open APIs) to promote competition.

Limitations & cautions (economic perspective)
- Experimental pool is academically sourced; results may differ with commercial/proprietary models or adversarial participants.
- Distillation and architecture-unification steps may lose information; economic value of original model IP may be affected.
- Transitioning to participatory markets requires solving non-trivial incentive, governance, and standards problems before broad adoption.

Overall, this work suggests a plausible economic pathway away from monolithic-model dominance toward a modular ecosystem in which specialization, composition, and orchestration create new markets, change rent allocation, and reshape incentives for innovation and governance in AI.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports systematic experimental gains across multiple tasks and presents ablation/diversity analyses, which supports the claim that the modular approach can outperform monolithic models; however, key details are missing (number and selection of contributors, training/data heterogeneity, statistical significance reporting, and choice of baselines), and results are confined to benchmarks rather than real-world economic outcomes, leaving room for selection and evaluation biases.

Methods Rigor: medium. Methodologically sound in presenting controlled model-comparison experiments and analyses of diversity effects, but the rigor is limited by lack of transparent reporting on contributor/sample size and representativeness, limited description of evaluation metrics and statistical tests, potential benchmark selection bias, and unclear reproducibility of the compositional framework.

Sample: A set of small models trained by diverse individual contributors on their own interests/priorities (exact number and training data not reported in the summary), combined into modular compositional systems and evaluated across 15 tasks (reasoning, factuality, etc.); compared to larger monolithic LLM baselines (specific models and sizes not specified).

Themes: innovation, governance

Identification: Controlled experimental comparisons: assemble modular systems from many contributor-trained small models and evaluate them against monolithic LLM baselines on a suite of 15 benchmark tasks (reasoning, factuality, etc.), with ablation and diversity analyses to link contributor heterogeneity to performance gains.

Generalizability:
- Unknown representativeness and number of contributors; possible selection bias toward skilled or motivated participants
- Benchmarks may not reflect real-world downstream economic tasks or production workloads
- Comparisons depend on choice of monolithic baselines and evaluation metrics; may not hold against state-of-the-art tuned systems
- Scalability and engineering costs of coordinating many contributors not evaluated
- Emergent capabilities demonstrated on benchmarks may not transfer to safety-critical or production settings

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined.	Output Quality	positive	high	performance on tasks (reasoning and factuality)	n=15; up to 15.4% across 15 tasks; 0.18
Participatory AI systems benefit from contributor diversity.	Output Quality	positive	high	system performance (task performance) as influenced by contributor diversity	0.18
Participatory AI systems substantially improve on each contributor's original priorities.	Output Quality	positive	medium	alignment / performance on contributors' priority objectives	0.11
Participatory AI systems exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail.	Output Quality	positive	high	problem solving / success rate on instances unsolved by individual models	over 15% of problems where all individual models fail; 0.18
Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems.	Other	positive	high	architecture / system design (modular composition of contributor models)	0.09
Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.	Governance And Regulation	positive	high	feasibility of decentralized/participatory transition (technical foundation / governance implication)	0.03