Participatory provenance as representational auditing for AI-mediated public consultation

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation ($n = 5{,}253$ respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline ($-9.1\%$ and $-8.0\%$ coverage degradation), with $16.9\%$ and $15.3\%$ of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI ($33$-$88\%$ exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

Summary

Main Finding

AI-mediated summarization of large-scale public consultation responses can systematically exclude dissenting and critical voices. Applying a new "participatory provenance" framework to Canada’s 2025–2026 national AI Strategy consultation (two topics; n = 5,253 respondents) shows that the official government summaries represent the input population worse than a random-participant baseline (coverage degradation: −9.1% for Education, −8.0% for Trust). Overall, 15–17% of participants are effectively excluded, and exclusion concentrates in clusters expressing scepticism, critique or rejection of AI (cluster-level exclusion reaches 78.6% in Education and 88.1% in Trust). Brevity, semantic isolation and rhetorical register independently predict exclusion. An open-source tool (Co-creation Provenance Lab) operationalizes these diagnostics for policymakers.

Key Points

Participatory provenance: a formal measurement framework that audits the input→summary transformation rather than only output quality.
Four measurement components:
- Individual coverage score c(i) = max_j cos(e_i, s_j): semantic proximity of each participant to nearest summary sentence.
- Distributional divergence: Wasserstein-2 (W2) between participant embedding distribution and summary sentence distribution.
- Causal analysis: doubly-robust estimates to identify participant-level predictors of coverage (e.g., length, semantic distinctiveness, rhetorical register).
- Bidirectional concept fidelity: forward recall (participant concepts surviving in summary) and backward precision (summary concepts traceable to participants).
Empirical results (two topics):
- Samples: Education & Skills (n = 2,496), Safe AI & Public Trust (n = 2,757).
- Official summaries underperform a 6-sentence random-participant baseline (mean coverage drops of 9.1% and 8.0%).
- Exclusion rates: 16.9% (Education) and 15.3% (Trust) using threshold c < mean − std.
- Exclusion is concentrated in dissent/critique clusters (e.g., Critique of EdTech: 78.6% excluded; Distrust in oversight: 88.1% excluded), while pro-policy/workforce clusters have near-zero exclusion.
- Gini of coverage indicates representational inequality (G ≈ 0.18 Education; 0.15 Trust).
- Distortion persists relative to extractive baselines (centroid and greedy extractive selection) — abstractive summarization increased distributional divergence.
Robustness: patterns replicate across topics, embedding models, and parameter settings; cluster ordering invariant to reasonable threshold choices.
Tooling: Co-creation Provenance Lab (open-source) provides interactive diagnostics and supports iterative, human-guided summary improvements.
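
The individual coverage score and the exclusion rule listed above can be illustrated with a minimal sketch. It assumes embeddings are already available as plain lists of floats and uses the population standard deviation for the threshold; the source does not say which standard-deviation convention the authors used, and the toy vectors are hypothetical.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def coverage_scores(participants, summary):
    """c(i) = max_j cos(e_i, s_j) for every participant embedding e_i."""
    return [max(cosine(e, s) for s in summary) for e in participants]

def excluded(scores):
    """Flag participants whose coverage is below mean - 1 * std.

    Assumption: population standard deviation (statistics.pstdev).
    """
    threshold = statistics.mean(scores) - statistics.pstdev(scores)
    return [c < threshold for c in scores]

# Toy 2-D "embeddings" (hypothetical values, not consultation data).
participants = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
summary = [[1.0, 0.0], [0.7, 0.3]]

scores = coverage_scores(participants, summary)
flags = excluded(scores)
print([round(c, 3) for c in scores], flags)
```

In this toy case only the last participant, whose vector points away from both summary sentences, falls below the threshold.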

Data & Methods

Data: Canada’s 2025–2026 national AI Strategy consultation. Full consultation had ~64,600 responses; this study analyzes two pillars after preprocessing:
- Education & Skills: 2,496 usable English responses.
- Safe AI & Public Trust: 2,757 usable English responses.
- 2,392 respondents answered both topics (within-study replication).
Preprocessing: minimum-length filtering, French-language removal, near-duplicate detection, two-stage relevance filtering (embedding similarity + LLM adjudication).
Representations: pre-trained sentence transformers; PCA to 50 components (explained variance ~56%).
Clustering: k-means with k selected by multi-metric consensus (Silhouette, Calinski-Harabasz, Davies-Bouldin, Gap) and bootstrap stability (100 iterations).
Coverage metric: cosine similarity from participant embedding to each summary-sentence embedding; exclusion threshold = mean − 1×std.
Baselines:
- Random-participant baseline: 2,000 permutations drawing J=6 participants as pseudo-summary sentences.
- Centroid baseline: cluster centroids as pseudo-summary sentences.
- Greedy extractive-optimal baseline: best achievable selection of actual participant quotes.
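
The random-participant and greedy extractive baselines can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: participants are drawn without replacement, the greedy objective is mean coverage, and "coverage degradation" is taken to be the relative difference between official and baseline mean coverage. All function names are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_coverage(participants, sentences):
    """Mean over participants of the max cosine similarity to any sentence."""
    return sum(max(cosine(e, s) for s in sentences) for e in participants) / len(participants)

def random_baseline(participants, j=6, n_perm=2000, seed=0):
    """Average mean coverage when J random participants act as the summary.

    Assumption: the J participants are sampled without replacement.
    """
    rng = random.Random(seed)
    draws = [mean_coverage(participants, rng.sample(participants, j)) for _ in range(n_perm)]
    return sum(draws) / n_perm

def greedy_extractive(participants, j=6):
    """Greedily pick the participant that most increases mean coverage.

    Assumption: this greedy rule stands in for the "extractive-optimal" baseline.
    """
    chosen = []
    best = [-1.0] * len(participants)  # best similarity so far, per participant
    for _ in range(j):
        top_gain, top_idx, top_best = -math.inf, None, None
        for idx, cand in enumerate(participants):
            if idx in chosen:
                continue
            new_best = [max(b, cosine(e, cand)) for b, e in zip(best, participants)]
            gain = sum(new_best)
            if gain > top_gain:
                top_gain, top_idx, top_best = gain, idx, new_best
        chosen.append(top_idx)
        best = top_best
    return chosen

def relative_degradation(official, baseline):
    """Assumed definition: (official - baseline) / baseline."""
    return (official - baseline) / baseline

# Toy data: two tight groups of opinions (hypothetical values).
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
picked = greedy_extractive(toy, j=2)
greedy_cov = mean_coverage(toy, [toy[i] for i in picked])
random_cov = random_baseline(toy, j=2, n_perm=200, seed=0)
print(picked, round(greedy_cov, 3), round(random_cov, 3))
```

On the toy data the greedy selection takes one quote from each group, so its mean coverage sits above the random-draw average, which sometimes samples both pseudo-sentences from the same group.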
Distributional distance: Wasserstein-2 (W2) in PCA-50 space.
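
For reference, the standard definition of the Wasserstein-2 distance between two empirical distributions is written out here. Uniform weights over the n participant vectors x_i and the J summary-sentence vectors y_j (both in PCA-50 space) are an assumption; the source does not state how the two distributions are weighted.

```latex
W_2(\mu, \nu) = \left( \min_{\gamma \in \Pi(\mu, \nu)} \sum_{i=1}^{n} \sum_{j=1}^{J} \gamma_{ij} \, \lVert x_i - y_j \rVert_2^2 \right)^{1/2},
\qquad
\Pi(\mu, \nu) = \Big\{ \gamma \ge 0 : \sum_{j=1}^{J} \gamma_{ij} = \tfrac{1}{n}, \; \sum_{i=1}^{n} \gamma_{ij} = \tfrac{1}{J} \Big\}
```

Each coupling γ describes how participant mass is moved onto summary sentences; a larger W2 means the summary sits further from where participants actually are in embedding space.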
Inequality: Gini coefficient of coverage distribution; cluster F-tests for coverage differences.
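
A Gini coefficient over per-participant coverage scores can be computed with the standard sorted-rank formula. The sketch below assumes non-negative coverage values (cosine similarity can in principle be negative, which this formula does not handle) and is not taken from the authors' code.

```python
def gini(values):
    """Gini coefficient of non-negative values via the sorted-rank formula.

    G = 2 * sum_i(i * x_(i)) / (n * sum(x)) - (n + 1) / n, with ranks i = 1..n.
    Returns 0.0 for empty input or an all-zero total.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Hypothetical coverage scores: equal coverage vs. one participant holding it all.
print(gini([0.5, 0.5, 0.5, 0.5]))
print(gini([0.0, 0.0, 0.0, 1.0]))
```

Perfectly equal coverage gives 0, and concentration of coverage in a single participant pushes the value toward (n − 1)/n.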
Causal estimation: doubly-robust causal inference to estimate effects of treatments (response length, distinctiveness, rhetorical register) on coverage.
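
The source does not name the doubly-robust estimator used. One common choice is augmented inverse-probability weighting (AIPW), sketched here as an assumption. The sketch also assumes a binarized treatment (for example, "short response" vs. not) and takes fitted propensity scores and outcome predictions as inputs rather than fitting them; all names are hypothetical.

```python
def aipw_ate(y, t, e_hat, mu1_hat, mu0_hat):
    """AIPW estimate of the average treatment effect.

    y       : observed outcomes (e.g. coverage scores)
    t       : binary treatment indicators (0 or 1)
    e_hat   : estimated propensity scores P(T = 1 | X), strictly in (0, 1)
    mu1_hat : predicted outcome under treatment, E[Y | T = 1, X]
    mu0_hat : predicted outcome under control,   E[Y | T = 0, X]
    """
    scores = []
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, mu1_hat, mu0_hat):
        # Outcome-model contrast plus inverse-propensity-weighted residual corrections.
        psi = (m1 - m0) + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
        scores.append(psi)
    return sum(scores) / len(scores)

# Hypothetical toy data with a true effect of 2.
y = [3.0, 1.0, 3.0, 1.0]
t = [1, 0, 1, 0]
e = [0.5, 0.5, 0.5, 0.5]

# Correct outcome model: residual corrections vanish.
print(aipw_ate(y, t, e, [3.0] * 4, [1.0] * 4))
# Wrong outcome model (all zeros) but correct propensities: weighting recovers the effect.
print(aipw_ate(y, t, e, [0.0] * 4, [0.0] * 4))
```

The two calls illustrate the doubly-robust property: the estimate stays on target when either the outcome model or the propensity model is correct.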
Concept fidelity: bidirectional measurement of concept survival and traceability between participants and summary.
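
Forward recall and backward precision can be illustrated with simple set arithmetic. How the authors extract and match concepts is not described in this summary, so the sketch assumes concepts have already been extracted as string labels and are matched exactly; the labels themselves are hypothetical.

```python
def concept_fidelity(participant_concepts, summary_concepts):
    """Return (forward_recall, backward_precision).

    forward_recall     : share of distinct participant concepts that appear in the summary
    backward_precision : share of summary concepts traceable to at least one participant
    Assumption: concepts are pre-extracted labels compared by exact match.
    """
    pool = set().union(*participant_concepts)
    summary = set(summary_concepts)
    shared = pool & summary
    forward_recall = len(shared) / len(pool) if pool else 0.0
    backward_precision = len(shared) / len(summary) if summary else 0.0
    return forward_recall, backward_precision

# Hypothetical concept labels.
participants = [{"privacy", "jobs"}, {"jobs", "bias"}]
summary = {"jobs", "innovation"}
recall, precision = concept_fidelity(participants, summary)
print(round(recall, 3), round(precision, 3))
```

Low forward recall signals participant concepts lost in summarization, while low backward precision signals summary content that no participant raised.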
Limitations acknowledged by authors: focus on English responses, use of specific embedding and clustering pipelines, fixed short official summary length (six sentences), analysis of only two pillars (though replicated across them), and reliance on semantic embeddings as proxy for representational fidelity.

Implications for AI Economics

Distortion of public input has direct economic-policy consequences. If AI summaries systematically exclude dissenting or risk-averse voices, resulting policies may under-weight potential harms, over-favour commercialization incentives, or misallocate public investment (e.g., workforce training vs. regulation).
Manufactured consensus creates biased information inputs into policymaking models and cost–benefit analyses. Economic projections, regulatory impact assessments and market-structure decisions based on such summaries risk being optimistically biased toward adoption and scaling of AI, understating externalities and distributional harms.
Distributional consequences and welfare: exclusion of sceptical/minority viewpoints can lead to policies that overlook vulnerable groups or negative externalities, producing efficiency losses and adverse equity outcomes that standard economic evaluations would miss.
Market design and public procurement: governments buying or deploying summarization systems should internalize representational audit costs. Procurement criteria should include metrics from participatory provenance (coverage, W2, coverage-Gini, concept fidelity) to avoid systematic selection biases that favor industry-friendly narratives.
Regulatory economics: regulators assessing systemic risk, liability frameworks, or subsidy schemes need richer inputs that preserve the heterogeneity of public concerns. Participatory provenance metrics can be integrated into regulatory impact analysis to quantify potential bias in stakeholder synthesis.
Political economy of AI adoption: exclusion patterns concentrated in critique/opposition clusters may signal selection biases that align public messaging with incumbent industry interests or with normative framing that lowers perceived regulatory stringency—affecting lobbying dynamics and incentive structures.
Policy recommendations for economic actors:
- Require representational audits for automated consultation synthesis; report coverage, distributional divergence (W2), and coverage inequality metrics alongside any summary.
- Use extractive or hybrid summarization baselines when representational breadth is a priority; consider targeted weighting or enforced inclusion of low-frequency but high-relevance clusters.
- Incorporate participatory-provenance diagnostics into ex ante assessments (impact evaluations, welfare analysis) and ex post monitoring of implemented policies.
- Fund and adopt human-in-the-loop workflows and tooling (e.g., Co-creation Provenance Lab) to iteratively improve summaries while retaining efficiency gains.
Research agenda for AI economics:
- Quantify downstream policy outcome differences when using audited vs. unaudited summaries—estimate welfare and distributional impacts.
- Model how summarization-induced information distortion propagates through policymaker decision rules and affects economic equilibria (e.g., adoption rates, regulatory tightness, public-good provision).
- Explore correction mechanisms (reweighting, enforced inclusion quotas, longer/structured summaries) and their cost-effectiveness.

Short caveat: the framework uses semantic embeddings and clustering as proxies for "voice" and "topic" fidelity; while robust in this study, embedding/model choice and linguistic diversity (e.g., languages, registers) matter for deployment decisions.

Assessment

Paper Type: correlational

Evidence Strength: medium. Uses a large, real-world dataset (n=5,253) and a principled quantitative framework (optimal transport + semantic analysis) to produce clear, replicable metrics of representational loss, but does not leverage randomized variation or natural experiments; results depend on modeling choices (embeddings, clustering, alignment parameters) and on assumptions required for observational causal claims.

Methods Rigor: medium. Applies advanced, state-of-the-art methods (optimal transport, semantic embeddings, formalized baseline comparisons) and reports concrete exclusion metrics and predictors, but methodological sensitivity to text-encoding choices, clustering thresholds, and alignment hyperparameters is not eliminable; also potential measurement error from summarizer style and multilingual/format variation could affect robustness.

Sample: Submissions to Canada's 2025–2026 national AI Strategy public consultation, comprising 5,253 respondents across two independent policy topics; dataset includes raw participant texts and the official government summaries for those two topics; analyses include semantic clustering of submissions and per-participant alignment to summary content.

Themes: governance, human_ai_collab

Identification: Develops a measurement framework ('participatory provenance') that maps participant submissions to summary text using semantic embeddings and optimal-transport alignment to quantify coverage and exclusion; evaluates official summaries against a random-participant baseline (simulation) to estimate coverage degradation; assesses predictors of representational outcomes using observational regression/causal-inference routines (conditioned on observed semantic and metadata covariates). Identification therefore rests on the alignment metric and counterfactual baseline plus conditional-independence assumptions for predictor analyses.

Generalizability:
- Single-country (Canada) and single policy process/time period; may not generalize to other countries or consultation designs.
- Two policy topics only; topical idiosyncrasies may drive exclusion patterns.
- Relies on English (and possibly bilingual) textual embeddings and NLP tools; performance may differ across languages and dialects.
- Findings depend on specific government summarization style and conventions; other summarizers (commercial or internal) may behave differently.
- Observational design; predictors of exclusion may not be causal and may not generalize beyond this sample.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
No formal framework exists for auditing whether AI-generated summaries faithfully represent the source population.	Other	null_result	medium	existence of an auditing framework for input fidelity	0.09
This paper introduces 'participatory provenance': a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization.	Other	positive	high	ability to track transformations/filtration/loss of individual submissions	0.3
The framework is applied to Canada's 2025-2026 national AI Strategy consultation with n = 5,253 respondents across two independent policy topics.	Other	positive	high	sample and context for empirical evaluation	n=5253 0.5
The official government summary for topic A (Education & Skills) underperforms a random-participant baseline (coverage degradation of -9.1%).	Output Quality	negative	high	coverage (coverage degradation relative to random baseline)	n=5253 -9.1% coverage degradation 0.3
The official government summary for topic B (Safe AI & Public Trust) underperforms a random-participant baseline (coverage degradation of -8.0%).	Output Quality	negative	high	coverage (coverage degradation relative to random baseline)	n=5253 -8.0% coverage degradation 0.3
In topic A, 16.9% of participants are effectively excluded by the official summary.	Output Quality	negative	high	participant exclusion rate	n=5253 16.9% of participants excluded 0.3
In topic B, 15.3% of participants are effectively excluded by the official summary.	Output Quality	negative	high	participant exclusion rate	n=5253 15.3% of participants excluded 0.3
Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI, with exclusion rates of 33%–88% in such clusters.	Output Quality	negative	high	cluster-level exclusion rate for dissenting/sceptical/critical clusters	33%-88% exclusion rates 0.3
Brevity, semantic isolation and rhetorical register independently predict representational outcome (i.e., which submissions are included/excluded in summaries).	Output Quality	negative	high	predictive relationship between textual features and representational outcome (coverage/exclusion)	n=5253 0.3
An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.	Governance And Regulation	positive	medium	availability and claimed capability of the Co-creation Provenance Lab to support auditing and iterative improvement of summaries	0.03
Existing approaches to AI explainability, grounding and hallucination detection do not address input fidelity because they focus on output quality.	AI Safety And Ethics	negative	medium	scope of existing explainability/grounding/hallucination detection methods with respect to input fidelity	0.09

AI-assisted government summaries underrepresent dissent: the official summaries of Canada's 2025–26 AI Strategy consultation effectively excluded 15–17% of participants and covered respondents 8–9% worse than a random-participant baseline, with critics of AI facing the highest exclusion rates.