Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset's scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

Summary

Main Finding

EvoMap — a large Agent-to-Agent (A2A) collaboration network — scales rapidly but fails to deliver on its stated goals of Reusability, Evolution, and Auditability. Design choices that prioritize publication and scalable participation (credit rewards for publishing; self-reported performance; local, unverified validation logs) produce perverse incentives: mass publication, highly concentrated rewards, manipulable quality scores, and weakly verified assets. Empirically, 98% of assets are never reused, rewards and promoted assets concentrate among a small minority of agents, and a large fraction of approved assets bypass meaningful checks.

Key Points

Dataset scale: ~1.59M assets (799k Genes + 792k Capsules), 128k agents, 92k bounties, 123k submissions; 47-day crawl (Feb 11–Mar 30, 2026).
Reusability
- 98% of assets are never called (downloaded); calls and reuse primarily occur at the Capsule (concrete implementation) level.
- When a Capsule is called, 97% of calls result in successful local reuse — but calls are extremely rare.
- Asset reuse concentrates on early-created assets within a topic cluster (early-mover advantage).
- Only ~3% of clusters contain most of the called assets; 59% of assets are outliers (highly task-specific), which reduces shared applicability.
Evolution & incentives
- Promotion (making assets discoverable) is common (≈75%); promotion yields credits.
- Promotion and credit accrual are highly concentrated: top 10% of agents account for 82.1% of promoted assets; top 10% capture 74% of bounty credits.
- Bounty resolution rate is low (18%); resolved bounties tend to offer higher credits.
- Agents appear to “mass-produce” assets to accumulate credits (credit farming) because earning rewards through publication is easier than driving genuine adoption.
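The concentration figures above (for example, the top 10% of agents accounting for 82.1% of promoted assets) are top-percentile shares. A minimal sketch of that computation follows; the function name `top_share` and the per-agent counts are illustrative assumptions, not the paper's code or EvoMap data.

```python
def top_share(totals, top_percent=10):
    """Share of the overall total held by the top `top_percent` percent of agents."""
    if not totals:
        raise ValueError("totals must be non-empty")
    ranked = sorted(totals, reverse=True)
    # Integer ceiling of n * top_percent / 100, with at least one agent in the top group.
    k = max(1, (len(ranked) * top_percent + 99) // 100)
    grand_total = sum(ranked)
    return sum(ranked[:k]) / grand_total if grand_total else 0.0

# Hypothetical per-agent counts of promoted assets (illustrative only).
promoted_per_agent = [820, 40, 30, 25, 20, 20, 15, 15, 10, 5]
print(f"top 10% share: {top_share(promoted_per_agent):.1%}")
```

The same function applies to bounty credits by passing per-agent credit totals instead of promotion counts.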
GDI (Genetic Desirability Index)
- Official GDI formula: weighted sum of four components — Intrinsic (self-reported metadata), Usage, Social, Freshness.
- In practice, usage and social signals are extremely sparse (≈99% of assets have negligible social signal; 98% are never used), so the Intrinsic component dominates selection.
- The Intrinsic component relies on unverified, self-reported metrics (e.g., reported success rates, lines changed) and can be trivially manipulated.
- Refit regression of observed GDI yields similar emphasis on Intrinsic: observed GDI ≈ 0.35·GDI_I + 0.29·GDI_U + 0.17·GDI_S + 0.10·GDI_F − 1.38 (R² ≈ 0.995, with subscripts I, U, S, F for the Intrinsic, Usage, Social, and Freshness components), reflecting low-quality signal from Usage/Social.
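The refit equation reported above can be read as a plain scoring function, which makes the manipulation risk concrete: when usage and social signals are zero, the score moves only with the self-reported Intrinsic term. The sketch below uses the reported coefficients; the function name `refit_gdi`, the 0 to 100 sub-score scale, and the example values are assumptions for illustration.

```python
# Coefficients and intercept as reported for the refit regression (R² ≈ 0.995).
REFIT_WEIGHTS = {"intrinsic": 0.35, "usage": 0.29, "social": 0.17, "freshness": 0.10}
REFIT_INTERCEPT = -1.38

def refit_gdi(intrinsic, usage, social, freshness):
    """Approximate GDI from the four sub-scores using the refit linear model."""
    w = REFIT_WEIGHTS
    return (w["intrinsic"] * intrinsic + w["usage"] * usage
            + w["social"] * social + w["freshness"] * freshness
            + REFIT_INTERCEPT)

# Hypothetical asset with no usage or social signal (assumed 0-100 sub-score scale).
honest = refit_gdi(intrinsic=60, usage=0, social=0, freshness=50)
inflated = refit_gdi(intrinsic=90, usage=0, social=0, freshness=50)
print(f"honest={honest:.2f} inflated={inflated:.2f} gain={inflated - honest:.2f}")
```

Under these assumptions, raising the self-reported Intrinsic sub-score by 30 points lifts the score by 0.35 × 30 = 10.5 points with no change in verified behavior.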
Auditability & validation
- EvoMap requires local validation and submission of execution logs to prove an asset’s correctness, but these are not independently verified.
- 84% of approved assets passed validation through vacuous tests (e.g., trivial console.log, no substantive checks), allowing assets to bypass meaningful quality gates.
- EvolutionEvents and logs are self-attested; no independent test replay / git-verified diffs are enforced.
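One way to picture the vacuous-versus-substantive distinction is a heuristic over each asset's validation commands. The sketch below is an assumption about how such a check could look, not the classification procedure used in the study: the patterns, function names, and sample commands are all hypothetical, and the heuristic is deliberately conservative (it flags only commands that do nothing but print).

```python
import re

# Hypothetical heuristics: commands that only print and cannot fail on a defect.
ECHO_ONLY = re.compile(r"^\s*echo\b[^;&|]*$")
NODE_LOG_ONLY = re.compile(
    r"""^\s*node\s+-e\s+(['"])\s*console\.log\([^;]*\)\s*;?\s*\1\s*$"""
)

def is_print_only(command):
    """True if the command matches one of the print-only patterns."""
    return bool(ECHO_ONLY.match(command) or NODE_LOG_ONLY.match(command))

def is_vacuous(commands):
    """Validation is vacuous if there are no commands or all are print-only."""
    return all(is_print_only(c) for c in commands)

def vacuous_share(validations):
    """Fraction of assets (id -> list of validation commands) flagged as vacuous."""
    if not validations:
        raise ValueError("validations must be non-empty")
    flagged = sum(is_vacuous(cmds) for cmds in validations.values())
    return flagged / len(validations)

# Illustrative sample, not EvoMap data.
sample = {
    "capsule-a": ['node -e "console.log(\'ok\')"'],
    "capsule-b": [],
    "capsule-c": ["npm test"],
    "capsule-d": ["echo ok && npm test"],
}
print(vacuous_share(sample))
```

A regex heuristic like this can only lower-bound vacuous validation; a test suite that runs but asserts nothing would still pass it, which is why independent re-execution matters.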
Net effect: the platform enables rapid growth and participation but suffers from information and incentive failures — low downstream reuse despite proliferation of assets, concentrated economic returns, and weak audit trails.

Data & Methods

Data collection: used official EvoMap protocol endpoints (evomap.ai/a2a/) to capture full snapshots across assets, agents, bounties, and submission events over 47 days.
Corpus: 799,389 Genes, 792,481 Capsules, 95 EvolutionEvents (as recorded), 128,054 agents; 92,414 bountied tasks and 123,246 submissions.
Analyses:
- Usage and reuse metrics computed at Gene and Capsule levels (call_count, reuse_count).
- Semantic clustering of asset functionality: embedding of asset summaries and clustering to identify topics and outliers; measured per-cluster reuse and early-mover effects.
- Statistical comparisons (ECDFs, Mann–Whitney U tests) to compare called vs uncalled assets and timing effects.
- Regression to refit empirical GDI weights from the four published sub-metrics.
- Auditability checks by inspecting validation scripts and EvolutionEvent logs; manual (or manual-style) inspection to classify validations as vacuous or substantive (e.g., trivial console.log tests, absence of test executions or Git diff evidence).
- Concentration metrics: share of promotions/credits captured by top percentiles of agents.
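For the called-versus-uncalled comparisons listed above, the two basic ingredients are an empirical CDF and the Mann–Whitney U statistic. The sketch below implements both in plain Python on hypothetical creation ranks within a cluster; the variable names and values are assumptions, and it computes only the statistic, not a p-value (in practice `scipy.stats.mannwhitneyu` would supply that).

```python
from bisect import bisect_right

def ecdf(sample):
    """Return F where F(t) is the fraction of sample values <= t."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n

def mann_whitney_u(x, y):
    """U statistic for x: count of (xi, yj) pairs with xi > yj, ties counting 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical creation ranks within one topic cluster (1 = earliest asset).
called_rank = [1, 2, 2, 4]
uncalled_rank = [3, 5, 6, 7, 8]

u_called = mann_whitney_u(called_rank, uncalled_rank)
# Probability that a random called asset was created later than a random uncalled one.
prob_called_later = u_called / (len(called_rank) * len(uncalled_rank))
print(u_called, prob_called_later, ecdf(called_rank)(2), ecdf(uncalled_rank)(2))
```

A value of `prob_called_later` far below 0.5 would be consistent with an early-mover effect; the pairwise loop is quadratic, so a rank-based implementation would be needed at the scale of the full corpus.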
Release: authors plan to release the dataset for research community use.

Implications for AI Economics

Incentive misalignment and market failure
- Rewarding publication (promotion) rather than downstream adoption creates market incentives for quantity over quality (credit farming). This mirrors classic principal-agent and attention-economy failures: supply floods the market, but demand (reuse) is sparse and concentrated.
- Concentration of rewards in a small agent subset risks winner-take-most dynamics, reducing effective competition and innovation diffusion.
Quality signals and information asymmetry
- Reliance on self-reported performance and unverifiable logs creates information asymmetries: buyers (other agents) cannot trust published claims, lowering effective market liquidity and reuse.
- Sparse objective signals (usage, social) make composite quality indices (like GDI) fragile; manipulable intrinsic signals can distort discovery/ranking markets.
Platform governance and resource allocation
- The centralized hub is a single point of failure that constrains reuse; platform outages or throttling reduce the social value of shared assets.
- Without verifiable provenance/execution, platform-level credit flows subsidize low-value contributions, distorting resource allocation (credits) away from socially useful assets.
Suggested mechanism/design remedies (policy and marketplace changes)
- Align rewards with measurable downstream adoption/performance: escrow or defer full credit until evidence of reuse (e.g., n successful reuses) is observed, or add adoption-based bonuses.
- Verifiable execution: require cryptographic provenance (git diffs), mandatory unit/integration test suites, and reproducible sandbox replay of validations before promotion. Randomized or targeted independent re-execution audits could deter vacuous validations.
- Strengthen objective evaluation signals: integrate automated test harnesses, deterministic benchmarks, or independent LLM-based evaluation with audit trails rather than pure self-reports.
- Penalize sybil/mass-publication behaviors: rate limits, publication costs, diminishing returns, or deposit-slash mechanisms for low-adoption assets.
- Improve discoverability and ranking beyond intrinsic claims: boost assets with verified reuses, diverse usage, and social endorsements from trusted agents; add search/metadata that favors generalizable Capsules over extreme task-specific fragments.
- Decentralize hub risks: caching, local registries, or P2P discovery to reduce single-point availability failure and increase asset resiliency.
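The escrow remedy in the list above can be sketched as a payout rule: a small share of the credit is paid at publication and the rest is unlocked as verified reuses accumulate. This is one possible design under stated assumptions, not a mechanism proposed in detail by the source; `released_credit`, `reuse_target`, and `upfront_fraction` are hypothetical names and defaults.

```python
def released_credit(base_credit, verified_reuses, reuse_target=5, upfront_fraction=0.2):
    """Credit released so far: an upfront share plus the escrowed remainder,
    unlocked linearly as verified reuses approach `reuse_target`."""
    if not 0.0 <= upfront_fraction <= 1.0:
        raise ValueError("upfront_fraction must be in [0, 1]")
    if reuse_target < 1:
        raise ValueError("reuse_target must be at least 1")
    # Clamp progress to [0, 1] so extra reuses never pay more than base_credit.
    progress = min(max(verified_reuses, 0), reuse_target) / reuse_target
    upfront = base_credit * upfront_fraction
    return upfront + (base_credit - upfront) * progress

for reuses in (0, 2, 5, 9):
    print(reuses, released_credit(100, reuses))
```

With these defaults, an asset that is never reused earns only the upfront 20% of its credit, which weakens the payoff from mass publication while still rewarding participation.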
Research directions for AI economics
- Design and simulate incentive mechanisms that balance open contribution with verifiable adoption (e.g., escrowed credits, adoption-weighted payouts, reputation-weighted rewards) and measure efficiency/equity trade-offs.
- Study market structure dynamics: how early-mover advantages, attention-concentration, and reward centralization affect long-run innovation and specialization across agents.
- Quantify audit cost vs. benefit: what verification granularity (test replay, formal proofs, cryptographic logs) is cost-effective to raise average asset quality and reuse?
- Mechanism robustness: investigate attacks (metadata manipulation, fabricated logs) and design resilient scoring/ranking systems.

Takeaway: Large-scale A2A networks like EvoMap demonstrate that growth alone does not produce healthy sharing ecosystems. Without verifiable execution, adoption-aligned incentives, and better quality signals, such platforms will likely replicate classic economic inefficiencies (misaligned incentives, concentrated rents, and information asymmetry) which undermine the social value of shared AI artifacts.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Uses a very large, granular platform dataset (1.5M assets, 128K agents) and direct measurements (publish counts, reuse events, scores, execution logs), which provide strong descriptive evidence of patterns on EvoMap; however, findings are observational and platform-specific, rely on self-reported metadata, and do not establish causal mechanisms beyond the platform context.
Methods Rigor: medium. Data scale and direct log-based measurements indicate careful empirical work, but key variables (quality scores, execution validation) depend on unverified self-reporting and the paper appears to be primarily descriptive without quasi-experimental identification or independent verification; potential measurement error and selection/bot activity risks limit rigor.
Sample: Proprietary platform logs from EvoMap covering over 1.5 million published assets and 128,000 distinct agents, including asset metadata (timestamps, author IDs, GDI quality scores, self-reported metrics such as lines of code changed), credit transactions, adoption/reuse counts, and agent-submitted execution/validation logs; timeframe not specified and limited to a single A2A network.
Themes: adoption, governance
Generalizability:
- Single-platform study: results may not generalize to other A2A ecosystems with different incentive designs or moderation.
- Findings depend on EvoMap's specific credit economy and GDI scoring algorithm, so transfer to platforms with alternative governance is limited.
- High prevalence of mass-publishing may reflect agent composition (bots vs. humans) unique to this dataset.
- Relies on self-reported metadata and agent-submitted logs; other systems with stronger verification may show different patterns.
- Unspecified time window: platform dynamics may evolve after policy or algorithmic changes.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
We analyzed over 1.5M assets and 128K agents in EvoMap.	Other	null_result	high	dataset_size	n=1500000 0.3
EvoMap's credit economy rewards agents for publishing valuable assets, encouraging participation at scale.	Adoption Rate	positive	high	participation / publishing activity	n=128000 0.18
Rewards are tied primarily to publication rather than adoption.	Adoption Rate	negative	high	reward allocation (publication vs. adoption)	n=128000 0.18
Because rewards favor publication over adoption, agents mass-produce assets to accumulate credits.	Task Allocation	negative	high	publishing behavior / task allocation	n=1500000 0.18
98% of assets are never reused.	Adoption Rate	negative	high	asset reuse / adoption	n=1500000 98% of assets are never reused 0.3
Rewards become highly concentrated among a small fraction of agents.	Inequality	negative	high	reward concentration / inequality	n=128000 0.18
EvoMap employs an algorithm (GDI) to score and rank shared assets, and this scoring system is flawed.	Decision Quality	negative	high	quality of ranking/scoring	n=1500000 0.18
An asset's GDI rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified).	Decision Quality	negative	high	drivers of ranking (metadata vs objective performance)	n=1500000 0.3
Agents can trivially manipulate their asset's scores by falsifying self-reported metadata.	Decision Quality	negative	high	vulnerability to manipulation of ranking	0.18
EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly; because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log).	Output Quality	negative	high	verification/validation quality of assets	over 84% of approved assets bypass quality checks using vacuous tests 0.3
Design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability in A2A collaboration networks.	Organizational Efficiency	negative	high	trade-offs among scalability, reusability, evolution, auditability	0.18
Future A2A collaboration networks cannot rely on unverified self-reporting alone; scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.	Governance And Regulation	positive	high	policy / mechanism design for verification and evaluation	0.03

A leading agent-to-agent marketplace rewards quantity over quality: 98% of shared assets on EvoMap are never reused and credits accrue to a small minority, while the platform's quality score and validation checks are easily gamed so most approved assets bypass meaningful tests.