AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment

Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Three analyses ground the framework. The four factors capture largely complementary information (n=50; $ρ_{\max}=0.61$ for Adoption-Ecosystem, all others $|ρ| \leq 0.37$). A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars ($ρ_s=0.52$, $p<0.01$) and Stack Overflow question volume ($ρ_s=0.49$, $p<0.01$), with VS Code installs ($ρ_s=0.44$, $p<0.05$) reported as illustrative given that only 11 of 35 agents have non-zero installs. On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated ($ρ_s=0.25$; 9 of 11 agents shift by at least 2 ranks), driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. This is precisely why we rest the framework's validity claim on the broader n=35 test rather than the SWE-bench overlap. AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.

Summary

Main Finding

AgentPulse is a continuous, multi-signal evaluation framework that complements static capability benchmarks by surfacing deployment and adoption signals for AI agents. It aggregates 18 real-time signals into four factors (Benchmark Performance, Adoption Signals, Community Sentiment, Ecosystem Health) for 50 agents. The four factors are largely complementary, and a sub-composite built only from Benchmark and Sentiment predicts adoption proxies it does not aggregate (GitHub stars, Stack Overflow question volume, and, more tentatively, VS Code installs) in a circularity-controlled test (n=35). The framework, collected signals, and evaluation harness are publicly released under CC BY 4.0.

Key Points

Framework structure:
- Composite AP(a) = wB·B + wA·A + wS·S + wE·E with default weights wB=0.35, wA=0.25, wS=0.20, wE=0.20.
- Designed as a principled prior and supports custom reweighting.
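
The weighted sum above can be sketched in a few lines. This is a minimal illustration, not the released harness: the function name `composite_score`, the dict layout, the rescaling of custom weights to sum to 1, and the example factor scores are all assumptions made here. Only the default weights come from the summary.

```python
# Default factor weights as stated in the summary.
DEFAULT_WEIGHTS = {"B": 0.35, "A": 0.25, "S": 0.20, "E": 0.20}


def composite_score(factors, weights=None):
    """Weighted sum of the four factor scores, each assumed to lie in [0, 1].

    Custom weights are rescaled to sum to 1 (an assumption of this sketch,
    so that reweighted scores stay on the same [0, 1] scale).
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(weights[k] / total * factors[k] for k in weights)


# Hypothetical agent, not taken from the released data.
example = {"B": 0.80, "A": 0.40, "S": 0.60, "E": 0.50}
default_ap = composite_score(example)
# Custom reweighting that doubles the relative weight of Adoption.
adoption_heavy_ap = composite_score(example, {"B": 1, "A": 2, "S": 1, "E": 1})
```

For this hypothetical agent the default weights give 0.35·0.80 + 0.25·0.40 + 0.20·0.60 + 0.20·0.50 = 0.60, while the adoption-heavy reweighting gives 0.54.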
Signals and coverage:
- 50 agents tracked across 10 workload categories.
- 18 signals from GitHub, package registries (PyPI/npm), IDE marketplaces (VS Code), social platforms (Reddit, HN, Bluesky, Mastodon, Stack Overflow), and benchmark leaderboards.
- The sentiment NLP pipeline ensembles VADER, TextBlob, FinBERT, and DistilBERT-SST2, with sarcasm detection, engagement and credibility weighting, and deduplication, among other steps.
Factor-level relationships (n=50):
- Adoption–Ecosystem: ρ = 0.61 (moderate correlation).
- Other pairwise correlations |ρ| ≤ 0.37, e.g., Benchmark–Adoption ρ = 0.05 (near zero), Sentiment–Adoption ρ = −0.29.
- Conclusion: factors capture largely complementary information.
Circularity-controlled predictive validity (n=35 with public GitHub repos):
- Benchmark+Sentiment (excluding Adoption & Ecosystem) predicts external adoption proxies:
  - GitHub stars (log): Spearman ρs = 0.52, p < 0.01
  - Stack Overflow question volume: ρs = 0.49, p < 0.01
  - VS Code installs (log): ρs = 0.44, p < 0.05 (methodologically thinner: only 11 of 35 agents had non-zero installs)
- Interpretation: benchmark capability combined with community sentiment carries information that manifests in developer adoption, independent of direct GitHub-derived inputs.
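
The shape of this test can be sketched as follows. The sketch is written under several assumptions: the sub-composite reuses the default Benchmark and Sentiment weights renormalized to sum to 1 (the summary does not state the sub-composite weights), the agent rows are invented for illustration, and p-values are not computed. Spearman's ρ is implemented directly as the Pearson correlation of average ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def sub_composite(benchmark, sentiment, w_b=0.35, w_s=0.20):
    """Benchmark+Sentiment only; default weights renormalized (assumption)."""
    return (w_b * benchmark + w_s * sentiment) / (w_b + w_s)


# Hypothetical (benchmark, sentiment, GitHub stars) rows for illustration.
agents = [(0.9, 0.6, 52000), (0.5, 0.7, 20000), (0.7, 0.4, 15000), (0.3, 0.5, 900)]
scores = [sub_composite(b, s) for b, s, _ in agents]
log_stars = [math.log1p(stars) for _, _, stars in agents]
rho = spearman(scores, log_stars)
```

Because the logarithm is monotone, taking log stars does not change a rank correlation; it is kept here only to mirror the proxy as reported. The key design point is that neither input to `sub_composite` is derived from GitHub, so a positive ρ against stars cannot arise from the score aggregating its own target.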
Ranking divergence vs. capability benchmarks:
- On the 11 agents with SWE-bench Verified scores, composite vs. benchmark-only rankings diverge substantially (ρs = 0.25; 9 of 11 agents shift ≥2 ranks). This is largely driven by closed-source high-capability agents having limited observable adoption under the framework’s measurement boundary.
Ablation insights (n=11 SWE subset; diagnostic):
- Full composite correlation with SWE-bench: ρ = 0.03.
- Benchmark-only correlation with SWE-bench: ρ = 0.95.
- Removing Adoption increases ρ to 0.57; removing Benchmark yields ρ = −0.33.
- Caution: the n=11 subset over-represents closed-source high-capability agents; these ablation results are diagnostic and highlight a structural tension between adoption signals and capability when closed-source agents lack observable adoption metrics.
Design and limits:
- No pricing factor (to avoid structural bias).
- Closed-source measurement boundary: agents without public repos/marketplace presence receive zero on observable adoption sub-signals (a measurement choice, not a quality judgment).
- Framework is methodological (not a definitive ground-truth ranking) and intended to be extensible and continuously updated.

Data & Methods

Registry and scope:
- 50 agents across five functional groups and 10 workload categories (e.g., coding, SWE, browser agents, multi-agent frameworks).
- Collection cadence ranges from 5 minutes to 24 hours per signal; pipeline runs autonomously and starts collecting within hours of registry entry.
Signals (18 total, grouped by factor):
- Benchmark: published scores from SWE-bench, GAIA, WebArena, HumanEval+, TAU-bench.
- Adoption: GitHub stars (+velocity), PyPI/npm downloads, Docker pulls, VS Code installs and rating.
- Community Sentiment: social media and forum text (Bluesky, Reddit, HN, Stack Overflow, GitHub Discussions, Mastodon, Dev.to, V2EX, Lemmy) processed with ensemble sentiment/NLP pipeline; engagement-weighted, sarcasm detection, platform credibility weighting.
- Ecosystem Health: contributors (log-normalized), GitHub issue close rate, days-since-last-update decay, VS Code rating, doc-depth proxy, enterprise-readiness composite.
Normalization and formulas:
- Benchmarks normalized to [0,1]; missing benchmarks receive neutral prior 0.5.
- Adoption metrics log-normalized with empirically chosen ceilings.
- Sentiment rescaled from mean composite into [0,1] via an affine transform (clamp(mean*2.5 + 0.5, 0, 1)).
- Ecosystem computed as weighted combination of contributor depth, close rate, recency, and VS Code rating.
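
The normalization rules above can be sketched as small helper functions. Only the neutral prior of 0.5 and the sentiment transform are quoted from the summary; the helper names, the percentage-style benchmark scaling, the `log1p` form of the log normalization, and any ceiling value passed in are assumptions of this sketch.

```python
import math

NEUTRAL_PRIOR = 0.5  # stated in the summary for agents with no published benchmark


def clamp(x, lo=0.0, hi=1.0):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def benchmark_score(raw, max_score=100.0):
    """Scale a published score to [0, 1]; None means no score is published.

    Dividing by max_score assumes a percentage-style benchmark.
    """
    if raw is None:
        return NEUTRAL_PRIOR
    return clamp(raw / max_score)


def log_normalize(count, ceiling):
    """Log-scale a count against a ceiling (the ceiling value is hypothetical)."""
    return clamp(math.log1p(count) / math.log1p(ceiling))


def sentiment_score(mean_composite):
    """Affine rescale quoted in the summary: clamp(mean * 2.5 + 0.5, 0, 1)."""
    return clamp(mean_composite * 2.5 + 0.5)
```

One consequence of the affine transform is that it saturates: any mean composite of 0.2 or higher maps to 1, and any mean of −0.2 or lower maps to 0, so only the band between those values is resolved.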
Validation strategy:
- Factor independence: Spearman correlations across full registry (n=50).
- Circularity-controlled predictive validity: compute a sub-composite from Benchmark and Sentiment only (excluding the GitHub-derived Adoption and Ecosystem factors); test it against GitHub, VS Code, and Stack Overflow adoption proxies that it does not aggregate, on the n=35 agents with public repos.
- Ranking divergence: exploratory descriptive comparison between composite and SWE-bench ranks on n=11 agents with SWE-bench Verified scores.
- Sensitivity: ±10 percentage point perturbations in factor weights; bootstrapped score intervals (1,000 resamples).
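
One way to read the weight-sensitivity check is sketched below. The summary does not say how the remaining weights are adjusted after a perturbation, so this sketch assumes a single factor is shifted by ±0.10 and the other three are rescaled proportionally so the weights still sum to 1. The agent names and factor scores are hypothetical, and the bootstrap step is not shown.

```python
DEFAULT_WEIGHTS = {"B": 0.35, "A": 0.25, "S": 0.20, "E": 0.20}


def perturb(weights, factor, delta):
    """Shift one weight by delta and rescale the others so the sum stays 1."""
    new_w = weights[factor] + delta
    if not 0.0 <= new_w <= 1.0:
        raise ValueError("perturbed weight must stay in [0, 1]")
    rest = sum(w for k, w in weights.items() if k != factor)
    scale = (1.0 - new_w) / rest
    return {k: new_w if k == factor else w * scale for k, w in weights.items()}


def rank_agents(scores_by_agent, weights):
    """Agent names ordered from highest to lowest composite score."""
    def ap(f):
        return sum(weights[k] * f[k] for k in weights)
    return sorted(scores_by_agent, key=lambda a: ap(scores_by_agent[a]), reverse=True)


# All +/-10 percentage point single-factor perturbations (8 weightings).
perturbed = [perturb(DEFAULT_WEIGHTS, k, d)
             for k in DEFAULT_WEIGHTS for d in (0.10, -0.10)]

# Hypothetical factor scores for two agents.
agents = {
    "agent_x": {"B": 0.90, "A": 0.20, "S": 0.55, "E": 0.40},
    "agent_y": {"B": 0.50, "A": 0.80, "S": 0.60, "E": 0.70},
}
baseline = rank_agents(agents, DEFAULT_WEIGHTS)
# Number of perturbed weightings under which the ordering changes.
flips = sum(rank_agents(agents, w) != baseline for w in perturbed)
```

Counting how many perturbed weightings change the ordering gives a simple stability summary; on a full registry the same loop could report rank correlation with the baseline instead of exact equality.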
Release: framework, pipeline, collected signals, scored texts, and evaluation harness released under CC BY 4.0.

Implications for AI Economics

Adoption ≠ capability: empirical decoupling of benchmark capability and adoption implies market outcomes (adoption, monetization, developer choice) depend on integration, packaging, community, and ecosystem factors beyond raw task performance.
- Economic analyses of agent markets must include non‑capability signals (distribution channels, tooling integrations, ecosystem health) when modeling diffusion, market share, or consumer surplus.
Measurement and valuation biases:
- Open-source presence and exposed signals materially influence measured adoption. Valuation approaches (e.g., market sizing, investment due diligence, M&A) that rely on public observables will systematically favor agents/companies that expose repositories, packages, or extensions.
- Conversely, closed‑API, proprietary agents may be under‑counted in such metrics; economists and investors should correct for this measurement boundary.
Continuous monitoring and dynamic pricing/product strategy:
- Frequent updates and short release cadences suggest static snapshots miss temporal dynamics. Continuous composites like AgentPulse can be used to study temporal adoption elasticities, the impact of new integrations or releases on demand, and optimal timing for pricing/promotions.
Platform and competition dynamics:
- Adoption and ecosystem health signals (contributors, issue closure, marketplace installs) can serve as leading indicators of platform stickiness and developer switching costs. Useful for modeling platform competition, network effects, and multi‑sided market value accrual.
Signaling, reputation, and externalities:
- Community sentiment (engagement-weighted), combined with benchmark scores, predicts external adoption proxies in the circularity-controlled test. Economists can study reputational dynamics and information externalities: how sentiment propagates adoption, how negative sentiment scales with user base (the modest negative Sentiment–Adoption correlation, ρ = −0.29), and strategic responses by firms.
Policy and procurement:
- Public procurement or regulation that wishes to select high-quality agentic tools should combine capability benchmarks and deployment signals to account for reliability, maintainability, and ecosystem support—dimensions crucial for operational risk assessments.
Research and empirical opportunities:
- Use AgentPulse outputs to study causal drivers of adoption (instrumental variables or natural experiments when integrations or releases occur), competition between open and closed models, labor impacts (e.g., plugin adoption vs. developer productivity), and the economics of developer tooling markets.
Caveats and recommended uses:
- AgentPulse is a methodological tool—not ground truth. Correlations are not causal. Closed-source measurement boundary and omission of pricing mean that users should reweight factors or combine AgentPulse with proprietary/financial data when making firm-level inferences.
- For valuation and policy decisions, supplement AgentPulse with private telemetry (if available) and qualitative assessment of pricing, SLAs, and contractual commitments.

Assessment

Paper Type: descriptive
Evidence Strength: medium. Uses multiple real-time public signals and reports statistically significant correlations with external adoption proxies, and includes a circularity-controlled test; however, sample sizes are small (n=50, n=35, n=11 subsets), selection of agents may be non-random, and results are correlational not causal.
Methods Rigor: medium. Methodologically careful in assembling diverse, time-varying signals, controlling for obvious circularity (excluding GitHub-derived signals in a validation test), and reporting correlation coefficients and p-values; but weighting/aggregation choices, robustness checks, sensitivity to signal selection/time window, and potential biases from platform-specific data are not fully resolved in the description.
Sample: 50 AI agents spanning 10 workload categories scored on 18 real-time signals drawn from GitHub, package registries, IDE marketplaces (e.g., VS Code installs), social platforms, and benchmark leaderboards; analyses include the full n=50 factor structure, a circularity-controlled validation on n=35 (excluding GitHub-derived signals), and an n=11 overlap with published SWE-bench scores.
Themes: adoption, productivity, innovation
Identification: No causal identification; validity assessed via correlational analyses and a circularity-controlled prediction test where a sub-composite (Benchmark+Sentiment) is used to predict external adoption proxies (GitHub stars, Stack Overflow volume, VS Code installs).
Generalizability:
- Small and potentially non-random sample of 50 agents may not represent the broader population of AI products
- Platform-specific signals (GitHub, VS Code, Stack Overflow, package registries) bias results toward developer-facing and open-source agents
- Time-varying 'real-time' signals may not be stable, limiting temporal generalizability
- Findings may not extend to closed-source, enterprise, or non-developer-facing AI systems
- Correlational design prevents causal claims about adoption drivers or economic impacts

Claims (11)

Claim	Direction	Confidence	Outcome	Details
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. Adoption Rate	negative	high	scope of measurement of static benchmarks (capability vs. deployment/adoption)	0.03
We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Adoption Rate	positive	high	AgentPulse composite and factor scores (Benchmark Performance, Adoption Signals, Community Sentiment, Ecosystem Health)	n=50 0.18
The four factors capture largely complementary information (n=50; ρ_max = 0.61 for Adoption-Ecosystem, all others |ρ| ≤ 0.37). Adoption Rate	mixed	high	inter-factor correlations (Adoption vs Ecosystem and other factor pairs)	n=50 ρ_max=0.61; all others |ρ| ≤ 0.37 0.18
A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars (ρ_s=0.52, p<0.01). Adoption Rate	positive	high	GitHub stars (external adoption proxy)	n=35 ρ_s=0.52, p<0.01 0.18
The Benchmark+Sentiment sub-composite predicts Stack Overflow question volume (ρ_s=0.49, p<0.01) in the circularity-controlled test (n=35). Adoption Rate	positive	high	Stack Overflow question volume (external adoption/engagement proxy)	n=35 ρ_s=0.49, p<0.01 0.18
The Benchmark+Sentiment sub-composite correlates with VS Code installs (ρ_s=0.44, p<0.05), reported as illustrative given that only 11 of 35 agents have non-zero installs. Adoption Rate	positive	high	VS Code installs (IDE install counts as adoption proxy)	n=35 ρ_s=0.44, p<0.05 (only 11 of 35 agents have non-zero installs) 0.09
On the n=11 subset with published SWE-bench scores, composite and benchmark-only rankings are nearly uncorrelated (ρ_s=0.25). Adoption Rate	null_result	high	rank correlation between composite ranking and benchmark-only ranking	n=11 ρ_s=0.25 0.09
Within that n=11 subset, 9 of 11 agents shift by at least 2 ranks between composite and benchmark-only rankings. Adoption Rate	mixed	high	count/proportion of agents with ≥2-rank shifts	n=11 9 of 11 agents shift by at least 2 ranks 0.09
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset. Adoption Rate	negative	high	Adoption-Capability correlation among closed-source high-capability agents	n=11 0.09
AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking. Adoption Rate	positive	high	presence of deployment/adoption signals not captured by standard benchmarks	0.18
The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0. Adoption Rate	positive	high	availability/license of framework and data	0.3

AgentPulse: a continuous scoring system shows public adoption and community signals reveal deployment-relevant information that static benchmarks miss; those signals predict GitHub stars, Stack Overflow activity and IDE installs even where capability benchmarks do not.