AI firms selectively showcase benchmarks to shape perceptions rather than standardize measurement: 63% of highlighted benchmarks are used by only one builder, and many 'general knowledge' evaluations chiefly test STEM skills while being framed as broad progress toward AGI.

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Stefan Baack, Christo Buschek, Maty Bohacek · May 13, 2026

Source: arXiv · Paper type: descriptive · Evidence: medium · Relevance: 7/10

Model builders predominantly use highlighted benchmarks as narrative tools rather than standardized measurement: most benchmarks are used by a single builder and many appear in only one release, categories like 'general knowledge application' are vaguely defined and often rely on STEM tests, and results are framed to signal progress toward AGI rather than to demonstrate construct validity.

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

Summary

Main Finding

Model-release benchmark highlights function less as standardized measurement tools and more as flexible narrative devices that firms use to signal progress and position products. The industry’s public benchmarking landscape is fragmented and inconsistent: 63.2% of highlighted benchmarks are used by a single builder and 38.5% appear in only one release, benchmark authorship is increasingly industry-led, and the same benchmark is frequently attributed different competencies depending on the publisher’s narrative. This reduces cross-model comparability and creates incentives that favor marketing and competitive signaling over robust, generalizable scientific evaluation.

Key Points

Dataset & scope: Benchmarking-Cultures-25 compiles 231 unique benchmarks that were highlighted across 139 generative-AI model releases (2025) from 11 major model builders.
High fragmentation: 63.2% of highlighted benchmarks are used by a single model builder; 38.5% appear in only one model release.
Few widely-adopted benchmarks: only a small subset (e.g., GPQA Diamond, LiveCodeBench, AIME 2025) achieve broad uptake.
Industry-led benchmark production: 43.9% of benchmark authors are industry-affiliated overall; for benchmarks published in 2025 the share rises to about 49.3%, and Western builders show an even higher industry share (64.5% for 2025).
Inconsistent framing: model builders label the same benchmark with different competencies (e.g., LiveCodeBench presented as coding, reasoning, agent-related, or cost-efficiency), sometimes inconsistently across releases by the same builder.
Popular evaluations are narrow and STEM-biased: among the top 15 most-used benchmarks, 41.7% evaluate math; many “general knowledge” claims are operationalized via STEM/math tasks.
AGI rhetoric: many “general knowledge application” benchmarks deemphasize construct validity and are framed (explicitly or implicitly) as indicators of progress toward AGI.
Consequences for measurement quality: issues linked to Goodhart’s law, data contamination, static benchmark saturation, and lack of construct validity undermine benchmarks’ utility as objective comparators.
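The two fragmentation figures above are simple shares over the set of unique benchmarks. The following is a minimal sketch of how such shares could be computed, assuming highlights are available as (builder, release, benchmark) records. The records, builder names, and the function name `fragmentation_shares` are hypothetical and are not taken from the dataset or the authors' code.

```python
from collections import defaultdict

# Hypothetical toy records: (builder, release, benchmark).
# These are illustrative only and do not come from Benchmarking-Cultures-25.
highlights = [
    ("BuilderA", "A-1", "GPQA Diamond"),
    ("BuilderA", "A-2", "GPQA Diamond"),
    ("BuilderB", "B-1", "GPQA Diamond"),
    ("BuilderA", "A-1", "InternalEval-X"),
    ("BuilderA", "A-2", "InternalEval-X"),
    ("BuilderB", "B-1", "OneOffBench"),
    ("BuilderC", "C-1", "LiveCodeBench"),
    ("BuilderB", "B-1", "LiveCodeBench"),
]


def fragmentation_shares(records):
    """Return (share used by one builder, share appearing in one release)."""
    builders = defaultdict(set)  # benchmark -> builders that highlighted it
    releases = defaultdict(set)  # benchmark -> (builder, release) pairs
    for builder, release, benchmark in records:
        builders[benchmark].add(builder)
        releases[benchmark].add((builder, release))

    total = len(builders)  # number of unique benchmarks
    single_builder = sum(1 for b in builders.values() if len(b) == 1)
    single_release = sum(1 for r in releases.values() if len(r) == 1)
    return single_builder / total, single_release / total


single_builder_share, single_release_share = fragmentation_shares(highlights)
print(f"single builder: {single_builder_share:.1%}")
print(f"single release: {single_release_share:.1%}")
```

In this toy example two of four benchmarks are used by one builder and one of four appears in a single release. Note that a benchmark counted in the single-release share is necessarily also counted in the single-builder share, which is why the reported 38.5% is a subset of the 63.2%.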

Data & Methods

Data collected from public-facing release artifacts (press releases, company blog posts) for 139 generative model releases in 2025 from 11 top builders (industry labs + independent/research orgs; includes Western and Chinese builders).
Manual extraction of every benchmark explicitly mentioned in each primary release announcement, with normalization rules for metric variants and ambiguous dynamic benchmarks.
Resulting dataset: 231 unique benchmarks; extended schema includes models, benchmarks, highlights, affiliations, categories, categorizations, and knowledge subjects (seven data frames, 44 fields).
Taxonomy: inductively developed unified taxonomy of tested competencies by extracting what benchmark authors claim to measure; eight meta-categories and 22 subcategories used to annotate benchmarks (manual annotation, reviewed by coauthors).
Outputs: open-source dataset (Benchmarking-Cultures-25) and an interactive visualization tool for exploring benchmark–model relationships.
Key limitations: single-year (2025) snapshot; focus only on public release artifacts (excluding model cards, docs, internal evaluations); taxonomy annotations were primarily done by a single author (though reviewed), and broader qualitative coverage of categories was limited.
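The count of unique benchmarks depends on how name and metric variants are normalized before counting. The sketch below illustrates one way such a normalization step could look. The alias table, the metric-suffix pattern, and the function names are assumptions made for illustration; the paper's actual normalization rules are not reproduced here.

```python
import re

# Hypothetical alias table mapping spelling variants to one canonical name.
# The real rules used for the dataset may differ.
ALIASES = {
    "gpqa diamond": "GPQA Diamond",
    "gpqa-diamond": "GPQA Diamond",
    "aime 2025": "AIME 2025",
    "aime'25": "AIME 2025",
}

# Assumed pattern for a trailing metric variant such as "(pass@1)" or "(avg@32)".
METRIC_SUFFIX = re.compile(r"\s*\((?:pass|avg|maj|cons)@\d+\)\s*$", re.IGNORECASE)


def normalize(name):
    """Strip a trailing metric variant, then map known aliases to one name."""
    stripped = METRIC_SUFFIX.sub("", name).strip()
    key = re.sub(r"\s+", " ", stripped).lower()
    return ALIASES.get(key, stripped)


def unique_benchmarks(mentions):
    """Return the sorted list of distinct benchmarks after normalization."""
    return sorted({normalize(m) for m in mentions})


# Hypothetical raw mentions as they might appear in release announcements.
raw_mentions = [
    "GPQA Diamond",
    "GPQA-Diamond (pass@1)",
    "AIME'25",
    "AIME 2025 (avg@32)",
    "LiveCodeBench",
]
print(unique_benchmarks(raw_mentions))
```

Treating metric variants as the same benchmark lowers the unique count and raises apparent reuse, so a choice like this one directly affects the fragmentation shares reported above.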

Implications for AI Economics

Market signaling and competitive positioning
- Benchmarks are a low-cost, high-impact channel for firms to signal superiority and differentiate products. Because highlighted benchmarks are curated and narrative-driven, firms can strategically select or frame benchmarks to influence buyer perception, developer adoption, and investor sentiment.
- Fragmented and inconsistent benchmark usage raises search and comparison costs for buyers (enterprises, API users, regulators), weakening price competition based on verifiable capability metrics and increasing reliance on brand/reputation.
Incentives, rent-seeking, and resource allocation
- Industry-produced benchmarks and the rapid industry uptake of self-authored metrics create incentives to optimize for favorable public metrics (Goodhart effects). Firms may prioritize features and training that improve marketed benchmark outcomes over investments that yield real-world utility, potentially misallocating R&D resources across the sector.
- The prevalence of benchmarking as a performative marketing device may amplify winner-takes-most dynamics: early adopters of favorable benchmarks can capture disproportionate mindshare, which attracts funding and talent and builds market power in a self-reinforcing cycle.
Investment, procurement, and contracting risks
- Purchasers (enterprises, governments) relying on release benchmarks for procurement or vendor selection may make suboptimal or risky choices because benchmarks are neither standardized nor necessarily indicative of field performance. This increases the value of independent third-party evaluation in procurement decisions.
- Investors and financial analysts who use publicized benchmark performance to value firms or products face greater information asymmetry and greater downside risk from overstated claims.
Regulation and standardization pressures
- Fragmentation and marketing-driven benchmarking strengthen the case for neutral, standardized evaluation frameworks (and independent testing bodies) to reduce information asymmetries and enable more comparable assessments for competition policy, procurement, and safety regulation.
- Policy instruments could include requirements for disclosure (benchmark definitions, test-set provenance, contamination checks), third-party audits, or standardized evaluation suites for certain procurement classes (e.g., sensitive enterprise or government use).
Potential externalities and public-goods failure
- Because benchmarking content and standards often have public-good characteristics, the shift toward industry-authored benchmarks risks underprovision of broadly credible, neutral benchmarks. Collective action (industry consortia, academic–public partnerships, or standards bodies) may be needed to produce robust, continuously updated benchmarks.
Practical recommendations for economic actors
- For firms: invest in transparent benchmarking practices that emphasize construct validity and contamination controls—this can reduce reputational risk and improve comparability.
- For buyers/investors: prefer independent or standardized benchmark reports; treat single-vendor highlighted metrics with caution and demand traceability.
- For regulators and standard-setters: promote governance mechanisms (disclosures, independent testing, dynamic benchmark suites) to restore comparability and reduce gaming incentives.
- For the research/public sector: fund and maintain public benchmark infrastructure (dynamic, contamination-aware) to counterbalance proprietary, narrative-driven metrics.

Overall, the paper shows that benchmarking practices in 2025 function as strategic market instruments as much as evaluation tools. For AI economics, this implies distorted signals in markets for models, greater information asymmetry for purchasers and investors, and a growing need for neutral benchmarking institutions and policy interventions to align incentives toward robust, socially useful evaluation.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper assembles a novel, open dataset (231 benchmarks across 139 2025 releases from 11 major builders) and reports clear descriptive statistics and qualitative coding, giving credible evidence about benchmarking practices; however, it is observational, limited to publicly highlighted benchmarks, and cannot establish causal claims about impacts on markets or adoption.
Methods Rigor: medium. Methods combine systematic collection of press releases/blog posts, quantitative counts, a unified taxonomy, and qualitative coding. This is appropriate for descriptive aims and reproducible via released data and tools, but the coding/taxonomy choices are inherently interpretive, coverage is constrained to selected builders and 2025 releases, and some selection and measurement biases remain.
Sample: Open-source dataset of 231 benchmarks called out in 139 model releases during 2025 from 11 major AI builders (links to dataset and interactive tool provided); includes benchmark names, counts of appearances across builders/releases, authors' claimed measurement signals, mapped taxonomy categories, and qualitative notes on framing and construct validity.
Themes: governance, adoption
Generalizability:
- Limited to benchmarks highlighted publicly in press releases and blog posts (excludes internal evals, third-party usage, or non-highlighted benchmarks).
- Restricted to 2025 releases and 11 major builders; may miss smaller labs, later/earlier years, or regional differences.
- English-language and public-communications bias likely; non-public evaluation practices and technical reports not fully captured.
- Taxonomy and qualitative coding include subjective judgments that could vary with different coders or alternative frameworks.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks.	Adoption Rate	mixed	high	medium of public evaluation (peer-reviewed literature vs press releases/company blog posts)	0.18
We introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data.	Other	positive	high	size and coverage of the released dataset	n=231 231 benchmarks; 139 model releases; 11 builders 0.3
The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.	Adoption Rate	negative	high	degree of cross-model benchmark reuse (benchmarks per builder)	n=231 63.2% 0.3
38.5% of highlighted benchmarks appear in just one release.	Adoption Rate	negative	high	durability/reuse of benchmarks across releases	n=231 38.5% 0.3
Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).	Adoption Rate	neutral	high	frequency of benchmark highlighting across builders/releases	n=231 0.18
Benchmarks are attributed different competencies by different builders, depending on their narrative.	Research Productivity	mixed	high	consistency of competency attributions across builders	n=139 0.18
We develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure.	Other	positive	high	harmonization/taxonomy of benchmark labels	0.18
"General knowledge application" is the second most popular category among highlighted benchmarks, yet it is vaguely defined.	Adoption Rate	mixed	high	frequency/popularity of taxonomy categories (rank of 'General knowledge application')	n=231 second most popular category 0.18
Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.	Research Productivity	negative	high	degree of attention to construct validity vs AGI-framing in benchmark narratives	0.18
Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).	Research Productivity	negative	high	topical focus of benchmark content (STEM/math prevalence) versus stated measurement claims	n=231 0.18
Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.	Market Structure	negative	high	primary function of highlighted benchmarks (standardized measurement vs narrative/marketing device)	0.18