The Commonplace

Leading LLMs disagree about which startups to back: GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2 yield systematically different evaluation scores, funding recommendations, and confidence levels on the same 20 pitch decks, and show wide variation in reliability (ICC range 0.24–0.93), implying model choice can materially shape investment outcomes.

Algorithmic personalities and the myth of neutrality: financial behavior of large language models in investment decision-making
Duang-kamol Buranasomphop, Shih-Wei Wu, Wei-Lun Chang · April 27, 2026 · Digital Finance
OpenAlex · descriptive · medium evidence · 7/10 relevance · DOI · Source · PDF
Across 20 real startup pitch decks and five repeated runs per deck, three leading LLMs produced systematically different funding recommendations, evaluation scores, and confidence levels, with reliability varying widely (ICC 0.24–0.93).

The rapidly growing use of large language models (LLMs) in high-stakes settings, such as venture capital screening, often relies on an implicit assumption that sufficiently advanced models will produce broadly comparable outputs. This study revisits that assumption and finds limited support for it. Using a controlled simulation design with three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2—we observe systematic and statistically significant differences in how investment evaluations are formed; each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages. To account for stochastic variation in outputs, each model–deck pair was evaluated five times under identical conditions, which allows us to distinguish one-off variation from more persistent behavioral tendencies. The results reveal consistent, reproducible differences across models in funding recommendations, evaluation scores, and expressed confidence. Reliability also varies substantially across models, with ICC values ranging from 0.240 to 0.930, suggesting that model performance is not only a matter of average behavior but also of the stability of that behavior under repeated evaluation. Three behavioral profiles emerge. GPT-4o can be characterized as a cautious allocator, combining relatively favorable evaluations with conservative funding decisions. DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding. Claude 3.5 Sonnet aligns with a narrative funder profile, showing greater responsiveness to qualitative aspects of the pitch, somewhat higher funding levels, and strong cross-run reliability. These findings indicate that different models embed different evaluation logics, and that these differences are large enough to shape outcomes in practice. Given the limited sample size, the results should be interpreted as exploratory. Even so, they point to the importance of incorporating reliability alongside average performance when assessing and deploying LLMs in high-stakes decision contexts.

Summary

Main Finding

Different large language models (GPT-4o, Claude 3.5 Sonnet, DeepSeek‑V2) produce systematically different investment evaluations in a controlled venture‑screening simulation. Differences are statistically significant and reproducible across repeated runs: models vary not only in average recommendations (funding amounts, scores, expressed confidence) but also in the stability of those recommendations (ICC reliability 0.240–0.930). Three distinct "financial personalities" emerge — cautious allocator (GPT‑4o), conservative scorer (DeepSeek‑V2), and narrative funder (Claude 3.5 Sonnet) — implying model choice materially shapes funding outcomes.

Key Points

  • Experimental evidence contradicts the “interchangeable LLM” assumption: model choice materially affects high‑stakes outputs.
  • Models differ along multiple margins: funding recommendation, evaluation scores, sensitivity to qualitative narrative, and expressed confidence.
  • Reliability matters: some models produce stable judgments across repeated runs (high ICC), others are noisy (low ICC). Reported ICC range: 0.240–0.930.
  • Behavioral archetypes identified:
    • GPT‑4o: favorable evaluations but conservative funding decisions — “cautious allocator.”
    • DeepSeek‑V2: stringent, highly consistent scoring but systematically lower funding — “conservative scorer.”
    • Claude 3.5 Sonnet: responsive to qualitative aspects, somewhat higher funding, strong cross‑run reliability — “narrative funder.”
  • Practical implication: average performance metrics alone are insufficient; operational stability and behavioral tendencies should be incorporated into model selection.
  • Caveat: sample is exploratory (20 real pitch decks); findings are suggestive rather than definitive.

Data & Methods

  • Design: simulation‑based experimental study using authentic startup pitch decks to preserve external validity while enabling controlled comparisons.
  • Models compared: GPT‑4o, Claude 3.5 Sonnet, DeepSeek‑V2.
  • Stimuli: 20 real startup pitch decks spanning industries and funding stages.
  • Repeats to separate noise from persistent tendencies: each model evaluated each deck five times under identical conditions, so stochastic output variation can be distinguished from stable behavioral tendencies.
  • Outputs collected: funding recommendation (likely numeric / binary), overall evaluation scores, expressed confidence, qualitative comments.
  • Analysis:
    • Statistical comparison across models to test systematic differences in recommendations and scores.
    • Reliability quantified via intraclass correlation coefficients (ICC) to measure stability across repeated runs (reported ICCs between 0.240 and 0.930); a computational sketch of both analysis steps follows this list.
  • Interpretation framed through a proposed “financial personality” framework that maps behavioral tendencies to archetypal profiles.
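
To make the two analysis steps concrete, the sketch below works through them on synthetic scores, since the paper's raw outputs are not reproduced here. The ICC(1,1) estimator is the standard one-way random-effects formula; the Friedman test is only one plausible choice for the cross-model comparison, as the digest does not specify which tests the authors ran, and all numbers are invented for illustration.

```python
# Minimal sketch: per-model reliability (ICC) and a cross-model comparison,
# on synthetic data shaped like the study (20 decks x 5 runs per model).
import numpy as np
from scipy.stats import friedmanchisquare

def icc_oneway(scores: np.ndarray) -> float:
    """ICC(1,1): stability of one model's scores across repeated runs on the same decks."""
    n, k = scores.shape                                   # n decks, k runs per deck
    deck_means = scores.mean(axis=1)
    grand_mean = scores.mean()
    # One-way ANOVA decomposition: between-deck vs. within-deck mean squares
    ms_between = k * np.sum((deck_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - deck_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

rng = np.random.default_rng(0)
decks, runs = 20, 5
deck_quality = rng.normal(6.5, 1.0, (decks, 1))           # shared "true" deck quality
# Hypothetical score matrices: shared deck quality plus a model-specific
# bias and model-specific run-to-run noise (all values illustrative).
models = {
    "gpt-4o":            deck_quality + 0.5 + rng.normal(0, 0.6, (decks, runs)),
    "claude-3.5-sonnet": deck_quality + 0.3 + rng.normal(0, 0.3, (decks, runs)),
    "deepseek-v2":       deck_quality - 0.6 + rng.normal(0, 0.2, (decks, runs)),
}

# Reliability: ICC across repeated runs, computed separately for each model
for name, scores in models.items():
    print(f"{name}: ICC(1,1) = {icc_oneway(scores):.3f}")

# Cross-model comparison: Friedman test on each model's mean score per deck
# (a nonparametric repeated-measures test over the same 20 decks)
per_deck_means = [scores.mean(axis=1) for scores in models.values()]
stat, p = friedmanchisquare(*per_deck_means)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")
```

With 20 decks and 5 runs, the per-model ICC quantifies how much of a model's score variance reflects stable deck-level judgments rather than run-to-run noise, which is what the reported 0.240–0.930 range summarizes.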

Implications for AI Economics

  • Model non‑neutrality can alter capital allocation: choosing different LLMs can systematically change which startups receive funding and how much, potentially affecting which technologies and firms scale.
  • Selection of foundation models is a strategic economic decision, not a purely technical procurement choice—firms should match model behavioral profiles to their organizational objectives (e.g., risk tolerance, desire for narrative sensitivity).
  • Market‑level effects: if many investors adopt the same model, its embedded biases/preferences could amplify selection effects and reduce diversity in funded ideas, shaping innovation trajectories.
  • Evaluation metrics for procurement and regulation should include reliability/stability (ICC or similar) alongside average accuracy/performance measures.
  • Governance and procurement recommendations:
    • Benchmark models on task‑specific behavioral tendencies, not just standard ML metrics.
    • Use reliability testing (repeated runs) to understand stochasticity and persistence of behaviors.
    • Consider ensembles or model mixes to diversify behavioral tendencies and mitigate single‑model distortions (a brief sketch follows this list).
    • Implement auditing, monitoring, and human‑in‑the‑loop checks for high‑stakes decisions.
  • Research and policy priorities: expand empirical evidence (larger sample sizes, more models, field deployments), quantify long‑run effects of model‑driven allocation on innovation ecosystems, and consider disclosure requirements for model‑driven decision systems in financial contexts.
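
As a rough illustration of the ensemble recommendation above, the sketch below aggregates funding recommendations by taking the median across repeated runs within each model and then the median across models. The aggregation rule and the funding figures are hypothetical; the paper does not evaluate any specific ensemble scheme.

```python
# Minimal sketch of a "model mix": blend funding recommendations from several
# LLMs (and repeated runs) instead of relying on a single model.
# Model names match the paper; the numbers are invented for the example.
import statistics

# Funding recommendations (e.g., USD thousands) per model, one value per run
recommendations = {
    "gpt-4o":            [500, 450, 525, 480, 510],
    "claude-3.5-sonnet": [650, 700, 640, 675, 660],
    "deepseek-v2":       [300, 310, 290, 305, 300],
}

# Step 1: collapse run-to-run noise within each model (median of repeated runs)
per_model = {m: statistics.median(runs) for m, runs in recommendations.items()}

# Step 2: diversify across models (median again) so no single model's
# systematic tendency to over- or under-fund dominates the blended figure
ensemble = statistics.median(per_model.values())

print(per_model)               # per-model "stable" recommendation
print(f"ensemble: {ensemble}")
```

Using a median rather than a mean keeps one systematically high- or low-funding model from pulling the blended recommendation toward its own bias.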

Limitations to bear in mind: exploratory study with 20 pitch decks; results indicate sizable effects but require larger, multi‑setting replication before they can be generalized with confidence.

Assessment

  • Paper Type: descriptive
  • Evidence Strength: medium — The study uses a controlled simulation with repeated runs and reports systematic, statistically significant differences and reliability metrics (ICCs), which provides credible descriptive evidence that LLMs differ in evaluative behavior; however, the small sample (20 decks), limited number of runs, lack of human or market benchmarks, and potential sensitivity to prompts/temperature limit how strongly the results can be generalized or taken as causal evidence of real-world effects.
  • Methods Rigor: medium — Design strengths include the use of multiple leading models, repeated evaluations to separate stochastic from persistent tendencies, and reporting of reliability (ICC) and statistical comparisons; limitations include a small, non-random sample of decks, unspecified prompt engineering and temperature controls, possible unreported preprocessing, the absence of counterfactuals or human/VC comparators, and a lack of robustness checks across more seeds, model versions, or deck sets.
  • Sample: Three LLMs (GPT-4o, Claude 3.5 Sonnet, DeepSeek-V2) each evaluated the same set of 20 real startup pitch decks spanning multiple industries and funding stages; each model produced repeated outputs (five runs per deck) under identical conditions, yielding ~100 outputs per model (300 total); recorded outputs included funding recommendation, numerical evaluation scores, and expressed confidence.
  • Themes: innovation adoption; human_ai_collab
  • Generalizability:
    • Small, non-random sample of 20 pitch decks limits representativeness across industries, stages, and business models.
    • Only three specific model versions were tested; results may not hold for other LLMs or later model updates.
    • Findings depend on prompt wording, temperature, and interface settings, which are not fully generalizable.
    • The simulated evaluation setting lacks interaction with human investors, market feedback, or downstream funding processes.
    • Results may not transfer to other high-stakes domains beyond VC screening (e.g., hiring, lending) without further validation.

Claims (10)

Each claim is listed with its outcome area, direction, confidence, the outcome measured, and details as reported.

  • The study used three leading models—GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2.
    Other · Direction: null_result · Confidence: high · Outcome: models evaluated (methodological) · Details: n=3; 0.3
  • Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages.
    Other · Direction: null_result · Confidence: high · Outcome: number of pitch decks evaluated (methodological) · Details: n=20; 0.18
  • To account for stochastic variation in outputs, each model–deck pair was evaluated five times under identical conditions.
    Other · Direction: null_result · Confidence: high · Outcome: number of repeated runs (methodological) · Details: n=5; 0.18
  • There are systematic and statistically significant differences across models in funding recommendations, evaluation scores, and expressed confidence.
    Decision Quality · Direction: mixed · Confidence: medium · Outcome: funding recommendations; evaluation scores; expressed confidence · Details: n=20; 0.11
  • Reliability (stability across repeated runs) varies substantially across models, with ICC values ranging from 0.240 to 0.930.
    Decision Quality · Direction: null_result · Confidence: high · Outcome: output reliability (ICC) · Details: n=20; ICC values ranging from 0.240 to 0.930; 0.18
  • GPT-4o can be characterized as a cautious allocator, combining relatively favorable evaluations with conservative funding decisions.
    Decision Quality · Direction: mixed · Confidence: medium · Outcome: evaluation scores and funding recommendations · Details: n=20; 0.11
  • DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding.
    Decision Quality · Direction: negative · Confidence: medium · Outcome: evaluation scores; funding recommendations; reliability · Details: n=20; 0.11
  • Claude 3.5 Sonnet aligns with a narrative funder profile, showing greater responsiveness to qualitative aspects of the pitch, somewhat higher funding levels, and strong cross-run reliability.
    Decision Quality · Direction: positive · Confidence: medium · Outcome: responsiveness to qualitative aspects; funding levels; reliability · Details: n=20; 0.11
  • Differences between models are large enough to shape outcomes in practice, so reliability should be incorporated alongside average performance when assessing and deploying LLMs in high-stakes decision contexts.
    Governance And Regulation · Direction: mixed · Confidence: high · Outcome: policy recommendation regarding assessment criteria (reliability + average performance) · Details: 0.03
  • Given the limited sample size, the results should be interpreted as exploratory.
    Other · Direction: null_result · Confidence: high · Outcome: interpretation caveat regarding sample size · Details: n=20; 0.03

Notes