State-of-the-art LLMs sharply boost human task performance—GPT‑4o increases accuracy by ~29 percentage points and Llama‑3.1‑8B by ~23—yet these gains hinge on a distinct collaborative ability rooted in perspective-taking, not just raw problem‑solving skill.

Quantifying and Optimizing Human-AI Synergy: Evidence-Based Strategies for Adaptive Collaboration
Jonathan H. Westover · Fetched March 17, 2026 · Human Capital Leadership Review
Source: semantic_scholar · Type: correlational · Evidence: medium · Relevance: 8/10
Using a Bayesian IRT framework on benchmark data (n=667), the paper finds large human–AI synergy—GPT‑4o and Llama‑3.1‑8B raise human accuracy by about 29 and 23 percentage points respectively—and shows collaborative ability (distinct from individual skill) and Theory of Mind predict those gains.

The emergence of large language models (LLMs) has transformed human-machine interaction, yet evaluation frameworks remain predominantly model-centric, focusing on standalone AI performance rather than emergent collaborative outcomes. This article introduces a novel Bayesian Item Response Theory framework that quantifies human–AI synergy by separately estimating individual ability, collaborative ability, and AI model capability while controlling for task difficulty. Analysis of benchmark data (n=667) reveals substantial synergy effects, with GPT-4o improving human performance by 29 percentage points and Llama-3.1-8B by 23 percentage points. Critically, collaborative ability proves distinct from individual problem-solving ability, with Theory of Mind—the capacity to infer and adapt to others' mental states—emerging as a key predictor of synergy. Both stable individual differences and moment-to-moment fluctuations in perspective-taking influence AI response quality, highlighting the dynamic nature of effective human-AI interaction. Organizations can leverage these insights to design training programs, selection criteria, and AI systems that prioritize emergent team performance over standalone capabilities, marking a fundamental shift toward optimizing collective intelligence in human-AI teams.

Summary

Main Finding

A new Bayesian Item Response Theory (IRT) framework shows that human–AI collaboration produces large, measurable synergy effects that are distinct from individual ability and AI capability. In benchmark data (n = 667), GPT-4o increased human task performance by 29 percentage points and Llama-3.1-8B by 23 percentage points. Theory of Mind (perspective-taking) — both as a stable trait and a momentary state — is a key predictor of how much an individual benefits from AI, implying that emergent team performance is not reducible to standalone human or model metrics.

Key Points

  • Framework: Introduces a Bayesian IRT model that separately estimates (a) individual human ability, (b) collaborative ability (human skill at leveraging AI), and (c) AI model capability, while controlling for task difficulty.
  • Synergy magnitude: Substantial positive collaboration gains observed across benchmarks; GPT-4o and Llama-3.1-8B produced +29 and +23 percentage point improvements in human performance, respectively (see the sketch after this list for how such gains map onto the model's scale).
  • Distinct constructs: Collaborative ability is empirically distinct from individual problem-solving ability — high problem solvers are not necessarily those who get the biggest lift from AI.
  • Theory of Mind (ToM): The capacity to infer and adapt to others’ mental states predicts collaborative gains. Both stable individual differences in ToM and within-person, moment-to-moment fluctuations in perspective-taking affect the quality of AI-augmented responses.
  • Dynamic interaction: Human–AI performance is dynamic rather than fixed; transient cognitive states matter for the success of human–AI teams.
  • Practical takeaway: Optimizing human–AI teams requires interventions beyond improving standalone LLM accuracy — e.g., training, selection, and system design that prioritize emergent team performance.
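To make the synergy numbers concrete: an IRT model works on the logit scale, and percentage-point gains fall out as differences in predicted accuracy between assisted and unassisted trials. The back-of-the-envelope sketch below shows that conversion; every parameter value is invented for illustration and is not one of the paper's estimates.

```python
# Illustrative only: translate logit-scale ability/capability values into a
# percentage-point synergy (difference in predicted accuracy). The numbers
# below are made up, not the paper's posterior estimates.
from scipy.special import expit  # logistic function

theta = 0.2      # hypothetical individual ability
c = 0.5          # hypothetical collaborative ability
a_model = 0.9    # hypothetical AI capability (e.g., a GPT-4o-like model)
b = 0.0          # hypothetical item difficulty

p_alone = expit(theta - b)                   # predicted accuracy, unassisted
p_assisted = expit(theta + c + a_model - b)  # predicted accuracy, with AI
print(f"synergy ≈ {100 * (p_assisted - p_alone):.0f} percentage points")
# -> synergy ≈ 28 percentage points
```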

Data & Methods

  • Sample: Benchmark dataset with n = 667 human–AI interactions (details of tasks/participants not specified in summary).
  • Methodology:
    • Bayesian Item Response Theory extension (sketched in code after this section) that jointly models:
      • Task/item difficulty,
      • Human individual ability (baseline problem-solving),
      • Collaborative ability (propensity/effectiveness of leveraging AI),
      • AI model capability (separate parameter for each model).
    • Controls for task heterogeneity to isolate collaboration effects.
    • Incorporates measures of Theory of Mind and captures both between-person (stable) and within-person (momentary) variation.
  • Models evaluated: At least GPT-4o and Llama-3.1-8B were included; reported collaboration effect sizes are model-specific.
  • Outcomes: Percentage-point improvements in task performance when humans collaborate with LLMs; predictive analyses linking ToM measures to collaborative gains.
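For readers who want to see the moving parts, here is a minimal sketch of this kind of joint Bayesian IRT model in PyMC. The additive logit parameterization, the priors, and the data layout are our assumptions for illustration; the paper's exact specification may differ.

```python
# Minimal sketch (assumed specification, not the paper's): a Rasch-style model
# where assisted trials add a person-level collaborative ability and a
# model-level AI capability to the usual ability-minus-difficulty logit.
import numpy as np
import pymc as pm

# Simulated stand-in data: one row per response.
rng = np.random.default_rng(0)
n_persons, n_items, n_models, n_obs = 50, 40, 2, 667
person = rng.integers(0, n_persons, n_obs)
item = rng.integers(0, n_items, n_obs)
ai = rng.integers(-1, n_models, n_obs)       # -1 = unassisted trial
assisted = (ai >= 0).astype(int)
model_idx = np.maximum(ai, 0)                # dummy index for unassisted rows
correct = rng.integers(0, 2, n_obs)

with pm.Model() as irt:
    theta = pm.Normal("theta", 0.0, 1.0, shape=n_persons)  # individual ability
    c = pm.Normal("c", 0.0, 1.0, shape=n_persons)          # collaborative ability
    a = pm.Normal("a", 0.0, 1.0, shape=n_models)           # AI model capability
    b = pm.Normal("b", 0.0, 1.0, shape=n_items)            # item difficulty
    # Unassisted: logit p = theta_i - b_j
    # Assisted:   logit p = theta_i + c_i + a_m - b_j
    eta = theta[person] - b[item] + assisted * (c[person] + a[model_idx])
    pm.Bernoulli("y", p=pm.math.sigmoid(eta), observed=correct)
    idata = pm.sample(1000, tune=1000, target_accept=0.9)
```

Separating c from theta is what lets the framework say that collaborative ability is a distinct construct: two people with the same theta can have very different c.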

Implications for AI Economics

  • Complementarities and productivity:
    • Human–AI complementarities matter: value created by AI depends on users’ collaborative ability and momentary cognitive states, not just model accuracy (a toy illustration follows this section).
    • Economic gains from AI adoption may be under- or over-estimated if analyses use model-centric metrics alone.
  • Returns to training and selection:
    • Investing in worker training that builds perspective-taking and collaborative skills may yield high returns by increasing AI complementarities.
    • Hiring or assigning workers based on collaborative ability could improve organizational returns to AI investments.
  • Measurement and valuation:
    • Procurement and cost–benefit analyses should incorporate collaborative ability and expected synergy, not just benchmark standalone model performance.
    • Compensation and performance metrics might need redesign to capture emergent team output.
  • Dynamic labor effects:
    • Short-run productivity fluctuates with workers’ cognitive states; scheduling, interface design, and nudges that stabilize beneficial states could raise output.
    • Heterogeneous gains imply distributional labor impacts: some workers may benefit far more, affecting wage dispersion and task allocation.
  • Product and policy design:
    • Product teams should design interfaces and prompts that scaffold users’ Theory of Mind and perspective-taking to maximize synergy.
    • Regulators and policymakers evaluating AI impacts should consider team-level effects (training externalities, complementarity-driven displacement, productivity multipliers).
  • Research and evaluation:
    • Future economic studies should use team-centric evaluation frameworks (like the proposed Bayesian IRT) to estimate true returns to AI and to model adoption equilibria and labor reallocation.

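To make the complementarity point concrete, here is a deliberately toy calculation, not from the paper, of how the same AI tool yields different firm-level returns as workers' collaborative ability varies. Every number is invented.

```python
# Toy model (all numbers invented): firm-level ROI from an AI tool when the
# realized accuracy lift scales with a worker's collaborative ability.
model_lift_pp = 29.0   # headline lift for the average user, in pp (cf. GPT-4o)
value_per_pp = 50.0    # $ of annual value per percentage point, per worker
ai_cost = 500.0        # $ annual AI cost per worker

for label, collab in [("low", 0.4), ("average", 1.0), ("high", 1.6)]:
    realized_pp = model_lift_pp * collab   # assumed: lift scales with ability
    roi = (realized_pp * value_per_pp - ai_cost) / ai_cost
    print(f"{label:>7} collaborative ability: +{realized_pp:.0f}pp, ROI = {roi:.2f}x")
```

Under these made-up parameters, ROI ranges from roughly 0.2x to 3.6x across the ability distribution, which is the sense in which model-centric benchmarks alone can misstate the returns to adoption.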

Assessment

  • Paper Type: correlational
  • Evidence Strength: medium — The paper uses a novel Bayesian IRT to estimate separate components (individual ability, collaborative ability, model capability) and reports large, consistent performance gains on benchmark tasks with a reasonably large sample (n=667). However, findings are based on observational benchmark interactions rather than randomized field interventions, limiting causal claims and real-world external validity; results may depend on task selection, participant sampling, and particular model versions.
  • Methods Rigor: high — The methodological contribution is strong: a Bayesian Item Response Theory specification that jointly estimates item difficulty, individual ability, collaborative ability, and model capability addresses measurement confounding common in model-centric evaluations; the analysis appears to leverage appropriate probabilistic estimation and controls. Remaining concerns include potential measurement error in psychological predictors (e.g., Theory of Mind), unobserved confounders, and assumptions embedded in the IRT parameterization.
  • Sample: Benchmark dataset of 667 human–AI interactions across multiple standardized tasks/questions; humans completed problems both alone and in collaboration with multiple LLMs (reported results for GPT-4o and Llama-3.1-8B); task difficulty calibrated within the IRT model; sample includes individual-level psychometric measures (e.g., Theory of Mind/perspective-taking) and repeated trials to capture moment-to-moment fluctuation.
  • Themes: human_ai_collab, productivity, skills_training
  • Generalizability:
    • Benchmark tasks may not reflect complex, open-ended real-world workplace tasks
    • Participant pool/sample composition not described in detail — may not represent broader worker populations (e.g., age, occupation, education, cultural background)
    • Results depend on specific model versions (GPT-4o, Llama-3.1-8B) and will change as models evolve
    • Short-term, task-level interactions may not generalize to long-run team performance or organizational settings
    • Psychometric measures (Theory of Mind) and their operationalization may not transfer across contexts or languages

Claims (8)

  • Claim: The article introduces a novel Bayesian Item Response Theory framework that quantifies human–AI synergy by separately estimating individual ability, collaborative ability, and AI model capability while controlling for task difficulty.
    Outcome: Decision Quality — estimated parameters for individual ability, collaborative ability, AI model capability (model-derived measures) · Direction: null_result · Confidence: high (weight 0.3)
  • Claim: Analysis of benchmark data (n = 667) reveals substantial synergy effects: GPT-4o improves human performance by 29 percentage points.
    Outcome: Decision Quality — human task performance (accuracy, measured in percentage points) when assisted by GPT-4o versus unassisted · Direction: positive · Confidence: high (weight 0.3) · n = 667
  • Claim: Analysis of benchmark data (n = 667) reveals substantial synergy effects: Llama-3.1-8B improves human performance by 23 percentage points.
    Outcome: Decision Quality — human task performance (accuracy, measured in percentage points) when assisted by Llama-3.1-8B versus unassisted · Direction: positive · Confidence: high (weight 0.3) · n = 667
  • Claim: Collaborative ability is distinct from individual problem-solving ability.
    Outcome: Decision Quality — separability/distinctness of model parameters for collaborative ability versus individual problem-solving ability (latent ability estimates) · Direction: null_result · Confidence: medium (weight 0.18) · n = 667
  • Claim: Theory of Mind (the capacity to infer and adapt to others' mental states) emerges as a key predictor of synergy.
    Outcome: Decision Quality — synergy (performance improvement with AI assistance) predicted by Theory of Mind scores · Direction: positive · Confidence: medium (weight 0.18) · n = 667
  • Claim: Both stable individual differences and moment-to-moment fluctuations in perspective-taking influence AI response quality.
    Outcome: Decision Quality — AI response quality (as rated or measured) as a function of trait and state perspective-taking measures · Direction: positive · Confidence: medium (weight 0.18) · n = 667
  • Claim: Evaluation frameworks remain predominantly model-centric, focusing on standalone AI performance rather than emergent collaborative outcomes.
    Outcome: Other — focus of existing evaluation frameworks (model-centric emphasis versus collaborative outcome measurement) · Direction: negative · Confidence: medium (weight 0.18)
  • Claim: Organizations can leverage these insights to design training programs, selection criteria, and AI systems that prioritize emergent team performance over standalone capabilities, marking a shift toward optimizing collective intelligence in human-AI teams.
    Outcome: Team Performance — organizational practices (training, selection, system design) and expected impact on collective human-AI team performance (proposed, not directly measured in the study) · Direction: positive · Confidence: speculative (weight 0.03)
