How researchers pay participants shapes what we learn about human–AI teamwork: inconsistent or misaligned incentives bias measured effort, trust and accuracy, so the authors propose a practical Incentive‑Tuning Framework to calibrate and transparently report pay schemes across studies.
AI has revolutionised decision-making across many fields, yet human judgement remains paramount for high-stakes decisions. This has fueled explorations of collaborative decision-making between humans and AI systems that aim to leverage the strengths of both. To study this dynamic, researchers conduct empirical studies investigating how humans use AI assistance and how this collaboration affects outcomes. A critical aspect of these studies is the participants, often recruited through crowdsourcing platforms. The validity of the studies hinges on participants' behaviour, so incentives that can shape this behaviour are a key part of study design and execution. In this work, we address the critical role of incentive design in empirical human-AI decision-making studies, focusing on understanding, designing, and documenting incentive schemes. Through a thematic review of existing research, we examined current practices, challenges, and opportunities in incentive design for such studies. We identified recurring themes, including what components make up an incentive scheme, how researchers manipulate incentive schemes, and the impact they can have on research outcomes. Building on this understanding, we curated a set of guidelines, the Incentive-Tuning Framework, that outlines how researchers can undertake, reflect on, and document the incentive design process. By advocating a standardised yet flexible approach to incentive design and contributing practical tools alongside these insights, we hope to pave the way for more reliable and generalizable knowledge in human-AI decision-making research.
Summary
Main Finding
A thematic review of empirical human–AI decision-making studies shows that incentive design critically shapes participant behaviour and thus the validity and generalizability of results. The authors synthesize recurring themes in how incentives are constructed and manipulated, demonstrate their effects on outcomes (effort, reliance on AI, accuracy), and propose the Incentive-Tuning Framework: a practical, standardised-yet-flexible guideline to design, calibrate, and document incentive schemes for human–AI studies.
Key Points
- Motivation
- Human judgement remains essential for high-stakes decisions; researchers study human–AI collaboration to leverage complementary strengths.
- Participant behaviour on crowdsourcing platforms strongly affects empirical findings, so incentive schemes are a central design lever.
- What makes up an incentive scheme (see the code sketch after this list)
- Monetary payments (base pay, performance bonuses, penalties).
- Non-monetary motivators (feedback, reputation, gamification, certification).
- Task framing, timing, and feedback frequency (how outcomes are communicated).
- Alignment between reward structure and the target behaviour (accuracy, speed, risk preferences, carefulness).
- How researchers manipulate incentives
- Varying stake size and bonus schedules.
- Framing rewards as individual vs. group/competition.
- Introducing explicit costs for errors or rewards for agreement with AI.
- Using post-hoc performance bonuses vs. immediate feedback-based incentives.
- Impact on study outcomes
- Incentives influence effort, attention, strategic reporting, risk-taking, and willingness to rely on AI advice.
- Poorly aligned incentives can produce biased estimates of human–AI complementarity (over/under-reliance, miscalibrated trust).
- Heterogeneous incentive effects across populations (crowdworkers vs. experts) reduce external validity if not accounted for.
- Proposed remedy
- Incentive-Tuning Framework: a set of steps and documentation practices to diagnose, design, pilot, calibrate, and report incentive schemes to improve internal and external validity.
- Encourages standardized reporting so results are comparable and replicable.
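To make the components and manipulations above concrete, here is a minimal sketch of an incentive scheme expressed as a configuration object. The `IncentiveScheme` type, its field names, and all dollar amounts are illustrative assumptions, not artifacts from the paper; the Incentive-Tuning Framework is a set of guidelines, not code.

```python
from dataclasses import dataclass, field

@dataclass
class IncentiveScheme:
    """Illustrative container for the incentive components the review identifies."""
    base_pay_usd: float                 # guaranteed completion payment
    bonus_per_correct_usd: float = 0.0  # performance-contingent reward
    penalty_per_error_usd: float = 0.0  # explicit cost for errors
    group_framing: bool = False         # individual vs. group/competition framing
    feedback: str = "post_hoc"          # "immediate" vs. "post_hoc" outcome feedback
    non_monetary: list[str] = field(default_factory=list)  # e.g., reputation, gamification

# Two common manipulations from the list above: raising stake size,
# and switching from individual to competitive framing.
low_stakes = IncentiveScheme(base_pay_usd=2.00, bonus_per_correct_usd=0.05)
high_stakes = IncentiveScheme(base_pay_usd=2.00, bonus_per_correct_usd=0.25,
                              penalty_per_error_usd=0.10)
competitive = IncentiveScheme(base_pay_usd=2.00, bonus_per_correct_usd=0.10,
                              group_framing=True, feedback="immediate",
                              non_monetary=["leaderboard"])
```

Treating the scheme as an explicit object like this also supports the framework's documentation goal: every component is named, comparable across conditions, and reportable.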
Data & Methods
- Approach
- The paper conducts a thematic (qualitative) review of existing empirical human–AI decision-making studies with a focus on how incentives are designed and reported.
- The authors extract recurring patterns/themes across studies, note manipulations and consequences, and synthesize practical guidance (a toy version of this coding step is sketched after this list).
- Outputs
- Identification of key incentive components and common manipulations.
- A curated guideline (Incentive-Tuning Framework) detailing steps for designing, piloting, and documenting incentive schemes.
- Limitations (reported / implicit)
- The review is thematic and qualitative rather than a meta-analysis with pooled quantitative effect sizes.
- The framework is synthesis-driven and may require empirical validation across more tasks, populations, and domains.
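The extraction step can be pictured as coding each study for the incentive components it reports and tallying recurring patterns. The corpus below is entirely hypothetical; it only illustrates the kind of thematic tally the review performs, not its actual data.

```python
from collections import Counter

# Hypothetical coded corpus: each study is annotated with the incentive
# components its methods section reports (labels are illustrative).
coded_studies = [
    {"id": "S01", "components": ["base_pay", "accuracy_bonus"]},
    {"id": "S02", "components": ["base_pay"]},
    {"id": "S03", "components": ["base_pay", "accuracy_bonus", "error_penalty"]},
    {"id": "S04", "components": ["base_pay", "gamification"]},
]

counts = Counter(c for study in coded_studies for c in study["components"])
for component, n in counts.most_common():
    print(f"{component}: {n}/{len(coded_studies)} studies")
```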
Implications for AI Economics
- Validity of economic estimates
- Incentive design affects measured behaviour (effort, risk aversion, trust), which in turn biases estimates of key economic parameters (e.g., willingness to adopt AI, productivity gains, error rates). Careful incentive design is necessary to produce reliable inputs for cost–benefit and welfare analyses.
- External and policy relevance
- Standardised reporting and calibrated incentives enhance comparability across studies, improving the evidence base for policy, regulation, and procurement decisions involving AI-assisted decision-making.
- Experimental and market design
- For mechanism and market designers, understanding how incentives interact with AI advice helps predict strategic responses, design contracts, and set optimal compensation structures in AI-augmented workplaces.
- Research practice
- Economists running lab or online experiments should (a) explicitly align incentives with target outcomes, (b) pilot stake levels to avoid under- or over-incentivisation (see the sketch after this list), (c) report incentive components transparently, and (d) consider population heterogeneity when generalising results.
- Future directions
- Empirical validation of the Incentive-Tuning Framework across domains (medical, financial, legal) and participant pools to quantify how different schemes shift measured AI complementarities and welfare outcomes.
- Incorporation of incentive effects into models of technology adoption and labor market impacts of AI.
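Point (b) above, piloting stake levels, often reduces to simple arithmetic: given pilot estimates of accuracy and completion time, solve for the bonus that yields a target effective hourly wage. The function below is a back-of-the-envelope sketch under those assumptions, not part of the Incentive-Tuning Framework itself, and all figures are illustrative.

```python
def bonus_for_target_wage(base_pay_usd: float, n_trials: int, p_correct: float,
                          minutes_per_task: float, target_hourly_usd: float) -> float:
    """Per-correct bonus needed so expected total pay matches a target hourly wage.

    Assumes independent trials and pilot estimates of accuracy (p_correct)
    and completion time (minutes_per_task); all values are hypothetical.
    """
    target_total = target_hourly_usd * minutes_per_task / 60.0
    expected_correct = n_trials * p_correct
    return max(0.0, (target_total - base_pay_usd) / expected_correct)

# Pilot: 20 trials, ~70% accuracy, ~15 minutes; target $12/hour effective wage.
bonus = bonus_for_target_wage(2.00, 20, 0.70, 15.0, 12.00)
print(f"Set the per-correct bonus to about ${bonus:.2f}")
```

Re-running this with a range of pilot accuracies quickly reveals whether a candidate scheme under- or over-pays relative to the intended stake level.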
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| AI has revolutionised decision-making across various fields. | Decision Quality | positive | medium (0.14) | degree/extent of AI adoption and impact on decision-making processes (general, literature-level) | Literature-level claim that AI has transformed decision-making across fields |
| Human judgement remains paramount for high-stakes decision-making. | Decision Quality | positive | medium (0.14) | reliance on human judgement in high-stakes decisions (conceptual/literature-level) | Conceptual claim: human judgement remains paramount for high-stakes decisions |
| Researchers conduct empirical studies investigating how humans use AI assistance for decision-making and how this collaboration impacts results. | Decision Quality | neutral | high (0.24) | human behaviour and decision outcomes when assisted by AI (empirical study outcomes) | Statement about empirical research into human use of AI assistance and impacts on decisions |
| A critical aspect of conducting human–AI decision-making studies is the role of participants, often recruited through crowdsourcing platforms. | Research Productivity | neutral | high (0.24) | participant recruitment source (e.g., crowdsourcing) and its influence on study validity/behaviour | Observation: participants in human–AI studies are often recruited via crowdsourcing platforms |
| The validity of human–AI decision-making studies hinges on participants' behaviours; effective incentives can potentially affect these behaviours. | Research Productivity | mixed | high (0.24) | participant behaviour (engagement, effort, strategy) and resulting study validity/measurement quality | Argument: participant behaviour affects study validity and can be influenced by incentives |
| Through a thematic review of existing research, the authors identified recurring themes about incentive schemes: their components, how researchers manipulate them, and their impact on research outcomes. | Research Productivity | neutral | high (0.24) | themes in incentive design practices and reported impacts on empirical study outcomes | Thematic review identifying recurring themes about incentive schemes and their impact on outcomes |
| The authors curated a set of guidelines called the Incentive-Tuning Framework to aid researchers in designing effective incentive schemes for human–AI decision-making studies. | Research Productivity | positive | high (0.24) | guidance for incentive design (qualitative artifact intended to influence study design quality) | Creation of the Incentive-Tuning Framework to guide incentive design in human–AI decision-making studies |
| Adopting a standardised yet flexible approach to incentive design can help produce more reliable and generalizable knowledge in human–AI decision-making research. | Research Productivity | positive | medium (0.14) | reliability and generalizability of findings from human–AI decision-making studies | Claim that standardized, flexible incentive design improves reliability and generalizability of research findings |