Evaluation should move beyond model accuracy to measure whether human-AI teams are ready to collaborate: the paper offers a four-part taxonomy and trace-based metrics to assess outcomes, reliance, safety signals, and learning, so organizations can better calibrate reliance, recover from errors, and govern AI-assisted decisions.
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. Yet, evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time, and connect these metrics to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. By operationalizing evaluation through interaction traces rather than model properties or self-reported trust, our framework enables deployment-relevant assessment of calibration, error recovery, and governance. We aim to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Summary
Main Finding
The paper argues that evaluating AI for deployment should shift from model-centered accuracy metrics to team-centered measures of "readiness" for safe, effective human-AI collaboration. It proposes a measurement framework — operationalized from interaction traces — with a four-part taxonomy (outcomes, reliance behavior, safety signals, learning over time) tied to an Understand–Control–Improve (U‑C‑I) onboarding lifecycle. This framework enables deployment-relevant assessment of calibration, error recovery, and governance, supporting more comparable benchmarks and cumulative research on human-AI readiness.
Key Points
- Problem: Traditional evaluations emphasize standalone model accuracy and self‑reported trust, missing real-world failures that arise from miscalibrated reliance (overreliance when AI is wrong, underuse when it is helpful).
- Core proposal: A measurement framework centered on team readiness rather than model properties, using interaction traces (logs of human actions, AI outputs, timing) as primary data.
- Four-part taxonomy of metrics:
  - Outcomes: team-level performance (accuracy, utility, cost), distributional effects, fairness, and downstream harms.
  - Reliance behavior: delegation/override rates, conditional reliance given AI correctness/confidence, calibration of human reliance to model competence, false accept/reject rates.
  - Safety signals: near misses, error recovery rates, time-to-correction, escalation/fallback usage, incidence of high-severity mistakes.
  - Learning over time: changes in human reliance and skill (adaptation rates), retention of corrective behaviors, drift in team performance across longitudinal deployment.
- Lifecycle alignment: Metrics map to the Understand–Control–Improve (U‑C‑I) onboarding stages:
  - Understand: diagnose current human-AI interaction patterns and failure modes.
  - Control: set governance, decision thresholds, escalation rules, and interface affordances to manage behavior.
  - Improve: evaluate training, interface changes, and model updates via longitudinal metrics.
- Measurement emphasis: Prefer behavioral traces over self-reported trust; focus on actionable signals (when humans misattribute errors, how quickly they recover, how reliance responds to model confidence).
- Goal: Enable deployment-relevant benchmarks, reproducible comparisons across studies, and cumulative progress on safe collaboration.
Data & Methods
- Data sources: interaction traces (time-stamped logs of AI outputs, human actions, overrides, queries, corrections), task metadata, outcome labels, and contextual variables (user expertise, workload).
- Experimental designs recommended:
  - Randomized trials (e.g., assist vs. no-assist, interface A vs. B).
  - Within-subject and between-subject studies, including counterbalanced designs.
  - Longitudinal field studies to capture learning, drift, and adaptation.
- Operationalized metrics (examples and computations):
  - Calibration of reliance: P(human accepts AI | AI correct) vs. P(human accepts AI | AI incorrect); the calibration gap measures misaligned reliance.
  - Conditional dependence: reliance rate conditional on reported model confidence or uncertainty.
  - Recovery metrics: time-to-detection, time-to-correction, proportion of errors corrected before downstream harm.
  - Delegation dynamics: frequency and timing of delegations, override latency, escalation path usage.
  - Team utility: combined cost/benefit functions (e.g., correct decisions minus cost of errors, time savings).
  - Learning curves: mixed-effects or hierarchical models to estimate change in individual reliance over sessions; survival analysis for persistence of corrective behavior.
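Two of the operationalized metrics above, the reliance-calibration gap and the recovery metrics, can be computed directly from interaction traces. The sketch below uses an illustrative trace schema (`TraceEvent` and its field names are assumptions, not from the paper):

```python
# Sketch of trace-based readiness metrics. The TraceEvent schema and field
# names are illustrative assumptions, not the paper's specification.
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class TraceEvent:
    ai_correct: bool                       # ground truth for the AI's recommendation
    human_accepted: bool                   # did the human adopt the recommendation?
    t_error: Optional[float] = None        # time an error entered the workflow
    t_corrected: Optional[float] = None    # time it was corrected (None = never)

def reliance_calibration_gap(traces):
    """P(accept | AI correct) - P(accept | AI incorrect).

    A well-calibrated team shows a large positive gap: it accepts the AI when
    the AI is right and overrides it when the AI is wrong.
    """
    acc_when_correct = [t.human_accepted for t in traces if t.ai_correct]
    acc_when_incorrect = [t.human_accepted for t in traces if not t.ai_correct]
    return mean(acc_when_correct) - mean(acc_when_incorrect)

def recovery_metrics(traces):
    """Mean time-to-correction and fraction of errors corrected."""
    errors = [t for t in traces if t.t_error is not None]
    corrected = [t for t in errors if t.t_corrected is not None]
    ttc = mean(t.t_corrected - t.t_error for t in corrected) if corrected else float("nan")
    frac = len(corrected) / len(errors) if errors else float("nan")
    return ttc, frac

# Toy trace: three correct AI recommendations, two incorrect ones.
traces = [
    TraceEvent(True, True), TraceEvent(True, True), TraceEvent(True, False),
    TraceEvent(False, True, t_error=0.0, t_corrected=4.0),
    TraceEvent(False, False),
]
gap = reliance_calibration_gap(traces)   # 2/3 - 1/2, a modest positive gap
ttc, frac = recovery_metrics(traces)     # mean time-to-correction, fraction corrected
```

A larger study would stratify these quantities by user expertise and task type (the contextual variables listed under data sources) before comparing conditions.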
- Analysis methods:
  - Causal inference techniques to estimate intervention effects (instrumental variables, randomized assignment).
  - Mixed-effects models for longitudinal and clustered data.
  - Calibration curves and Brier scores adapted to human acceptance outcomes.
  - Policy evaluation metrics for governance rules (false positive/negative tradeoffs in delegation thresholds).
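Adapting calibration curves and Brier scores to acceptance outcomes can be done by treating the AI's reported confidence as a forecast of the binary acceptance outcome. A minimal sketch, under that assumption (the function names and toy data are illustrative, not from the paper):

```python
# Sketch: Brier score and calibration curve over human-acceptance outcomes,
# with AI confidence treated as the forecast probability. Illustrative only.

def brier_score(confidences, accepted):
    """Mean squared error between AI confidence and binary acceptance (lower is better)."""
    assert len(confidences) == len(accepted)
    return sum((c - float(a)) ** 2 for c, a in zip(confidences, accepted)) / len(accepted)

def calibration_curve(confidences, accepted, n_bins=5):
    """(mean confidence, acceptance rate) per confidence bin, for a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for c, a in zip(confidences, accepted):
        idx = min(int(c * n_bins), n_bins - 1)   # clamp c == 1.0 into the top bin
        bins[idx].append((c, a))
    return [
        (sum(c for c, _ in b) / len(b), sum(a for _, a in b) / len(b))
        for b in bins if b
    ]

conf = [0.9, 0.8, 0.6, 0.3, 0.2]   # AI-reported confidence per decision
acc  = [1,   1,   0,   0,   1]     # did the human accept the recommendation?
score = brier_score(conf, acc)     # 0.228 on this toy data
curve = calibration_curve(conf, acc)
```

Plotting `curve` against the diagonal shows where human acceptance over- or under-tracks the model's stated confidence, which is the miscalibrated-reliance signal the framework targets.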
- Practical recommendations for data collection:
  - Log granular actions and timestamps, capture AI confidence/uncertainty signals, annotate ground-truth outcomes, and record user context and expertise.
  - Avoid relying solely on self-report measures of trust; treat them as complements to behavioral traces.
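The policy-evaluation idea, false positive/negative tradeoffs in delegation thresholds, can be sketched as a simple threshold sweep. The delegation rule, data, and names below are illustrative assumptions; the paper does not prescribe a specific rule:

```python
# Sketch: sweep a delegation threshold t under the rule "delegate to the AI
# iff its reported confidence >= t" and tabulate the governance tradeoff.
# Rule, names, and data are illustrative, not the paper's method.

def threshold_tradeoffs(events, thresholds):
    """events: (ai_confidence, ai_correct) pairs.

    Returns {t: (false_accept_rate, false_reject_rate)} where a false accept
    is delegating when the AI is wrong, and a false reject is keeping the
    decision human when the AI was right.
    """
    out = {}
    n_wrong = sum(1 for _, ok in events if not ok) or 1
    n_right = sum(1 for _, ok in events if ok) or 1
    for t in thresholds:
        false_accepts = sum(1 for c, ok in events if c >= t and not ok)
        false_rejects = sum(1 for c, ok in events if c < t and ok)
        out[t] = (false_accepts / n_wrong, false_rejects / n_right)
    return out

events = [(0.95, True), (0.9, True), (0.7, False), (0.6, True), (0.4, False)]
tradeoffs = threshold_tradeoffs(events, [0.5, 0.8])
# Raising the threshold trades false accepts for false rejects; governance
# picks the operating point given the relative cost of each error type.
```

In a deployment, the two rates would be weighted by domain-specific error costs (the team-utility functions above) before fixing the escalation threshold.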
Implications for AI Economics
- Valuation and procurement: Firms and buyers should value AI systems not only by model accuracy but by their measured impact on team readiness and productivity (e.g., net utility, error-recovery savings). Procurement decisions ought to include readiness metrics and longitudinal performance guarantees.
- Labor substitutability and complementarities: Readiness metrics quantify how AI changes worker productivity and error rates over time, informing models of task allocation, complementarity, and wage effects. Miscalibration costs (errors, recovery time) affect net gains from automation.
- Investment in onboarding and training: The framework highlights measurable returns to investments in human onboarding (Understand–Control–Improve). Economists can model optimal training investments, pricing of AI systems bundled with human training, and tradeoffs between model improvement vs. human training.
- Adoption dynamics and market structure: Standardized readiness benchmarks lower information frictions, enabling more comparable competition across vendors and better matching between AI products and firm needs. Transparent readiness metrics can accelerate adoption in sectors with high safety requirements (healthcare, finance) by reducing uncertainty about downstream risks.
- Regulation, liability, and insurance: Regulators can require readiness evaluations as part of certification for high-risk AI deployments. Insurers can price premiums based on measured safety signals (near-miss rates, recovery performance). Contract design (warranties, SLAs) can incorporate team-level metrics.
- Productivity measurement and public policy: National statistics and policy analysis should account for human-AI team productivity (not model-only metrics). Readiness measures can inform workforce planning, reskilling programs, and social welfare analyses that account for harms from miscalibration.
- Research and benchmarking economics: Economists and researchers can use the framework to build standardized datasets and benchmarks that reflect deployment realities, enabling cost–benefit studies, externalities assessment, and policy simulations.
- Practical actions for firms/policymakers:
  - Require interaction logging and routine calculation of readiness metrics for deployed AI.
  - Use the U-C-I lifecycle to structure onboarding budgets and governance policies.
  - Include readiness outcomes in procurement, regulation, and insurance criteria.
Note: The framework shifts evaluation toward the behavioral and institutional context of AI deployment. For economic models and policy, incorporating these team‑level, longitudinal metrics produces more accurate estimates of benefits, costs, and risks from AI adoption than model accuracy alone.
Assessment
Claims (8)
| Claim | Metric | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Artificial intelligence (AI) systems are deployed as collaborators in human decision-making. | Adoption Rate | positive | high | deployment of AI as collaborators | 0.06 |
| Evaluation practices focus primarily on model accuracy rather than whether human-AI teams are prepared to collaborate safely and effectively. | Governance and Regulation | negative | high | evaluation focus (accuracy vs. team readiness) | 0.06 |
| Empirical evidence shows that many failures arise from miscalibrated reliance, including overuse when AI is wrong and underuse when it is helpful. | Error Rate | negative | high | failures due to miscalibrated reliance (overreliance/underreliance) | 0.12 |
| This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness. | Governance and Regulation | positive | high | team readiness evaluation | 0.02 |
| We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time. | Governance and Regulation | positive | high | evaluation metrics taxonomy (outcomes, reliance behavior, safety signals, learning) | 0.02 |
| The taxonomy and metrics are connected to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration. | Training Effectiveness | positive | high | linking metrics to U-C-I onboarding lifecycle | 0.02 |
| Operationalizing evaluation through interaction traces rather than model properties or self-reported trust enables deployment-relevant assessment of calibration, error recovery, and governance. | Error Rate | positive | high | assessment of calibration, error recovery, governance via interaction traces | 0.02 |
| The framework aims to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration. | AI Safety and Ethics | positive | high | benchmarks, cumulative research, safety and accountability in human-AI collaboration | 0.02 |