The Commonplace

Organizations should evaluate humans and AI on shared capability 'profiles' rather than disparate metrics; doing so makes task allocation more predictive, explainable and auditable, but requires new measurement standards and infrastructure.

Reverse Turing Tests for Human-Machine Task Suitability Assessments Should be Profile-Driven
Jonathan Prunty, Marko Tešić, Ben Slater, Zachary Tidler, Paul Clothier, Luning Sun, Katherine Collins, Bernardo Gonçalves, Giulio Corsi, Seán Ó hÉigeartaigh, Lucy Cheke, Stephen Cave, Jose Hernandez-Orallo · May 20, 2026 · Open MIND
OpenAlex · theoretical · evidence: n/a · relevance: 7/10
The paper argues that task-allocation between humans and AI should use profile-driven evaluations that infer latent capabilities and propensities from observed performance to place agents on shared, predictive, and auditable scales.

As AI is integrated into the workplace, organisations increasingly face allocation decisions between human and machine workers. These decisions are increasingly made or assisted by algorithms, creating a Reverse Turing Test dynamic wherein the machine is now the judge. In addition, human and machine workers may "compete" for a given task, reproducing aspects of adversarial games. This raises new methodological questions about assessing task suitability between humans and machines. The criteria often used to assess people (e.g., education, experience, references) cannot feasibly scale to AI systems; conversely, AI evaluation methods (benchmarks, red teaming, leaderboards) cannot be easily applied to human workers or yield comparable metrics. In this position paper, we argue that suitability evaluations for task-assignment should be profile-driven, that is, based on assessments that infer latent constructs such as capabilities and propensities from observed performance. This approach places humans and AI systems on shared scales, supporting comparisons that are predictive of novel-task performance, explanatory of why agents succeed or fail, and auditable. We outline the core features of this approach, discuss its practical implications, and compare it with alternative frameworks for human-machine workplace allocation.

Summary

Main Finding

Profile-driven suitability evaluations — inferring latent constructs (capabilities, propensities) from observed performance and placing humans and AI systems on shared scales — are the most promising approach for task-allocation decisions as AI is integrated into workplaces. This approach enables comparisons that are predictive of novel-task performance, explanatory about why agents succeed or fail, and auditable, addressing scalability and comparability problems inherent in credential- or benchmark-driven evaluations.

Key Points

  • Reverse Turing Test dynamic: allocation algorithms increasingly judge whether a human or machine should perform a task, reversing traditional evaluation roles and creating adversarial/competitive interactions between human and AI workers.
  • Current evaluation mismatch:
    • Human assessments rely on credentials, references, and interviews that do not scale or translate to AI systems.
    • AI assessments rely on benchmarks, leaderboards, and red teaming that are not directly comparable to human measures.
  • Proposal: adopt profile-driven evaluations that infer latent traits (e.g., domain capabilities, reliability, bias propensity, error modes, adaptivity) from observed performance across tasks and contexts.
  • Desirable properties of profile-driven evaluation:
    • Shared scales across humans and machines for direct comparison.
    • Predictive validity for performance on novel or out-of-distribution tasks.
    • Explainability: profiles illuminate mechanisms and failure modes.
    • Auditability and traceability: tests and inferred constructs can be inspected, certified, and monitored over time.
  • Practical considerations:
    • Profiles should combine multi-task performance, stress tests, and behavior under adversarial or strategic conditions.
    • Measurement design must emphasize reliability, validity, and invariance across populations and system versions.
    • Profiles must be continuously updated to reflect learning, model updates, and human skill drift.
  • Tradeoffs and risks:
    • Gaming and strategic behavior by agents (or vendors) requires robust, dynamic testing and anti-gaming design.
    • Building shared metrics entails institutional coordination, standardization, and possibly regulation.
    • Equity and fairness concerns: profiles must account for distributional impacts and avoid embedding biased training signals.
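
The proposal in the key points above can be made concrete with a minimal sketch. It is an illustration, not the paper's method: the `Profile` class, the dimension names, the numeric levels, and the logistic scoring rule are all assumptions chosen for the example. It shows a human and an AI system described on the same scales, a task described by its demands on those scales, and an allocation that returns an inspectable record alongside the decision.

```python
import math
from dataclasses import dataclass


@dataclass
class Profile:
    """Hypothetical capability profile on scales shared by humans and AI."""
    name: str
    kind: str                    # "human" or "ai"
    abilities: dict[str, float]  # dimension name -> estimated level


def success_probability(profile: Profile, demands: dict[str, float]) -> float:
    """Assumed scoring rule: one logistic term per demanded dimension, multiplied."""
    p = 1.0
    for dim, demand in demands.items():
        if dim not in profile.abilities:
            return 0.0  # an unmeasured dimension gives no basis for suitability
        margin = profile.abilities[dim] - demand
        p *= 1.0 / (1.0 + math.exp(-margin))
    return p


def allocate(profiles: list[Profile], demands: dict[str, float]):
    """Pick the agent with the highest predicted success and keep an audit record."""
    scores = {p.name: success_probability(p, demands) for p in profiles}
    chosen = max(profiles, key=lambda p: scores[p.name])
    audit = {"demands": demands, "scores": scores, "chosen": chosen.name}
    return chosen, audit


# Illustrative values only; none of these numbers come from the paper.
analyst = Profile("analyst", "human", {"reasoning": 1.2, "reliability": 2.0})
model = Profile("model", "ai", {"reasoning": 2.0, "reliability": 0.5})
task_demands = {"reasoning": 1.0, "reliability": 1.5}

chosen, audit = allocate([analyst, model], task_demands)
print(chosen.name, audit["scores"])
```

In this toy case the AI system has the higher reasoning level but falls short of the task's reliability demand, so the human is selected. The audit record explains the decision in terms of the same dimensions used to make it, which is the explanatory and auditable property the key points describe.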

Data & Methods

  • Conceptual approach: measurement theory / psychometrics extended to include AI systems.
  • Core statistical tools:
    • Item Response Theory (IRT) and latent-trait models to infer abilities/capacities from task-response patterns.
    • Factor analysis and structural equation models to identify and validate latent constructs (capability dimensions, propensities).
    • Bayesian hierarchical models to pool information across agents, tasks, and firms while quantifying uncertainty.
    • Measurement invariance testing to ensure scales are comparable across humans and AI systems, and across demographic or system subgroups.
    • Causal inference and transportability methods to estimate generalization to novel tasks and environments.
    • Experimental designs (A/B tests, multi-armed bandits) to evaluate allocation policies and feedback loops in live settings.
    • Stress-testing and adversarial evaluation to reveal failure modes and propensities under strategic conditions.
  • Data requirements:
    • Diverse, representative task sets sampled to probe multiple capability dimensions and contexts.
    • Longitudinal performance logs to capture adaptation, learning, and drift.
    • Metadata on task characteristics, cost, risk, and contextual constraints to enable conditional predictions.
    • Auditable records for provenance, versioning, and reproducibility.
  • Validation criteria:
    • Predictive validity: correlation of profile-inferred traits with out-of-sample or novel-task performance.
    • Construct validity: coherence with theory and convergent evidence from multiple tests.
    • Reliability: stability of trait estimates under repeated measurement.
    • Robustness: resistance to gaming, manipulation, and distribution shifts.
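
As a hedged illustration of the latent-trait tools listed above, the sketch below fits a Rasch (one-parameter IRT) model, in which the probability that agent i succeeds on item j is the logistic function of ability_i minus difficulty_j. The fitting procedure (joint maximum likelihood by plain gradient ascent), the agent names, and the response matrix are assumptions made for the example; the paper does not specify an estimator, and production work would normally use a dedicated psychometrics package and report uncertainty.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_rasch(responses, n_iter=500, lr=0.05):
    """Fit a Rasch model by gradient ascent on the joint log-likelihood.

    responses[i][j] is 1 if agent i succeeded on item j, else 0.
    Returns (ability, difficulty), both on one shared logit scale.
    """
    n_agents, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_agents
    difficulty = [0.0] * n_items
    for _ in range(n_iter):
        # Residual: observed outcome minus model-predicted success probability.
        resid = [[responses[i][j] - sigmoid(ability[i] - difficulty[j])
                  for j in range(n_items)] for i in range(n_agents)]
        for i in range(n_agents):
            ability[i] += lr * sum(resid[i])
        for j in range(n_items):
            difficulty[j] -= lr * sum(resid[i][j] for i in range(n_agents))
        # Centre difficulties at zero to fix the origin of the scale.
        mean_b = sum(difficulty) / n_items
        difficulty = [b - mean_b for b in difficulty]
    return ability, difficulty


# Hypothetical agents and made-up outcomes on six shared test items.
agents = ["human_a", "human_b", "ai_v1", "ai_v2"]
responses = [
    [1, 1, 1, 0, 1, 0],  # human_a
    [1, 1, 0, 0, 0, 0],  # human_b
    [0, 1, 1, 1, 1, 1],  # ai_v1
    [1, 0, 1, 1, 0, 0],  # ai_v2
]

ability, difficulty = fit_rasch(responses)
ranking = sorted(range(len(agents)), key=lambda i: ability[i])
print([agents[i] for i in ranking])
```

Because every agent answers the same items, humans and AI systems land on one scale and can be ranked directly. Whether that scale means the same thing for both populations is exactly what measurement invariance testing would have to establish; this sketch assumes it rather than checks it.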

Implications for AI Economics

  • Labor allocation and substitution/complementarity:
    • Shared-profile metrics reduce frictions in matching tasks to agents, enabling more efficient allocation between human and machine labor and clearer assessment of substitution vs complementarity across task types.
    • Firms can better identify where AI augments human labor (complementary capabilities) versus displaces it (where AI profiles outperform humans).
  • Productivity measurement and accounting:
    • Profiles provide a means to measure and compare “AI capital” and human capital on common axes, improving productivity accounting, valuation of AI investments, and ROI analyses.
  • Hiring, contracting, and compensation:
    • Performance- and profile-based contracting could replace credentials- or title-based hiring, shifting compensation toward demonstrable latent capabilities and propensities.
    • Such contracting would reduce search and information costs for firms but introduce new costs for standardized testing infrastructure and ongoing monitoring.
  • Market structure and competition:
    • Standardized, auditable profiles could lower entry frictions for new AI providers by enabling transparent comparison, but they could also lead to consolidation if large firms control testing infrastructures or datasets.
  • Regulation and governance:
    • Regulators may need to set validation standards, disclosure requirements, and auditing processes for profile-driven evaluations to ensure fairness, safety, and accountability.
    • Profile-based audits can help enforce sector-specific constraints (e.g., high-reliability requirements in healthcare or finance).
  • Dynamics and strategic behavior:
    • Vendors and individuals will have incentives to optimize for profile metrics; the measurement system must be designed to reduce perverse incentives and to detect overfitting to benchmarks.
    • Continuous updating of profiles is necessary as models learn and human skills evolve, affecting labor demand trajectories and retraining needs.
  • Equity and distributional effects:
    • Profile-driven allocation can improve fairness by using task-relevant performance measures, but if underlying data or tests reflect societal biases, profiles may reproduce or amplify inequalities. Policy and design choices will matter for distributional outcomes.
  • Research and data infrastructure:
    • Economic research will require new datasets linking agent profiles to firm outcomes, wages, task allocation decisions, and macro-level productivity to quantify the broader impacts of profile-driven allocation.

Overall, adopting profile-driven suitability evaluations reshapes how firms, workers, and policymakers measure and allocate labor in the age of AI. It promises greater predictive power and comparability across human and machine agents but requires careful measurement design, governance, and infrastructure to manage incentives, fairness, and dynamic effects.

Assessment

Paper Type: theoretical

Evidence Strength: n/a. This is a position/theoretical paper that proposes a framework but presents no empirical tests, experiments, or causal identification; arguments are conceptual and illustrative rather than evidence-based.

Methods Rigor: medium. The paper offers a coherent, well-structured conceptual framework, engages relevant literatures (AI evaluation, psychometrics, personnel assessment), and outlines practical implications and alternative approaches, but it lacks formal modeling, empirical specification, or validation studies.

Sample: No empirical sample; a conceptual/position paper that synthesizes existing evaluation methods (benchmarks, leaderboards, hiring criteria) and literature to propose a profile-driven approach, using illustrative examples rather than original data.

Themes: human_ai_collab, org_design, adoption

Generalizability:
  • No empirical validation across industries or tasks limits external validity.
  • Relies on availability of comparable performance data for humans and AI, which varies by domain and firm.
  • May not generalize to safety-critical or regulated contexts where different standards apply.
  • Different AI architectures, task decompositions, and team structures could impede direct adoption.
  • Legal, cultural, and organizational differences across jurisdictions and firms may constrain implementation.

Claims (7)

Columns: Claim · Direction · Confidence · Outcome · Details

1. As AI is integrated into the workplace, organisations increasingly face allocation decisions between human and machine workers, and these decisions are increasingly made or assisted by algorithms. [Task Allocation]
   Direction: positive · Confidence: high
   Outcome: use of algorithms to make or assist allocation decisions between human and machine workers
   Details: 0.06

2. The increased use of algorithms in allocation decisions creates a Reverse Turing Test dynamic wherein the machine is now the judge. [Task Allocation]
   Direction: negative · Confidence: high
   Outcome: judgment role of algorithms in human-machine task assignment
   Details: 0.02

3. Human and machine workers may 'compete' for a given task, reproducing aspects of adversarial games. [Task Allocation]
   Direction: negative · Confidence: high
   Outcome: competitive interaction between human and AI workers for tasks
   Details: 0.02

4. Common criteria used to assess people (e.g., education, experience, references) cannot feasibly scale to AI systems. [Task Allocation]
   Direction: negative · Confidence: high
   Outcome: scalability of human assessment criteria to AI systems
   Details: 0.06

5. AI evaluation methods (benchmarks, red teaming, leaderboards) cannot be easily applied to human workers or yield comparable metrics. [Task Allocation]
   Direction: negative · Confidence: high
   Outcome: applicability and comparability of AI evaluation methods when applied to humans
   Details: 0.06

6. Suitability evaluations for task-assignment should be profile-driven, that is, based on assessments that infer latent constructs such as capabilities and propensities from observed performance. [Task Allocation]
   Direction: positive · Confidence: high
   Outcome: method for conducting suitability evaluations (profile-driven assessment of latent capabilities/propensities)
   Details: 0.02

7. A profile-driven approach places humans and AI systems on shared scales, supporting comparisons that are predictive of novel-task performance, explanatory of why agents succeed or fail, and auditable. [Task Allocation]
   Direction: positive · Confidence: high
   Outcome: predictive validity for novel-task performance; explanatory power; auditability of comparisons between humans and AI
   Details: 0.02

Notes