Organizations should evaluate humans and AI on shared capability 'profiles' rather than disparate metrics; doing so makes task allocation more predictive, explainable and auditable, but requires new measurement standards and infrastructure.

Reverse Turing Tests for Human-Machine Task Suitability Assessments Should be Profile-Driven

Jonathan Prunty, Marko Tešić, Ben Slater, Zachary Tidler, Paul Clothier, Luning Sun, Katherine Collins, Bernardo Gonçalves, Giulio Corsi, Seán Ó hÉigeartaigh, Lucy Cheke, Stephen Cave, Jose Hernandez-Orallo · May 20, 2026 · Open MIND

Source: OpenAlex · Paper type: theoretical · Evidence strength: n/a · Relevance: 7/10 · Full text: usable (extracted)

The paper argues that task-allocation between humans and AI should use profile-driven evaluations that infer latent capabilities and propensities from observed performance to place agents on shared, predictive, and auditable scales.

Citation observations

Cumulative provider counts captured on specific dates; providers are never combined.

0 cumulative citations

OpenAlex · Observed July 22, 2026


As AI is integrated into the workplace, organisations increasingly face allocation decisions between human and machine workers. These decisions are increasingly made or assisted by algorithms, creating a Reverse Turing Test dynamic wherein the machine is now the judge. In addition, human and machine workers may "compete" for a given task, reproducing aspects of adversarial games. This raises new methodological questions about assessing task suitability between humans and machines. The criteria often used to assess people (e.g., education, experience, references) cannot feasibly scale to AI systems; conversely, AI evaluation methods (benchmarks, red teaming, leaderboards) cannot be easily applied to human workers or yield comparable metrics. In this position paper, we argue that suitability evaluations for task-assignment should be profile-driven, that is, based on assessments that infer latent constructs such as capabilities and propensities from observed performance. This approach places humans and AI systems on shared scales, supporting comparisons that are predictive of novel-task performance, explanatory of why agents succeed or fail, and auditable. We outline the core features of this approach, discuss its practical implications, and compare it with alternative frameworks for human-machine workplace allocation.

Summary

Main Finding

The paper argues that machine-mediated task-assignment decisions between humans and AI — a Reverse Turing Test (RTT) dynamic where machines often act as judges — should be based on profile-driven evaluations. That is, tasks and agents (human, machine, or hybrid) should be represented on shared latent dimensions (capabilities and propensities) inferred from item-level performance. Profile-driven evaluation yields commensurable, predictive, explanatory, and auditable suitability judgments that reduce superficial gaming and bias relative to surface metrics or task-specific benchmarks.

Key Points

Reverse Turing Test dynamic: as AI both performs work and judges worker suitability, allocation decisions increasingly pit humans and machines against automated evaluators, creating imitation/adversarial incentives.
Misalignment of current evaluation methods: human assessment (CVs, experience) and AI assessment (benchmarks, leaderboards) are not directly comparable and generalise poorly to novel or mixed tasks.
Profile-driven approach: represent tasks and agents using latent constructs:
- Capabilities (e.g., planning, language, numerical reasoning) — typically monotonic relationships with task success.
- Propensities (e.g., risk aversion, conscientiousness) — often non-monotonic, with optimal ranges.
Mechanism: infer these constructs from item-level performance annotated by the cognitive/functional demands of items, then match agent profile to task demand profile via transparent rule-based matching.
Advantages:
- Commensurability: places humans and AI on shared functional scales without requiring ontological equivalence.
- Predictive validity: generalises to out-of-distribution and novel tasks because it models demand-sensitivity rather than task-specific attainment.
- Explanatory and auditable: makes explicit why agents succeed or fail and supports human oversight and contestability.
- Resistance to gaming: latent constructs are harder to mimic than surface behaviors; mimicking would require consistent, dimension-specific performance across diverse items.
Practical caveats:
- Suitability ≠ assignment: legal, ethical, economic, and organisational constraints may override suitability.
- Oversight: humans must retain responsibility for profile generation, validation, and constraint-setting, especially for high-stakes decisions.
- Profile updating: both humans and AI evolve; methods are needed for reassessment, versioning, and propagation across hierarchical constructs.
- Bias risk: profile definitions and annotations can embed bias; governance and standards are necessary.
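
The matching mechanism described in the key points can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: the dimension names, the logistic capability curve, the Gaussian fall-off outside a propensity's optimal range, and the "weakest dimension limits suitability" rule are all hypothetical choices made for the example. It shows the two properties the summary highlights: capability fit is monotonic in the agent's level, while propensity fit peaks inside an optimal range, and the per-dimension breakdown is returned so the judgment can be audited.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskProfile:
    capability_demands: dict[str, float]               # level demanded per capability
    propensity_ranges: dict[str, tuple[float, float]]  # optimal (low, high) per propensity


@dataclass
class AgentProfile:
    name: str
    capabilities: dict[str, float]
    propensities: dict[str, float]


def capability_fit(level: float, demand: float, slope: float = 1.5) -> float:
    """Monotonic: a higher level relative to demand never lowers the fit."""
    return 1.0 / (1.0 + math.exp(-slope * (level - demand)))


def propensity_fit(value: float, low: float, high: float, width: float = 0.2) -> float:
    """Non-monotonic: 1.0 inside the optimal range, decaying on either side."""
    if low <= value <= high:
        return 1.0
    gap = low - value if value < low else value - high
    return math.exp(-((gap / width) ** 2))


def suitability(agent: AgentProfile, task: TaskProfile) -> tuple[float, dict[str, float]]:
    """Return (score, per-dimension breakdown). Unmeasured dimensions score 0.0."""
    breakdown: dict[str, float] = {}
    for dim, demand in task.capability_demands.items():
        level = agent.capabilities.get(dim)
        breakdown[dim] = 0.0 if level is None else capability_fit(level, demand)
    for dim, (low, high) in task.propensity_ranges.items():
        value = agent.propensities.get(dim)
        breakdown[dim] = 0.0 if value is None else propensity_fit(value, low, high)
    # Assumed rule: the weakest dimension limits overall suitability.
    return min(breakdown.values()), breakdown


# Hypothetical task and agents, purely for illustration.
task = TaskProfile(
    capability_demands={"planning": 3.0, "numerical_reasoning": 2.0},
    propensity_ranges={"risk_aversion": (0.4, 0.7)},
)
human = AgentProfile("human_analyst", {"planning": 4.0, "numerical_reasoning": 3.0},
                     {"risk_aversion": 0.5})
model = AgentProfile("model_v1", {"planning": 3.5, "numerical_reasoning": 5.0},
                     {"risk_aversion": 0.05})

human_score, human_breakdown = suitability(human, task)
model_score, model_breakdown = suitability(model, task)
limiting_dim = min(model_breakdown, key=model_breakdown.get)
print(human.name, round(human_score, 3))
print(model.name, round(model_score, 3), "limited by", limiting_dim)
```

In this made-up example the model exceeds the human on numerical reasoning but sits outside the task's optimal risk-aversion range, so the breakdown attributes its lower score to a propensity rather than a capability. A suitability score of this kind is an input to assignment, not the assignment itself.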

Data & Methods (as used / proposed in the paper)

Nature of the work: position/conceptual paper synthesising prior empirical and theoretical work; no new large-scale dataset introduced.
Empirical supports cited:
- Studies showing process-level behavioral signatures distinguish humans from AI even when aggregate performance is matched.
- Work demonstrating that mapping model profiles to task demand profiles (via annotated item-level demands) improves prediction of performance on novel benchmarks and tasks.
Proposed methodological components:
- Item-level demand annotation: annotate benchmark items or task instances with the latent cognitive/functional demands they impose.
- Latent-construct inference: use item-response / factor-analytic / demand-sensitivity models to infer agent profiles from performance across a heterogeneous item battery.
- Profile-to-profile matching: transparent rule-based or interpretable scoring functions to compare agent capabilities/propensities with task demand profiles and estimate expected task performance.
- Validation: expert-validated task profiles, automated annotators, and professional guidelines can be used to create/validate task demand profiles.
Implementation examples (illustrative only): long-term hiring (hire a human and train them to fill capability gaps), dynamic dispatch (real-time routing of service calls to a human or an AI), hybrid workflows (task decomposition across agent types, such as a CNN, a radiologist, and a nurse).
Gaps and research needs identified: standardised cross-agent profile generation, criteria for reassessment frequency, hierarchical propagation of skill improvements, and mitigation of annotation bias.
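
The inference step listed among the methodological components can be illustrated with a deliberately simple model. The sketch below assumes a Rasch-style (one-parameter logistic) relationship in which the probability of passing an item is a logistic function of the agent's ability minus the item's annotated demand; this is one member of the item-response family the summary mentions, swapped in here for brevity and not claimed to be the model the authors use. It also assumes each item is annotated with a single dimension and a known demand level, and the item data are invented.

```python
import math
from collections import defaultdict


def p_success(ability: float, demand: float) -> float:
    """Assumed 1PL link: success probability falls as demand exceeds ability."""
    return 1.0 / (1.0 + math.exp(-(ability - demand)))


def estimate_ability(responses, low=-6.0, high=6.0, iters=60):
    """Maximum-likelihood ability from (demand, passed) pairs.

    The score function sum(passed - p_success) is strictly decreasing in
    ability, so its root can be found by bisection. All-pass or all-fail
    records have no finite maximum and are clamped to the search bounds.
    """
    def score(ability):
        return sum(passed - p_success(ability, demand) for demand, passed in responses)

    if score(low) <= 0:
        return low
    if score(high) >= 0:
        return high
    for _ in range(iters):
        mid = (low + high) / 2.0
        if score(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def infer_profile(item_results):
    """Group (dimension, demand, passed) records by dimension and estimate each."""
    by_dim = defaultdict(list)
    for dim, demand, passed in item_results:
        by_dim[dim].append((demand, passed))
    return {dim: estimate_ability(rs) for dim, rs in by_dim.items()}


# Invented item battery: the same items could be given to a human or a model.
results = [
    ("planning", 1.0, 1), ("planning", 2.0, 1), ("planning", 3.0, 1),
    ("planning", 4.0, 0), ("planning", 5.0, 0),
    ("language", 1.0, 1), ("language", 2.0, 1), ("language", 3.0, 1),
    ("language", 4.0, 1), ("language", 5.0, 0),
]
profile = infer_profile(results)
for dim, ability in profile.items():
    print(dim, round(ability, 2))

# The inferred profile predicts performance on an item the agent never saw.
print(round(p_success(profile["planning"], 2.5), 2))
```

Because the estimate is expressed on the same scale as the item demands, it can be compared across agent types and used to predict success on unseen items, which is the sense in which the summary describes profiles as predictive.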

Implications for AI Economics

Labor allocation and comparative advantage:
- Profile-driven matching alters how comparative advantage is measured: analyses should consider latent capability/demand alignment, not just observed output or time-series task completion.
- Platform allocation decisions will increasingly route tasks to the agent with the best profile fit, changing labor demand composition across occupations and tasks.
Measurement of automation and displacement:
- Traditional automation metrics (routine/non-routine task classifications, occupation-level shares) may mis-measure substitutability; latent-profile measures enable finer-grained assessment of which specific capabilities/tasks are automatable or complementary.
- Welfare/redistribution analyses should account for heterogeneous propensities and optimal ranges (some roles require human propensities that are hard to replicate).
Incentives, Goodharting, and strategic behavior:
- Profile-based systems reduce but do not eliminate Goodhart effects; firms/workers will still optimize observable behaviors tied to latent estimates. Economists should model dynamic strategic responses (investment in skills, use of assistive AI, sandbagging).
- Access to particular profiling tools/models may create new rent-seeking and self-preferencing dynamics (platform bias toward its own AI outputs).
Wage and human capital dynamics:
- Employers can target training to specific latent capability gaps, changing returns to specific forms of training and credentialing.
- Compensation may shift to reward maintenance of propensities or multidimensional capability portfolios that are scarce relative to task demand distributions.
Platform and market design:
- Platforms should incorporate auditable profile-matching and human oversight mechanisms to reduce unfairness and enable contestability—policy/regulation could mandate standards for profile transparency and reassessment.
- Dynamic pricing/routing algorithms will need to internalise externalities from misallocation (errors, safety risks) and legal/ethical constraints that override pure suitability.
Empirical research agenda and data needs:
- Construct datasets linking item-level task demands, agent performance (humans and models), and downstream outcomes (error costs, customer satisfaction).
- Natural experiments: platform rollouts that switch routing logic to profile-driven matching can identify causal effects on productivity, wages, and task outcomes.
- Structural models: estimate latent capability distributions, returns to capability investments, and equilibrium allocation when platforms optimize across cost, risk, and fairness constraints.
- Identification challenges: selection into tasks, endogenous profile improvement (learning-by-doing, model updates), and covert hybridization (humans using AI or AI outsourcing subtasks) must be addressed.
Policy implications:
- Regulation could require auditable profile frameworks for high-stakes allocation (hiring, medical triage), standards for item-demand annotation, and rules for human oversight.
- Labor-market policy should consider re-skilling subsidies targeted at latent capability gaps identified by profile-based systems, and protections against employer self-preferencing.
Macroeconomic and productivity accounting:
- Productivity measures should adjust for improved matching due to profile-based allocation; measured output gains may reflect better allocation rather than pure automation.
- Distributional effects: improved matching can increase aggregate efficiency but may concentrate gains among owners of AI and platform firms unless countervailing policies are enacted.
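
Two points from the list above, that routing should internalise the cost of misallocation and that legal or ethical constraints can override pure suitability, can be made concrete with a toy router. Everything in this sketch is an assumption introduced for illustration: the `Candidate` and `Task` fields, the linear expected-cost rule, and all numbers. Success probabilities are taken as given, for example from a profile match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str         # "human" or "ai"
    p_success: float  # predicted success probability for this task
    unit_cost: float  # cost of assigning the task to this agent


@dataclass(frozen=True)
class Task:
    name: str
    error_cost: float             # cost borne if the task fails
    human_required: bool = False  # hard constraint, e.g. a legal requirement


def expected_cost(candidate: Candidate, task: Task) -> float:
    """Assignment cost plus the expected cost of an error."""
    return candidate.unit_cost + (1.0 - candidate.p_success) * task.error_cost


def route(task: Task, candidates: list[Candidate]) -> Candidate:
    """Apply hard constraints first, then minimise expected cost."""
    eligible = [c for c in candidates if not task.human_required or c.kind == "human"]
    if not eligible:
        raise ValueError(f"no eligible candidate for task {task.name!r}")
    return min(eligible, key=lambda c: expected_cost(c, task))


# Invented candidates and tasks.
candidates = [
    Candidate("human_agent", "human", p_success=0.97, unit_cost=40.0),
    Candidate("ai_agent", "ai", p_success=0.90, unit_cost=2.0),
]
low_stakes = Task("password_reset", error_cost=100.0)
high_stakes = Task("billing_dispute", error_cost=1000.0)
regulated = Task("identity_check", error_cost=100.0, human_required=True)

for task in (low_stakes, high_stakes, regulated):
    print(task.name, "->", route(task, candidates).name)
```

With these numbers the cheaper AI agent takes the low-stakes task, the higher error cost tips the second task to the human, and the third goes to the human regardless of cost because the constraint filters the AI out before any comparison is made.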

Suggested priorities for economists:
- Build and curate item-level annotated benchmark banks that are agent-agnostic.
- Estimate how much profile-driven routing changes marginal returns to different skills and alters wage distributions.
- Evaluate regulatory interventions (transparency mandates, oversight requirements) by modeling trade-offs between efficiency, fairness, and robustness to gaming.

Overall, the profile-driven framework reshapes how economists should think about human–AI substitution and complementarity: move from coarse task counts toward latent-capability-based measurement and policy design that acknowledges dynamics of adaptation, incentives, and platform governance.

Assessment

Paper Type: theoretical
Evidence Strength: n/a. This is a position/theoretical paper that proposes a framework but presents no empirical tests, experiments, or causal identification; arguments are conceptual and illustrative rather than evidence-based.
Methods Rigor: medium. The paper offers a coherent, well-structured conceptual framework, engages relevant literatures (AI evaluation, psychometrics, personnel assessment), and outlines practical implications and alternative approaches, but it lacks formal modeling, empirical specification, or validation studies.
Sample: No empirical sample; a conceptual/position paper that synthesizes existing evaluation methods (benchmarks, leaderboards, hiring criteria) and literature to propose a profile-driven approach, using illustrative examples rather than original data.
Themes: human_ai_collab, org_design, adoption
Generalizability:
- No empirical validation across industries or tasks limits external validity.
- Relies on availability of comparable performance data for humans and AI, which varies by domain and firm.
- May not generalize to safety-critical or regulated contexts where different standards apply.
- Different AI architectures, task decompositions, and team structures could impede direct adoption.
- Legal, cultural, and organizational differences across jurisdictions and firms may constrain implementation.

Claims (7)

Claim	Topic	Direction	Outcome	Confidence & Evidence	Details
As AI is integrated into the workplace, organisations increasingly face allocation decisions between human and machine workers, and these decisions are increasingly made or assisted by algorithms.	Task Allocation	positive	use of algorithms to make or assist allocation decisions between human and machine workers	Reading fidelity: high; Study strength: low	not reported 0.06
The increased use of algorithms in allocation decisions creates a Reverse Turing Test dynamic wherein the machine is now the judge.	Task Allocation	negative	judgment role of algorithms in human-machine task assignment	Reading fidelity: high; Study strength: speculative	not reported 0.02
Human and machine workers may 'compete' for a given task, reproducing aspects of adversarial games.	Task Allocation	negative	competitive interaction between human and AI workers for tasks	Reading fidelity: high; Study strength: speculative	not reported 0.02
Common criteria used to assess people (e.g., education, experience, references) cannot feasibly scale to AI systems.	Task Allocation	negative	scalability of human assessment criteria to AI systems	Reading fidelity: high; Study strength: low	not reported 0.06
AI evaluation methods (benchmarks, red teaming, leaderboards) cannot be easily applied to human workers or yield comparable metrics.	Task Allocation	negative	applicability and comparability of AI evaluation methods when applied to humans	Reading fidelity: high; Study strength: low	not reported 0.06
Suitability evaluations for task-assignment should be profile-driven, that is, based on assessments that infer latent constructs such as capabilities and propensities from observed performance.	Task Allocation	positive	method for conducting suitability evaluations (profile-driven assessment of latent capabilities/propensities)	Reading fidelity: high; Study strength: speculative	not reported 0.02
A profile-driven approach places humans and AI systems on shared scales, supporting comparisons that are predictive of novel-task performance, explanatory of why agents succeed or fail, and auditable.	Task Allocation	positive	predictive validity for novel-task performance; explanatory power; auditability of comparisons between humans and AI	Reading fidelity: high; Study strength: speculative	not reported 0.02