A new mathematical framework shows human–AI teams can achieve genuine complementarity in regression tasks but not in classification under common aggregation and loss assumptions; in particular, selector-based protocols never yield gains and classification aggregators satisfying natural monotonicity constraints are provably obstructed.
Complementarity is the case in which a human–AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
Summary
Main Finding
The paper introduces a formal, tree-based framework for multi-agent human–AI interactions (HAI) and shows that whether a multi-agent protocol can achieve complementarity depends critically on (1) task type (regression vs classification), (2) the choice of benchmark (pointwise-min oracle), and (3) the local aggregation rule used at binary composition nodes of a protocol tree. Key results:
- Selector rules (including self- or AI-reliance, which pick each output coordinate from one of the inputs) can never achieve complementarity relative to the pointwise-min oracle, for any task or prediction quality.
- In regression with squared loss and linear local aggregation, complementarity is attainable and equivalent to minimizing Euclidean distance to the ground-truth vector. For N = 2 there is a closed-form optimal pooling weight with a residual-correction interpretation.
- For linear pooling, every protocol tree gives a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations preserve complementarity and satisfy the pentagon coherence identity for N = 4.
- In binary classification, any internal (coordinatewise within-input-range) local composition cannot beat the pointwise-min benchmark under broad endpoint-monotone losses (including Bregman and binary cross-entropy); an analogous obstruction holds for coordinatewise-internal multiclass aggregation under cross-entropy. Escaping this impossibility requires non-internal (amplifying/externally-extending) local rules.
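The closed-form weight for N = 2 is not reproduced in this summary. What follows is a standard least-squares derivation under stated assumptions, not the paper's own statement: α is left unconstrained, $\hat y^{(2)}$ is taken to be the AI prediction, and the symbols $d$ (human–AI disagreement) and $e_2$ (AI residual) are introduced here for illustration. The paper's version may differ, for example by restricting α to $[0, 1]$.

$$
\hat y_\alpha = \alpha\,\hat y^{(1)} + (1-\alpha)\,\hat y^{(2)} = \hat y^{(2)} + \alpha\, d,
\qquad d = \hat y^{(1)} - \hat y^{(2)}, \quad e_2 = \hat y^{(2)} - y
$$

$$
\|\hat y_\alpha - y\|^2 = \|e_2\|^2 + 2\alpha\,\langle e_2, d\rangle + \alpha^2\,\|d\|^2
\quad\Longrightarrow\quad
\alpha^\star = -\frac{\langle e_2, d\rangle}{\|d\|^2},
\qquad
\|\hat y_{\alpha^\star} - y\|^2 = \|e_2\|^2 - \frac{\langle e_2, d\rangle^2}{\|d\|^2}
$$

This requires $d \neq 0$. Under these assumptions the pooled residual is the AI residual with its projection onto the disagreement direction removed, which is one way to read "residual correction". Minimizing the distance does not by itself guarantee complementarity: the resulting mean loss must still fall below the pointwise-min benchmark.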
Key Points
- Formal model: an ordered agent-role configuration + a rooted planar binary (protocol) tree whose leaves are agents’ prediction vectors. A local binary composition rule m2 is applied recursively to produce the protocol output.
- Benchmark choice matters: the paper uses a strict pointwise-min oracle (casewise minimum loss across agents before averaging). This benchmark is at least as demanding as the usual aggregate benchmark (min after averaging).
- Impossibility for selector rules: any nodewise selector (a rule that chooses, coordinatewise, between its two inputs) cannot produce outputs that strictly improve upon the pointwise-min oracle, so reliance- or deferral-style outputs cannot yield complementarity under this benchmark.
- Regression positive result: Under squared loss, maximizing complementarity reduces to minimizing Euclidean distance between the protocol output and the true-label vector. For N = 2 and linear pooling ŷ = α ŷ1 + (1 − α) ŷ2, the optimal α has closed form; complementarity arises when human–AI disagreement corrects the AI residual in the right direction and magnitude.
- Algebraic/geometric structure: For linear local composition, protocol trees parametrize leaf weight simplices via barycentric coordinates; Tamari moves change tree topology but can be compensated by reparameterization that preserves complementarity. For N = 4 these reparameterizations obey the pentagon identity.
- Classification obstruction: For endpoint-monotone losses, any internal composition keeps each output coordinate within the range spanned by the corresponding input coordinates, which prevents improvement over the pointwise-min benchmark. Non-internal rules (e.g., amplified logit pooling) are needed to possibly attain complementarity in classification.
- Practical interpretation: complementarity is attainable in multi-agent regression protocols with synthesizing (non-selector) aggregators; it is blocked in many natural classification settings when aggregators are restricted to internal or selector-like rules.
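The key points above can be checked numerically on a toy example. The sketch below is illustrative only: the data, the function names (`pointwise_min_benchmark`, `aggregate_benchmark`, `psi`), and the fixed pooling weight 0.5 are assumptions made here, not taken from the paper. It compares the two benchmarks, enumerates every coordinatewise selector, and tries convex pooling of probabilities under binary cross-entropy.

```python
import itertools
import numpy as np

def squared_loss(pred, y):
    return (pred - y) ** 2

def cross_entropy(p, y):
    # Binary cross-entropy for probabilities p in (0, 1) and labels y in {0, 1}.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pointwise_min_benchmark(leaf_losses):
    # Casewise minimum across agents first, then average over cases.
    return float(np.min(leaf_losses, axis=0).mean())

def aggregate_benchmark(leaf_losses):
    # Average over cases first, then take the best agent.
    return float(leaf_losses.mean(axis=1).min())

def psi(leaves, output, y, loss):
    # Complementarity: benchmark loss minus the protocol's mean loss.
    leaf_losses = np.stack([loss(p, y) for p in leaves])
    return pointwise_min_benchmark(leaf_losses) - float(loss(output, y).mean())

# Hypothetical regression data: the two agents err in opposite directions.
y = np.array([1.0, 2.0, 3.0, 4.0])
human = np.array([1.5, 1.6, 3.6, 3.5])
ai = np.array([0.6, 2.5, 2.5, 4.6])

reg_losses = np.stack([squared_loss(human, y), squared_loss(ai, y)])
bench_pointwise = pointwise_min_benchmark(reg_losses)  # about 0.205
bench_aggregate = aggregate_benchmark(reg_losses)      # about 0.255

# Linear pooling synthesizes a new prediction and can beat the oracle.
psi_pool = psi([human, ai], 0.5 * human + 0.5 * ai, y, squared_loss)  # about 0.2025

# Every coordinatewise selector output: none exceeds the benchmark.
psi_selectors = []
for mask in itertools.product([True, False], repeat=len(y)):
    picked = np.where(np.array(mask), human, ai)
    psi_selectors.append(psi([human, ai], picked, y, squared_loss))
best_selector_psi = max(psi_selectors)

# Hypothetical binary classification data with convex (internal) pooling.
labels = np.array([1.0, 0.0, 1.0, 0.0])
p_human = np.array([0.7, 0.4, 0.6, 0.2])
p_ai = np.array([0.9, 0.3, 0.4, 0.1])
psi_internal = [
    psi([p_human, p_ai], a * p_human + (1 - a) * p_ai, labels, cross_entropy)
    for a in np.linspace(0.0, 1.0, 11)
]
best_internal_psi = max(psi_internal)

print(bench_pointwise, bench_aggregate, psi_pool)
print(best_selector_psi, best_internal_psi)
```

On this data the pooled regression output has positive complementarity, the best selector only matches the oracle (complementarity zero), and every convex pooling weight in the classification example stays below zero. One toy dataset illustrates the statements but does not prove them.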
Data & Methods
- Nature of the work: theoretical/formal. No empirical datasets or experiments—results are derived mathematically.
- Formal definitions: prediction task τ = (X, Y, Ŷ, ℓ); labeled dataset D; agents produce empirical prediction vectors ŷ(i) ∈ Ŷ^n. Ordered agent-role configurations capture workflow order; rooted planar binary trees (leaves labeled left-to-right by the ordered agents) capture protocol topology.
- Interaction model: assumption of local binary composition: every internal node applies a two-input rule m2 : Ŷ^n × Ŷ^n → Ŷ^n; the tree output ŷT is obtained by recursive application of m2.
- Complementarity functional: Ψ(leaf predictions; D) = Φ(leaf predictions; D) − Θ(ŷT; D), where Φ is the benchmark (chosen as the empirical average of the casewise minimum loss across leaves) and Θ is the empirical average loss of the protocol output.
- Theoretical tools and proofs:
  - Task-independent impossibility proofs for selector rules relative to the pointwise-min benchmark.
  - Regression analysis under squared loss: algebraic derivation showing equivalence to Euclidean distance minimization; closed-form weight for N = 2 linear pooling and residual-correction geometric interpretation.
  - Geometric/algebraic analysis of linear pooling across trees: barycentric coordinate charts, Tamari-cover reparameterizations, and pentagon identity.
  - Classification impossibility proofs for internal rules under endpoint-monotone losses (covering many Bregman and finite Bernoulli f-divergence losses, including cross-entropy).
- Assumptions and limitations explicitly used:
  - All agents act on the same labeled dataset and predict the same target.
  - Protocols are binary trees (binary/pairwise local interaction).
  - Benchmark is pointwise-min oracle; locality and internality properties of m2 are central to impossibility results.
  - Results are formal/mathematical and need empirical evaluation for real-world settings.
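The recursive interaction model and the tree reparameterization can be sketched for linear local composition. This is a minimal illustration under assumptions made here: each internal node carries a single scalar weight `w` and computes `w * left + (1 - w) * right`, trees are encoded as nested tuples, and `rotate_right` implements one reassociation move ((a, b), c) to (a, (b, c)). The function names and the data are hypothetical, and the paper's own parameterization may differ.

```python
import numpy as np

# A tree is a leaf index (int) or a triple (left_subtree, right_subtree, w).

def evaluate(tree, leaves):
    # Apply the local binary rule recursively from the leaves to the root.
    if isinstance(tree, int):
        return leaves[tree]
    left, right, w = tree
    return w * evaluate(left, leaves) + (1 - w) * evaluate(right, leaves)

def leaf_weights(tree, n_leaves):
    # Barycentric coordinates: the product of node weights along each root-to-leaf path.
    weights = np.zeros(n_leaves)

    def walk(node, mass):
        if isinstance(node, int):
            weights[node] += mass
        else:
            left, right, w = node
            walk(left, mass * w)
            walk(right, mass * (1 - w))

    walk(tree, 1.0)
    return weights

def rotate_right(tree):
    # ((a, b; s), c; t) -> (a, (b, c; v); u), keeping the same leaf weights.
    (a, b, s), c, t = tree
    u = t * s
    v = t * (1 - s) / (1 - u)  # assumes u < 1
    return (a, (b, c, v), u)

def psi_squared(leaves, output, y):
    # Pointwise-min benchmark minus the protocol's mean squared loss.
    leaf_losses = np.stack([(p - y) ** 2 for p in leaves])
    return float(leaf_losses.min(axis=0).mean() - ((output - y) ** 2).mean())

# Hypothetical three-agent regression example.
y = np.array([1.0, 2.0, 3.0])
leaves = [
    np.array([1.4, 2.5, 2.2]),
    np.array([0.5, 1.8, 3.6]),
    np.array([1.2, 1.5, 3.3]),
]

left_tree = ((0, 1, 0.5), 2, 0.6)
right_tree = rotate_right(left_tree)

w_left = leaf_weights(left_tree, 3)    # leaf weights 0.3, 0.3, 0.4
w_right = leaf_weights(right_tree, 3)  # same weights after the move

psi_left = psi_squared(leaves, evaluate(left_tree, leaves), y)
psi_right = psi_squared(leaves, evaluate(right_tree, leaves), y)
print(w_left, w_right, psi_left, psi_right)
```

Because both trees induce the same leaf weights, they produce the same output and therefore the same value of Ψ. This is the sense in which a change of tree topology can be compensated by reparameterizing node weights, at least in this scalar-weight setting.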
Implications for AI Economics
- Evaluation standards and procurement
  - Choice of benchmark affects whether multi-agent HAI is judged complementary. Using the stricter pointwise-min oracle makes complementarity harder to obtain; contracting, procurement, and regulatory evaluation should be explicit about whether they compare to aggregate- or casewise-best baselines.
  - For high-stakes domains (medical diagnosis, safety-critical classification), if casewise best-available-standalone predictions are the appropriate baseline, many common composition/deferral schemes will not deliver complementary performance; procurement should require synthesizing aggregators or permit non-internal composition.
- Product design, markets, and competition
  - Vendors of aggregators/ensemble systems can create value beyond simple selection/deferral only by using non-selector, synthesizing composition rules (especially important for classification markets where internal rules fail). This creates market opportunities for novel aggregator modules (e.g., learned pooling that can extrapolate beyond inputs).
  - For regression-heavy industries (pricing, forecasting, demand estimation), multi-agent protocols with linear/non-selector pooling are more likely to yield measurable complementarity, implying higher returns to investing in workflow design and aggregation algorithms.
- Labor and task allocation
  - The formal obstruction to selector-based complementarity suggests that mere sequencing or allocation (i.e., better routing or deferral) may not produce gains unless interaction rules synthesize information. This affects the value of human labor complementarities with AI: training humans to produce corrective disagreement directions (residual-correcting behaviors) or enabling aggregators that combine signals can increase joint productivity.
  - Compensation and incentives: firms should incentivize agents (human and AI-tool designers) to produce useful, non-redundant signals and to design aggregators that extract synergistic corrections rather than merely selecting among inputs.
- Innovation incentives and R&D
  - Theoretical positive results for regression and the algebraic structure (Tamari invariance) imply that investment in flexible, parameterized aggregation methods can be robust to workflow reordering; R&D may focus on learning barycentric weights or reparameterizations rather than redesigning entire workflows.
  - In classification, the necessity of non-internal rules to escape impossibility suggests new research directions (and commercial value) for calibrated amplification or learned transforms that safely extend beyond convex hulls of inputs; this raises safety, calibration, and regulatory scrutiny concerns.
- Policy and regulation
  - Regulators evaluating human–AI systems should require transparent specification of aggregation rules and baselines (casewise vs aggregate). For sensitive classification tasks, ensuring complementarity may require approving non-standard aggregators and verifying that amplification does not induce harmful miscalibration or distributional harms.
- Empirical and macroeconomic research directions
  - Empirical HAI economics should distinguish regression vs classification domains when estimating complementarities; cross-sectional studies that mix tasks risk misinterpreting the prevalence of complementarity.
  - Macro estimates of productivity gains from AI that assume generic human–AI complementarity may overstate gains in classification-heavy sectors; sectoral heterogeneity must be modeled.
  - Future empirical work should test the pointwise-min oracle as a benchmark and evaluate whether practical aggregators can implement non-internal transforms safely and reliably.
Suggested next steps for economists and practitioners
- Empirically test the framework: compare aggregate vs pointwise-min benchmarks across domains; implement linear vs selector vs non-internal aggregators in laboratory and field settings.
- For procurement/contracts: specify the benchmark and require evidence that chosen aggregators can surpass the pointwise-min baseline, or accept the implication that selectors will not.
- For firms and vendors: invest in synthesizing aggregation methods and training human agents to produce residual-correcting signals (especially where tasks are regression-like).
- For policy: require disclosure of aggregation rules and baseline definitions; consider calibration and safety criteria when non-internal amplifying aggregators are deployed.
Overall, the paper provides a rigorous algebraic/geometric lens on when and how multi-agent HAI can be complementary. For AI economics, it highlights that complementarity is not automatic and depends on task type, benchmark choice, and the nature of local aggregation—insights that should shape measurement, contracting, product strategy, and regulation.
Assessment
Claims (6)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. [Output Quality] | negative | high | complementarity (HAI performance relative to best member) | 0.2 |
| In regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector. [Output Quality] | mixed | high | complementarity (as characterization via Euclidean distance) | 0.2 |
| For N = 2 in regression under squared loss, the optimal linear-pooling weight has a closed form and admits a residual-correction interpretation. [Task Allocation] | positive | high | optimal pooled prediction weight (performance of 2-agent aggregation) | 0.2 |
| Under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for N = 4 these reparameterizations satisfy the pentagon identity. [Task Allocation] | positive | high | structural invariances of aggregation protocols (preservation of complementarity under reparameterization) | 0.2 |
| In binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses (including standard Bregman and many finite Bernoulli f-divergence losses); an analogous obstruction holds for multiclass aggregation under cross-entropy. [Output Quality] | negative | high | complementarity in classification aggregation | 0.2 |
| Overall, complementarity is attainable in multi-agent regression but obstructed in classification under natural conditions on local aggregation and loss functions. [Output Quality] | mixed | high | attainability of complementarity across problem classes (regression vs classification) | 0.2 |