A new mathematical framework shows human–AI teams can achieve genuine complementarity in regression tasks but not in classification under common aggregation and loss assumptions; in particular, selector-based protocols never yield gains and classification aggregators satisfying natural monotonicity constraints are provably obstructed.

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Andrea Ferrario · June 03, 2026

arXiv · theoretical · evidence: n/a · relevance: 7/10

The paper formalizes multi-agent human–AI prediction workflows using tree-structured composition rules and shows complementarity is attainable for multi-agent regression under squared loss but is obstructed for classification under broad, natural aggregation and loss conditions.

Complementarity is the case in which a human–AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.

Summary

Main Finding

The paper introduces a formal, tree-based framework for multi-agent human–AI interactions (HAI) and shows that whether a multi-agent protocol can achieve complementarity depends critically on (1) task type (regression vs classification), (2) the choice of benchmark (pointwise-min oracle), and (3) the local aggregation rule used at the binary composition nodes of a protocol tree. Key results:
- Selector rules (including self- or AI-reliance, which pick each output coordinate from one of the inputs) can never achieve complementarity relative to the pointwise-min oracle, for any task or prediction quality.
- In regression with squared loss and linear local aggregation, complementarity is attainable and equivalent to minimizing Euclidean distance to the ground-truth vector. For N = 2 there is a closed-form optimal pooling weight with a residual-correction interpretation.
- For linear pooling, every protocol tree gives a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations preserve complementarity and satisfy the pentagon coherence identity for N = 4.
- In binary classification, no internal (coordinatewise within-input-range) local composition can beat the pointwise-min benchmark under broad endpoint-monotone losses (including Bregman and binary cross-entropy); an analogous obstruction holds for coordinatewise-internal multiclass aggregation under cross-entropy. Escaping this impossibility requires non-internal (amplifying/externally-extending) local rules.
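The selector result can be checked by brute force on a toy example. The sketch below is only an illustration, not the paper's proof: it assumes squared loss, two agents, and made-up prediction values, and the helper names (`psi`, `pointwise_min_benchmark`) are hypothetical. It enumerates every coordinatewise selector and records the best complementarity value any of them reaches.

```python
from itertools import product

def sq_loss(pred, truth):
    return (pred - truth) ** 2

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pointwise_min_benchmark(leaves, y):
    # Average over cases of the smallest per-case loss among the agents.
    return mean(min(sq_loss(p[j], y[j]) for p in leaves) for j in range(len(y)))

def psi(leaves, output, y):
    # Benchmark minus protocol loss; a positive value means complementarity.
    theta = mean(sq_loss(output[j], y[j]) for j in range(len(y)))
    return pointwise_min_benchmark(leaves, y) - theta

# Illustrative values only (assumed, not taken from the paper).
y = [1.0, 2.0, 3.0, 4.0]
human = [1.5, 2.0, 2.0, 4.5]
ai = [1.0, 3.0, 3.2, 3.0]
agents = (human, ai)

# mask[j] says which agent's prediction the selector copies on case j.
selector_psis = [
    psi(agents, [agents[m][j] for j, m in enumerate(mask)], y)
    for mask in product((0, 1), repeat=len(y))
]
best_selector_psi = max(selector_psis)
print(best_selector_psi)
```

On this data the best selector only ties the benchmark: a selector's per-case loss is always one of the agents' losses, so it can never fall below their per-case minimum.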

Key Points

Formal model: an ordered agent-role configuration + a rooted planar binary (protocol) tree whose leaves are agents’ prediction vectors. A local binary composition rule m2 is applied recursively to produce the protocol output.
Benchmark choice matters: the paper uses a strict pointwise-min oracle (casewise minimum loss across agents before averaging). This benchmark is at least as demanding as the usual aggregate benchmark (min after averaging).
Impossibility for selectors: no nodewise selector (a rule that chooses, coordinate by coordinate, between its two inputs) can produce outputs that strictly improve upon the pointwise-min oracle, so reliance- or deferral-style outputs cannot yield complementarity under this benchmark.
Regression positive result: Under squared loss, maximizing complementarity reduces to minimizing Euclidean distance between the protocol output and the true-label vector. For N = 2 and linear pooling ŷ = α ŷ1 + (1 − α) ŷ2, the optimal α has closed form; complementarity arises when human–AI disagreement corrects the AI residual in the right direction and magnitude.
Algebraic/geometric structure: For linear local composition, protocol trees parametrize leaf weight simplices via barycentric coordinates; Tamari moves change tree topology but can be compensated by reparameterization that preserves complementarity. For N = 4 these reparameterizations obey the pentagon identity.
Classification obstruction: For endpoint-monotone losses, any internal composition keeps outputs within the convex hull of inputs coordinatewise, preventing improvement over the pointwise-min benchmark. Non-internal rules (e.g., amplified logit pooling) are required for complementarity to be possible in classification.
Practical interpretation: complementarity is attainable in multi-agent regression protocols that use synthesizing (non-selector) aggregators; in many natural classification settings it is blocked whenever aggregators are restricted to internal or selector-like rules.
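The N = 2 regression point can be made concrete with a short sketch. The closed form used here is the standard least-squares minimiser of the pooled squared error, derived directly from ŷ = α ŷ1 + (1 − α) ŷ2; it may differ in notation from the paper's expression. Treating ŷ1 as the human and ŷ2 as the AI, the data values, and the function names are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def optimal_alpha(y_hat1, y_hat2, y):
    # Minimise ||alpha*y_hat1 + (1 - alpha)*y_hat2 - y||^2 over real alpha.
    d = [a - b for a, b in zip(y_hat1, y_hat2)]  # disagreement between the agents
    r = [t - b for t, b in zip(y, y_hat2)]       # residual of agent 2 (the AI here)
    dd = dot(d, d)
    if dd == 0:
        return 0.0  # identical predictions: every alpha gives the same output
    return dot(d, r) / dd  # projection of the residual onto the disagreement

def psi_linear(y_hat1, y_hat2, y, alpha):
    n = len(y)
    pooled = [alpha * a + (1 - alpha) * b for a, b in zip(y_hat1, y_hat2)]
    # Pointwise-min benchmark: best agent per case, then averaged.
    phi = sum(min((a - t) ** 2, (b - t) ** 2)
              for a, b, t in zip(y_hat1, y_hat2, y)) / n
    theta = sum((p - t) ** 2 for p, t in zip(pooled, y)) / n
    return phi - theta

# Illustrative values: the two agents err in opposite directions on every case.
y = [1.0, 2.0, 3.0]
human = [2.0, 1.0, 4.0]
ai = [0.0, 3.0, 2.0]

alpha = optimal_alpha(human, ai, y)
print(alpha, psi_linear(human, ai, y, alpha))  # 0.5 1.0
```

Here the disagreement vector points exactly along the AI residual, so pooling cancels both agents' errors and the protocol beats even the casewise-best agent. The minimiser is unconstrained; if weights must stay on the simplex, one option is to clip α to [0, 1].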

Data & Methods

Nature of the work: theoretical/formal. No empirical datasets or experiments—results are derived mathematically.
Formal definitions: prediction task τ = (X, Y, Ŷ, ℓ); labeled dataset D; agents produce empirical prediction vectors ŷ(i) ∈ Ŷ^n. Ordered agent-role configurations capture workflow order; rooted planar binary trees (leaves labeled left-to-right by the ordered agents) capture protocol topology.
Interaction model: Assumption of local binary composition: every internal node applies a two-input rule m2 : Ŷ^n × Ŷ^n → Ŷ^n; the tree output ŷT is obtained by recursive application of m2.
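Complementarity functional: Ψ(leaf predictions; D) = Φ(leaf predictions; D) − Θ(ŷT; D), where Φ is the benchmark (the pointwise minimum of the per-case losses across leaves, averaged over the cases in D) and Θ is the empirical average loss of the protocol output. Complementarity corresponds to Ψ > 0, i.e., the protocol's average loss falls strictly below the benchmark.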
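The recursive evaluation and the functional Ψ = Φ − Θ can be sketched in a few lines. This is a minimal illustration under stated assumptions: trees are encoded as nested pairs of leaf indices, the local rule m2 is taken to be equal-weight pooling, the loss is squared loss, and the data and names (`evaluate`, `complementarity`) are hypothetical rather than the paper's.

```python
def evaluate(tree, leaves, m2):
    # A tree is a leaf index (int) or a pair (left_subtree, right_subtree).
    if isinstance(tree, int):
        return leaves[tree]
    left, right = tree
    return m2(evaluate(left, leaves, m2), evaluate(right, leaves, m2))

def midpoint(u, v):
    # One possible local rule m2: equal-weight linear pooling.
    return [(a + b) / 2 for a, b in zip(u, v)]

def sq_loss(pred, truth):
    return (pred - truth) ** 2

def complementarity(tree, leaves, y, m2, loss=sq_loss):
    n = len(y)
    output = evaluate(tree, leaves, m2)
    # Phi: per-case minimum loss across leaves, then averaged over cases.
    phi = sum(min(loss(p[j], y[j]) for p in leaves) for j in range(n)) / n
    # Theta: average loss of the protocol output.
    theta = sum(loss(output[j], y[j]) for j in range(n)) / n
    return phi - theta

# Illustrative values only.
y = [1.0, 2.0, 3.0, 4.0]
leaves = [
    [2.0, 1.0, 4.0, 5.0],  # agent 0
    [0.0, 3.0, 2.0, 3.0],  # agent 1
    [3.0, 0.0, 5.0, 2.0],  # agent 2
]

psi_left = complementarity(((0, 1), 2), leaves, y, midpoint)
psi_right = complementarity((0, (1, 2)), leaves, y, midpoint)
print(psi_left, psi_right)  # 0.0 0.5625
```

With the same agents and the same local rule, the two tree shapes give different values because they induce different leaf weights: (1/4, 1/4, 1/2) for the left-nested tree and (1/2, 1/4, 1/4) for the right-nested one.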
Theoretical tools and proofs:
- Task-independent impossibility proofs for selector rules relative to the pointwise-min benchmark.
- Regression analysis under squared loss: algebraic derivation showing equivalence to Euclidean distance minimization; closed-form weight for N = 2 linear pooling and residual-correction geometric interpretation.
- Geometric/algebraic analysis of linear pooling across trees: barycentric coordinate charts, Tamari-cover reparameterizations, and pentagon identity.
- Classification impossibility proofs for endpoint-monotone losses (covering many Bregman and finite Bernoulli f-divergence losses, including cross-entropy).
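The barycentric-chart item in the list above can be illustrated by computing the leaf weights a tree induces under linear pooling, and by checking that a single rotation of the tree, of the kind that generates Tamari covers, can be compensated by changing the node weights. The encoding (each internal node stores the weight on its left child) and the reparameterization formula are a direct calculation made for this sketch, not the paper's notation, and the leaf labels are hypothetical.

```python
from fractions import Fraction as F

def leaf_weights(tree):
    # A tree is a leaf label (str) or a triple (weight_on_left, left, right).
    if isinstance(tree, str):
        return {tree: F(1)}
    t, left, right = tree
    weights = {}
    for subtree, scale in ((left, t), (right, 1 - t)):
        for leaf, w in leaf_weights(subtree).items():
            weights[leaf] = scale * w
    return weights

def rotate_right(tree):
    # ((a, b), c) with root weight t and inner weight s  ->  (a, (b, c)).
    # New node weights are chosen so every leaf keeps its overall weight.
    t, (s, a, b), c = tree
    t_new = t * s
    s_new = t * (1 - s) / (1 - t * s)  # requires t * s != 1
    return (t_new, a, (s_new, b, c))

left_nested = (F(1, 2), (F(1, 2), "human", "ai_1"), "ai_2")
right_nested = rotate_right(left_nested)

print(leaf_weights(left_nested))
print(leaf_weights(right_nested))
```

Both trees assign the weights 1/4, 1/4, and 1/2 to the three leaves, so they produce the same pooled prediction, the same loss, and hence the same complementarity value. The sketch covers only one rotation at the root; it does not verify the pentagon identity.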
Assumptions and limitations explicitly used:
- All agents act on the same labeled dataset and predict the same target.
- Protocols are binary trees (binary/pairwise local interaction).
- Benchmark is pointwise-min oracle; locality and internality properties of m2 are central to impossibility results.
- Results are formal/mathematical and need empirical evaluation for real-world settings.
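The role of internality in the classification obstruction can be seen on a toy binary example. The sketch below assumes binary cross-entropy, two agents, and made-up probabilities. It compares convex pooling (an internal rule, since its output stays between the two inputs) with one hypothetical parametrization of amplified logit pooling, in which the averaged logit is multiplied by a gain greater than 1; the gain value and function names are assumptions.

```python
import math

def bce(p, label):
    # Binary cross-entropy of predicted probability p against a 0/1 label.
    return -math.log(p) if label == 1 else -math.log(1.0 - p)

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def psi(p1, p2, output, y):
    n = len(y)
    phi = sum(min(bce(a, t), bce(b, t)) for a, b, t in zip(p1, p2, y)) / n
    theta = sum(bce(o, t) for o, t in zip(output, y)) / n
    return phi - theta

# Illustrative values: both agents lean the right way on every case.
y = [1, 1, 0, 0]
human = [0.7, 0.8, 0.3, 0.2]
ai = [0.8, 0.7, 0.2, 0.3]

# Internal rule: convex pooling over a grid of weights w in [0, 1].
best_internal = max(
    psi(human, ai, [w * a + (1 - w) * b for a, b in zip(human, ai)], y)
    for w in [k / 100 for k in range(101)]
)

# Non-internal rule: amplify the averaged logit (gain = 2 is an assumed value).
gain = 2.0
amplified = [sigmoid(gain * 0.5 * (logit(a) + logit(b))) for a, b in zip(human, ai)]
psi_amplified = psi(human, ai, amplified, y)

print(best_internal < 0, psi_amplified > 0)  # True True
```

On this data no convex weight reaches the pointwise-min benchmark, while the amplified rule pushes the output past the more confident input and exceeds it. The same amplification would increase the loss on cases where both agents lean the wrong way, which is one reason such rules raise calibration concerns.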

Implications for AI Economics

Evaluation standards and procurement
- Choice of benchmark affects whether multi-agent HAI is judged complementary. Using the stricter pointwise-min oracle makes complementarity harder to obtain; contracting, procurement, and regulatory evaluation should be explicit about whether they compare to aggregate- or casewise-best baselines.
- For high-stakes domains (medical diagnosis, safety-critical classification), if casewise best-available-standalone predictions are the appropriate baseline, many common composition/deferral schemes will not deliver complementary performance; procurement should require synthesizing aggregators or permit non-internal composition.
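The gap between the two baselines can be shown with a two-case example. This is a hedged sketch with assumed values and squared loss: a perfectly routed deferral scheme (always taking the better agent per case) looks complementary against the aggregate benchmark but only ties the pointwise-min oracle.

```python
def sq_loss(pred, truth):
    return (pred - truth) ** 2

def avg_loss(pred, y):
    return sum(sq_loss(p, t) for p, t in zip(pred, y)) / len(y)

# Illustrative values: each agent is exact on one case and far off on the other.
y = [0.0, 0.0]
human = [0.0, 2.0]
ai = [2.0, 0.0]

# Perfect routing: on each case, copy whichever agent has the lower loss.
routed = [min((h, a), key=lambda p, t=t: sq_loss(p, t))
          for h, a, t in zip(human, ai, y)]
theta = avg_loss(routed, y)

# Aggregate benchmark: best single agent after averaging over cases.
aggregate_benchmark = min(avg_loss(human, y), avg_loss(ai, y))
# Pointwise-min benchmark: best agent per case, then averaged.
pointwise_benchmark = sum(
    min(sq_loss(h, t), sq_loss(a, t)) for h, a, t in zip(human, ai, y)
) / len(y)

print(aggregate_benchmark - theta, pointwise_benchmark - theta)  # 2.0 0.0
```

An evaluation that does not state which baseline it uses could therefore report the same system as either complementary or not.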
Product design, markets, and competition
- Vendors of aggregators/ensemble systems can create value beyond simple selection/deferral only by using non-selector, synthesizing composition rules (especially important for classification markets where internal rules fail). This creates market opportunities for novel aggregator modules (e.g., learned pooling that can extrapolate beyond inputs).
- For regression-heavy industries (pricing, forecasting, demand estimation), multi-agent protocols with linear/non-selector pooling are more likely to yield measurable complementarity — implying higher returns to investing in workflow design and aggregation algorithms.
Labor and task allocation
- The formal obstruction to selector-based complementarity suggests that mere sequencing or allocation (i.e., better routing or deferral) may not produce gains unless interaction rules synthesize information. This affects the value of human labor complementarities with AI: training humans to produce corrective disagreement directions (residual-correcting behaviors) or enabling aggregators that combine signals can increase joint productivity.
- Compensation and incentives: firms should incentivize agents (human and AI-tool designers) to produce useful, non-redundant signals and to design aggregators that extract synergistic corrections rather than merely selecting among inputs.
Innovation incentives and R&D
- Theoretical positive results for regression and the algebraic structure (Tamari invariance) imply that investment in flexible, parameterized aggregation methods can be robust to workflow reordering; R&D may focus on learning barycentric weights or reparameterizations rather than redesigning entire workflows.
- In classification, the necessity of non-internal rules to escape impossibility suggests new research directions (and commercial value) for calibrated amplification or learned transforms that safely extend beyond convex hulls of inputs—this raises safety, calibration, and regulatory scrutiny concerns.
Policy and regulation
- Regulators evaluating human–AI systems should require transparent specification of aggregation rules and baselines (casewise vs aggregate). For sensitive classification tasks, ensuring complementarity may require approving non-standard aggregators and verifying that amplification does not induce harmful miscalibration or distributional harms.
Empirical and macroeconomic research directions
- Empirical HAI economics should distinguish regression vs classification domains when estimating complementarities; cross-sectional studies that mix tasks risk misinterpreting the prevalence of complementarity.
- Macro estimates of productivity gains from AI that assume generic human–AI complementarity may overstate gains in classification-heavy sectors; sectoral heterogeneity must be modeled.
- Future empirical work should test the pointwise-min oracle as a benchmark and evaluate whether practical aggregators can implement non-internal transforms safely and reliably.

Suggested next steps for economists and practitioners
- Empirically test the framework: compare aggregate vs pointwise-min benchmarks across domains; implement linear vs selector vs non-internal aggregators in laboratory and field settings.
- For procurement/contracts: specify the benchmark and require evidence that chosen aggregators can surpass the pointwise-min baseline, or accept the implication that selectors will not.
- For firms and vendors: invest in synthesizing aggregation methods and training human agents to produce residual-correcting signals (especially where tasks are regression-like).
- For policy: require disclosure of aggregation rules and baseline definitions; consider calibration and safety criteria when non-internal amplifying aggregators are deployed.

Overall, the paper provides a rigorous algebraic/geometric lens on when and how multi-agent HAI can be complementary. For AI economics, it highlights that complementarity is not automatic and depends on task type, benchmark choice, and the nature of local aggregation—insights that should shape measurement, contracting, product strategy, and regulation.

Assessment

Paper Type: theoretical
Evidence Strength: n/a. The contribution is purely theoretical: it proves formal results about a tree-based model of human–AI prediction composition rather than providing empirical or causal estimates, so empirical evidence strength is not applicable.
Methods Rigor: high. The paper develops a formal mathematical framework and proves multiple general theorems (existence/obstruction results, equivalences, closed-form solutions, structural invariances) under clearly stated assumptions, indicating strong mathematical rigor.
Sample: No empirical sample. The paper constructs a formal model: an ordered agent-role configuration with a rooted planar binary tree whose leaves are prediction vectors, a recursive local binary composition rule, and analysis relative to a pointwise-min oracle; results specialize to regression with squared loss, binary/multiclass classification under endpoint-monotone losses and various local aggregation families (selector-based, linear pooling, barycentric charts, Tamari reparameterizations).
Themes: human_ai_collab, productivity
Generalizability:
- Purely theoretical model; no empirical validation on real human–AI workflows or datasets.
- Results depend on specific technical assumptions (rooted planar binary trees, local binary composition rules, pointwise-min oracle benchmark).
- Positive results limited to regression with squared loss and particular aggregation rules (e.g., linear pooling); negative results hold under endpoint-monotone losses and certain aggregation constraints, so other loss/aggregation choices may evade the obstructions.
- Ignores practical considerations such as human cognitive limits, incentives, costs, temporal dynamics, feedback loops, and noisy or strategic behavior.
- Scaling and implementation issues for large N or rich prediction spaces are not empirically addressed.

Claims (6)

Claim	Category	Direction	Confidence	Outcome	Details
Selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality.	Output Quality	negative	high	complementarity (HAI performance relative to best member)	0.2
In regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector.	Output Quality	mixed	high	complementarity (as characterization via Euclidean distance)	0.2
For N = 2 in regression under squared loss, the optimal linear-pooling weight has a closed form and admits a residual-correction interpretation.	Task Allocation	positive	high	optimal pooled prediction weight (performance of 2-agent aggregation)	0.2
Under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for N = 4 these reparameterizations satisfy the pentagon identity.	Task Allocation	positive	high	structural invariances of aggregation protocols (preservation of complementarity under reparameterization)	0.2
In binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses (including standard Bregman and many finite Bernoulli f-divergence losses); an analogous obstruction holds for multiclass aggregation under cross-entropy.	Output Quality	negative	high	complementarity in classification aggregation	0.2
Overall, complementarity is attainable in multi-agent regression but obstructed in classification under natural conditions on local aggregation and loss functions.	Output Quality	mixed	high	attainability of complementarity across problem classes (regression vs classification)	0.2