Marrying formal argumentation with large language models could make AI decisions inspectable and contestable, creating demand for verifiable AI services in regulated sectors; at the same time it will shift experts toward adjudication and oversight and create new governance needs.

Argumentative Human-AI Decision-Making: Toward AI Agents That Reason With Us, Not For Us

Stylianos Loukas Vasileiou, Antonio Rago, Francesca Toni, William Yeoh · March 16, 2026

arXiv · theoretical · evidence: n/a · relevance: 7/10

Combining formal computational argumentation with LLMs can produce contestable, verifiable human–AI decision processes that improve trust and reshape firm offerings, expert roles, and regulatory needs in high‑stakes domains.

Computational argumentation offers formal frameworks for transparent, verifiable reasoning but has traditionally been limited by its reliance on domain-specific information and extensive feature engineering. In contrast, LLMs excel at processing unstructured text, yet their opaque nature makes their reasoning difficult to evaluate and trust. We argue that the convergence of these fields will lay the foundation for a new paradigm: Argumentative Human-AI Decision-Making. We analyze how the synergy of argumentation framework mining, argumentation framework synthesis, and argumentative reasoning enables agents that do not just justify decisions, but engage in dialectical processes where decisions are contestable and revisable -- reasoning with humans rather than for them. This convergence of computational argumentation and LLMs is essential for human-aware, trustworthy AI in high-stakes domains.

Summary

Main Finding

The paper argues that integrating computational argumentation with large language models (LLMs) creates a new paradigm—Argumentative Human-AI Decision‑Making—where AI agents do more than justify outputs: they participate in dialectical, contestable, and revisable decision processes with humans. This synergy promises human-aware, verifiable, and trustable AI for high‑stakes domains by combining formal argument structures with LLMs’ ability to mine and generate rich, contextual arguments from unstructured text.

Key Points

Motivation
- Computational argumentation: offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations), but has required heavy feature engineering and domain-specific knowledge.
- LLMs: excel at extracting and generating arguments from unstructured text but are opaque and hard to evaluate or trust.
Proposed convergence
- Argumentation Framework Mining: use LLMs and NLP pipelines to extract claims, premises, relations (attack/support), and provenance from text corpora.
- Argumentation Framework Synthesis: combine mined fragments into coherent formal argumentation frameworks (AFs) with explicit semantics, enabling verification and automated inference.
- Argumentative Reasoning & Interaction: run formal dialectical/acceptability semantics and dialogue protocols (contest, rebuttal, revision) enabling agents that reason with humans through structured debates and revisions.
Value-add
- Transparency and verifiability: structured AFs make chains of inference inspectable and machine-checkable.
- Contestability and revision: decisions are not final outputs but subject to dialectical challenge and update, increasing robustness and trust.
- Human‑AI collaboration: supports collaborative reasoning (“with” humans) rather than opaque automation “for” humans, improving uptake in high‑stakes settings.
Challenges
- Faithful extraction: aligning LLM-extracted arguments with formal AF primitives and ensuring fidelity to source evidence.
- Robustness & adversarial manipulation: AFs and LLMs may be gamed or misled; incentives may drive strategic argumentation.
- Evaluation: developing metrics and benchmarks for argument quality, fidelity, contestability, and human trust.
- Integration costs: deployment requires domain modeling, human-in-the-loop protocols, and regulatory/liability frameworks.
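
The "Argumentative Reasoning & Interaction" step above can be illustrated with a minimal sketch. It assumes the simplest formal setting, a Dung-style abstract framework with attack relations only, and uses grounded semantics as one possible acceptability semantics; the paper does not prescribe this choice. The function name `grounded_extension` and the loan-review arguments are hypothetical, invented for illustration.

```python
from typing import FrozenSet, Set, Tuple

Attack = Tuple[str, str]  # (attacker, target)


def grounded_extension(arguments: Set[str], attacks: Set[Attack]) -> FrozenSet[str]:
    """Return the grounded extension of an abstract argumentation framework.

    Iterates Dung's characteristic function from the empty set until it
    reaches its least fixed point.
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted: Set[str] = set()
    while True:
        # An argument is defended if every one of its attackers is itself
        # attacked by some argument that is already accepted.
        defended = {
            a
            for a in arguments
            if all(any((d, b) in attacks for d in accepted) for b in attackers[a])
        }
        if defended == accepted:
            return frozenset(accepted)
        accepted = defended


# Hypothetical loan-review framework (all names invented for illustration).
arguments = {"approve_loan", "high_debt", "debt_refinanced"}
attacks = {("high_debt", "approve_loan"), ("debt_refinanced", "high_debt")}
before = grounded_extension(arguments, attacks)

# A human reviewer contests the refinancing claim by adding a counterargument.
arguments_revised = arguments | {"refinancing_unverified"}
attacks_revised = attacks | {("refinancing_unverified", "debt_refinanced")}
after = grounded_extension(arguments_revised, attacks_revised)

print(sorted(before))
print(sorted(after))
```

Before the challenge, `approve_loan` is accepted because its only attacker is defeated; after the reviewer's counterargument is added, it no longer is. This is the sense in which a decision is contestable and revisable: the outcome is recomputed from an inspectable structure rather than asserted.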

Data & Methods

(Conceptual / design-oriented; empirical work would instantiate these components)

Data sources
- Domain corpora with argumentative content: legal opinions, clinical notes and guidelines, policy reports, regulatory filings, parliamentary debates, curated debate corpora (e.g., UKP, Persuasive Essays, IBM Debater-type collections).
- Expert annotations: labelled claims, premises, stances, attack/support relations and provenance for supervised or weakly supervised training.
Extraction & representation methods
- Fine-tuning or prompting LLMs for argument mining (claim detection, premise identification, relation classification).
- Information‑extraction pipelines to attach provenance, uncertainty, and source metadata.
- Structured representations: convert extracted elements into formal AF primitives (nodes = arguments/claims; edges = attack/support; weights or probabilities for strength).
Synthesis & verification
- Algorithms for merging fragmentary arguments into coherent AFs (graph synthesis, resolution of conflicts, canonicalization).
- Formal semantics: Dung-style acceptability, bipolar AFs, weighted/probabilistic AFs; model checking to verify logical consistency and provenance constraints.
- Hybrid systems: symbolic reasoning layers over LLM-derived content for constraints, counterfactual checks, and rule compliance.
Interaction & evaluation protocols
- Dialectical dialogue protocols: structured challenge/response cycles, revision operators, and adjudication mechanisms (human adjudicators or automated semantics).
- Evaluation metrics: task performance, fidelity (how accurately AF reflects source), argument quality (coherence, relevance, completeness), contestability (ability to find justified counterarguments), revision responsiveness, human trust and reliance, calibration of uncertainty.
- Experimental designs: benchmark tasks, human-subject studies in domain-specific simulations, adversarial stress tests, longitudinal deployment case studies.
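
One way the fidelity metric mentioned above could be made concrete is edge-level precision, recall, and F1 of extracted attack/support relations against expert annotations. This is a sketch under that assumption only; the paper does not define the metric. The `Relation` type, the `relation_fidelity` function, and the claim identifiers are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class Relation(NamedTuple):
    source: str  # id of the attacking or supporting claim
    target: str  # id of the claim being attacked or supported
    kind: str    # "attack" or "support"


def relation_fidelity(extracted: Set[Relation], gold: Set[Relation]) -> Dict[str, float]:
    """Score extracted relations against expert-annotated (gold) relations.

    A relation counts as correct only if source, target, and kind all match.
    """
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical expert annotations for one document.
gold = {
    Relation("c2", "c1", "attack"),
    Relation("c3", "c2", "attack"),
    Relation("c4", "c1", "support"),
}

# Hypothetical LLM output: one mislabeled relation and one spurious relation.
extracted = {
    Relation("c2", "c1", "attack"),
    Relation("c3", "c2", "support"),  # wrong kind
    Relation("c4", "c1", "support"),
    Relation("c5", "c1", "attack"),   # not in the gold annotations
}

scores = relation_fidelity(extracted, gold)
print(scores)
```

Exact matching on the full triple is deliberately strict: a relation with the right endpoints but the wrong polarity would change which arguments are accepted downstream, so it is counted as an error rather than a partial match.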

Implications for AI Economics

Adoption & market structure
- Demand shift toward AI systems that provide verifiable, contestable reasoning in regulated/high‑stakes sectors (healthcare, law, finance, public policy).
- Competitive advantage for firms offering argumentatively transparent AI—premium pricing for verifiability and auditability.
- Emergence of new service layers: argumentation-as-a-service, audit firms, explanation certification, and human-in-the-loop orchestration platforms.
Labor, productivity & complementarities
- Human experts shift from sole decision-makers to adjudicators, challengers, and validators of AI-generated arguments—changing skill demands toward critical evaluation and dialectical oversight.
- Potential productivity gains from faster evidence synthesis and argument exploration, but complementary tasks (validation, stakeholder engagement) create new labor demand.
Information asymmetry & transaction costs
- Structured AFs reduce information asymmetry by making reasoning traceable, lowering search and verification costs in transactions and contracting.
- Better contestability can reduce costly litigation and regulatory friction if decisions are transparently defensible.
Incentives, strategic behavior & regulation
- New incentives for strategic argument construction (gaming, persuasion without fidelity) suggest a need for governance: standards for provenance, certification, and liability rules.
- Regulators may prefer systems that support contestability and audit trails—potentially mandating argumentation-style explainability in certain sectors.
Welfare and risk
- Welfare gains from improved decision quality and trust in automation, particularly where human oversight is required.
- Risk of manipulation and misinformation if argument mining/synthesis is unregulated or misaligned with social incentives; externalities may justify public intervention (standards, audits, liability frameworks).
Research & policy agenda for economists
- Quantify value of contestable explanations: willingness-to-pay for verifiable reasoning vs. opaque predictive performance.
- Study labor market impacts: reallocation of tasks, wage effects for validators/adjudicators.
- Design mechanisms and contracts that align incentives for truthful argument provision and penalize strategic misrepresentation.
- Evaluate regulatory interventions: certification regimes, liability assignment, mandatory audit trails in high-stakes domains.

Overall, integrating computational argumentation with LLM capabilities creates economically significant opportunities (trustworthy, auditable AI services) and risks (strategic manipulation, new regulatory needs). For AI economics, the key questions are how these systems affect incentives, market structure, labor complementarity, and the design of policies that extract social value while containing potential harms.

Assessment

Paper Type: theoretical
Evidence Strength: n/a. Conceptual/design paper: proposes a paradigm and implementation components but presents no empirical tests or causal estimates to evaluate impacts.
Methods Rigor: n/a. The manuscript outlines architectures, pipelines, and evaluation protocols but does not implement, validate, or benchmark them; methodological rigor cannot be assessed without empirical instantiation.
Sample: No empirical sample; the paper is conceptual and recommends potential data sources for future work (legal opinions, clinical notes and guidelines, policy reports, regulatory filings, parliamentary debates, curated debate corpora like UKP/Persuasive Essays/IBM Debater, plus expert-annotated labels for claims, premises, relations and provenance).
Themes: human_ai_collab, governance, labor_markets, adoption, productivity
Generalizability
- Conceptual only: claims are not empirically validated and may not hold once implemented.
- Domain specificity: recommended corpora (law, medicine, policy) differ in style and stakes; methods may not transfer to informal or low-quality text sources.
- Dependence on LLM capabilities: effectiveness hinges on current/future LLMs' fidelity in argument extraction and resistance to hallucination.
- Resource and annotation costs: practical deployment requires expensive expert annotation and domain modeling, limiting applicability to well-resourced organizations.
- Regulatory and institutional variation: legal/regulatory contexts differ across jurisdictions, constraining standardized adoption.
- Adversarial and strategic behavior risks: contestability can be gamed, reducing generalizability of welfare claims unless governance is in place.

Claims (26)

Claim	Category	Direction	Confidence	Outcome	Details
Integrating computational argumentation with large language models (LLMs) creates a new paradigm, Argumentative Human-AI Decision‑Making, where AI agents participate in dialectical, contestable, and revisable decision processes with humans.	Decision Quality	positive	medium	degree of human-AI dialectical participation (ability to engage in contestable, revisable decision processes)	0.01
Combining formal argument structures with LLMs’ ability to mine and generate rich, contextual arguments from unstructured text promises human-aware, verifiable, and trustable AI for high‑stakes domains.	AI Safety and Ethics	positive	medium	trustworthiness/verifiability of AI outputs in high-stakes decision contexts	0.01
Computational argumentation offers formal, verifiable reasoning representations (argumentation frameworks, attack/support relations).	AI Safety and Ethics	positive	high	existence and machine-checkability of formal inferential chains (inspectability/verifiability)	0.02
Computational argumentation approaches have required heavy feature engineering and domain-specific knowledge to be effective.	Other	negative	high	engineering cost / domain modeling effort required for AF-based systems	0.02
LLMs excel at extracting and generating arguments from unstructured text but are opaque and hard to evaluate or trust.	AI Safety and Ethics	mixed	high	argument extraction/generation performance and model interpretability/trustworthiness	0.02
Argumentation Framework Mining: LLMs and NLP pipelines can be used to extract claims, premises, relations (attack/support), and provenance from text corpora.	Output Quality	positive	medium	accuracy/fidelity of extracted argument elements (claims, premises, relations, provenance)	0.01
Argumentation Framework Synthesis: mined fragments can be combined into coherent formal argumentation frameworks (AFs) with explicit semantics enabling verification and automated inference.	Output Quality	positive	medium	coherence and correctness of synthesized AFs and verifiability of derived inferences	0.01
Running formal dialectical/acceptability semantics and dialogue protocols over AFs enables agents that reason with humans through structured debates and revisions.	Decision Quality	positive	medium	capacity for structured debate/revision (dialogue performance, acceptability outcomes)	0.01
Structured argumentation frameworks make chains of inference inspectable and machine-checkable, improving transparency and verifiability of AI outputs.	AI Safety and Ethics	positive	high	inspectability/traceability of inference chains (auditability)	0.02
Framing decisions as contestable and revisable (via dialectical challenge and update) increases robustness and trust in AI-supported decision-making.	Decision Quality	positive	medium	measures of robustness (resilience to error) and human trust in decisions	0.01
This approach supports collaborative reasoning ('with' humans) rather than opaque automation 'for' humans, improving uptake in high‑stakes settings.	Adoption Rate	positive	medium	human adoption/uplift in uptake for high-stakes decision systems	0.01
Faithful extraction (aligning LLM-extracted arguments with formal AF primitives and ensuring fidelity to source evidence) is a key technical challenge.	Error Rate	negative	high	fidelity/alignment error rate between extracted elements and source evidence	0.02
AFs and LLMs may be gamed or misled; adversaries may exploit systems leading to strategic argumentation or manipulation.	AI Safety and Ethics	negative	high	system vulnerability metrics / susceptibility to adversarial manipulation	0.02
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.	Research Productivity	null_result	high	availability and maturity of evaluation metrics and benchmarks	0.02
Integration costs (domain modeling, human-in-the-loop protocols, and regulatory/liability frameworks) are significant barriers to deployment.	Organizational Efficiency	negative	high	implementation cost and organizational burden for deploying argumentative AI systems	0.02
Demand will shift toward AI systems that provide verifiable, contestable reasoning in regulated/high‑stakes sectors (healthcare, law, finance, public policy).	Adoption Rate	positive	medium	market demand share for verifiable/contestable AI systems in regulated sectors	0.01
Firms offering argumentatively transparent AI can obtain competitive advantage and charge premium prices for verifiability and auditability.	Firm Revenue	positive	medium	price premium and competitive advantage metrics for transparent-AI providers	0.01
New service layers may emerge (argumentation-as-a-service, audit firms, explanation certification, human-in-the-loop orchestration platforms).	Market Structure	positive	low	emergence and market size of new service verticals around argumentative AI	0.01
Human experts will likely shift roles from sole decision-makers to adjudicators, challengers, and validators of AI-generated arguments, changing required skills toward critical evaluation and dialectical oversight.	Skill Acquisition	mixed	medium	changes in job tasks, skill demand, and employment shares for expert validators/adjudicators	0.01
Structured AFs can reduce information asymmetry by making reasoning traceable, thereby lowering search and verification costs in transactions and contracting.	Organizational Efficiency	positive	medium	reduction in transaction/search/verification costs attributable to traceable AFs	0.01
Better contestability may reduce litigation and regulatory frictions if decisions are transparently defensible.	Regulatory Compliance	positive	low	frequency/cost of litigation and regulatory disputes post-adoption of contestable AI systems	0.01
The possibility of strategic argument construction (gaming) motivates governance needs: standards for provenance, certification, and liability rules.	Governance and Regulation	neutral	medium	existence and effectiveness of governance mechanisms (standards, certification, liability) addressing strategic manipulation	0.01
Regulators may prefer systems that support contestability and audit trails and could mandate argumentation-style explainability in certain sectors.	Governance and Regulation	positive	low	regulatory adoption rate of contestability/audit-trail requirements	0.01
There are potential welfare gains from improved decision quality and trust in automation, particularly where human oversight remains required.	Consumer Welfare	positive	medium	welfare indicators (decision quality gains, trust levels, social surplus) from argumentative AI	0.01
There is a risk of manipulation and misinformation if argument mining/synthesis is unregulated or misaligned with social incentives, creating externalities that may justify public intervention.	AI Safety and Ethics	negative	medium	incidence of manipulation/misinformation attributable to argument-mining/synthesis systems	0.01
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.	Research Productivity	null_result	high	existence and prioritization of empirical research on WTP, labor impacts, mechanism design, and regulation evaluations	0.02