An AI that challenges users’ reasoning improves judgment: when the system generates counterfactual critiques of human rationales, people rely less on the AI and make more accurate house-price predictions, though they report higher cognitive effort; the gains are strongest for participants comfortable with AI.

Understanding the Effects of AI-Assisted Critical Thinking on Human-AI Decision Making

Harry Yizhou Tian, H. Amin, Ming Yin · Fetched June 06, 2026 · Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems

Source: Semantic Scholar · Paper type: quasi_experimental · Evidence strength: medium · Relevance: 7/10 · Full text: usable (extracted)

Structured author observations

Linked only from stored provider relations; the raw author line above is never matched by name.

OpenAlex

Latest observation: July 23, 2026

Harry Yizhou Tian (provider ID)
Hasan Amin (provider ID)
Ming Yin (exact ORCID)

Semantic Scholar

Latest observation: July 23, 2026

Harry Yizhou Tian (provider ID)
H. Amin (provider ID)
Ming Yin (provider ID)

AACT — an AI that critiques users' own decision rationales via counterfactual analysis — reduced over-reliance and improved accuracy in a house-price prediction case study but increased reported cognitive load and benefited some subgroups (e.g., users familiar with AI).

Citation observations

Cumulative provider counts captured on specific dates; providers are never combined.

2 cumulative citations

OpenAlex · Observed July 22, 2026


1 cumulative citation

Semantic Scholar · Observed July 22, 2026


Despite the growing prevalence of human-AI decision making, the human-AI team’s decision performance often remains suboptimal, partially due to insufficient examination of humans’ own reasoning. In this paper, we explore designing AI systems that directly analyze humans’ decision rationales and encourage critical reflection of their own decisions. We introduce the AI-Assisted Critical Thinking (AACT) framework, which leverages a domain-specific AI model’s counterfactual analysis of human decision to help decision-makers identify potential flaws in their decision argument and support the correction of them. Through a case study on house price prediction, we find that AACT outperforms traditional AI-based decision-support in reducing over-reliance on AI, though also triggering higher cognitive load. Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies. We conclude by discussing the practical implications of our findings, use cases and design choices of AACT, and considerations for using AI to facilitate critical thinking.

Summary

Main Finding

AI-Assisted Critical Thinking (AACT)—a domain-specific AI that elicits a user’s decision rationale, adopts that perspective, and performs counterfactual analyses to critique and suggest corrections—reduces human over-reliance on AI compared to standard XAI and hypothesis-driven XAI. This benefit comes with higher cognitive load and the risk of increased under-reliance. Effects are heterogeneous: AACT is especially helpful for users who are more familiar with AI, more domain-experienced, or more educated.

Key Points

Motivation
- Many human–AI decision systems fail to achieve complementarity because humans engage superficially with AI outputs and rely inappropriately (over- or under-reliance).
- Existing reflection-oriented interventions focus on AI outputs (timing, framing, alternative AI hypotheses) rather than analyzing and improving the human’s own reasoning.
AACT framework
- Builds on the Recognition/Metacognition (R/M) model: Recognize → Critique → Correct → Quick-test.
- Elicits the human’s decision argument (their evidence→conclusion mapping).
- Uses a domain-specific model to adopt the human perspective and run counterfactual analyses against that argument, identifying three critique types: incompleteness (missing information), unreliability (weak/assumed links), and conflict (competing arguments).
- Supports correction via (1) AI-based revision suggestions and (2) data-based triangulation (insights from the AI’s training data as external checks).
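
The "incompleteness" critique described above can be illustrated with a minimal sketch. This is not the authors' implementation: the linear price model, its coefficients, the perturbation sizes, and the 5% threshold are all hypothetical values chosen for illustration. The idea is to perturb each feature the user did not cite in their rationale and flag those whose counterfactual change would shift the model's prediction noticeably.

```python
# Hypothetical linear price model; coefficients are invented for illustration
# and are not taken from the paper.
COEFFS = {"sqft": 150.0, "bedrooms": 10_000.0, "age_years": -1_000.0}
INTERCEPT = 50_000.0


def predict_price(house):
    """Predict a price from a dict of feature values."""
    return INTERCEPT + sum(COEFFS[f] * house[f] for f in COEFFS)


def incompleteness_critiques(house, cited_features, deltas, threshold=0.05):
    """Flag uncited features whose counterfactual change moves the
    prediction by more than `threshold` (as a fraction of the base price)."""
    base = predict_price(house)
    critiques = []
    for feature, delta in deltas.items():
        if feature in cited_features:
            continue  # the user's rationale already accounts for this feature
        counterfactual = dict(house, **{feature: house[feature] + delta})
        shift = (predict_price(counterfactual) - base) / base
        if abs(shift) > threshold:
            critiques.append((feature, round(shift, 3)))
    return critiques


house = {"sqft": 2000, "bedrooms": 3, "age_years": 30}
# Assumed scenario: the user's rationale cites only square footage.
critiques = incompleteness_critiques(
    house,
    cited_features={"sqft"},
    deltas={"sqft": 200, "bedrooms": 1, "age_years": 20},
)
print(critiques)
```

With these made-up numbers, only the house's age is flagged: making it 20 years older would lower the predicted price by more than 5%, so a critique question could ask whether the user considered age. A real system would replace the toy model with the deployed domain-specific model.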
Empirical results (high-level)
- Controlled online study (house price prediction task) comparing AACT to XAI and hypothesis-driven XAI.
- Primary outcomes: decision accuracy, appropriate reliance on AI, task learning, and subjective perceptions.
- AACT reduced over-reliance more than traditional approaches, but increased reported mental demand and sometimes led to under-reliance.
- Subgroup analyses: AACT’s anti-over-reliance effects were stronger for participants with higher domain familiarity, greater AI familiarity, and higher education. Some participants preferred AACT for greater autonomy and reflective engagement.
Trade-offs & design considerations
- AACT shifts AI’s role toward a “thought partner” that interrogates human reasoning rather than merely explaining model outputs.
- Requires eliciting user rationales and deploying a domain-grounded model (rather than generic LLM-only feedback).
- Designers must balance promoting reflection vs. raising cognitive load and the risk of under-trusting AI.

Data & Methods

Instantiation
- AACT implemented as a conversational AI system that: (a) collects the user’s decision rationale, (b) performs counterfactual analyses grounded in a domain-specific model, (c) surfaces targeted critique questions, and (d) offers correction suggestions plus data triangulation from training data.
Experimental design
- Controlled between-condition user study recruited via Prolific.
- Task domain: house price prediction (participants form a price estimate and provide rationale; then interact with one of the interfaces).
- Comparison arms: AACT vs. standard Explainable AI (XAI) vs. Hypothesis-driven XAI.
- Outcome measures: decision accuracy (quality of final decisions), reliance metrics (frequency of accepting AI advice when incorrect = over-reliance, and rejecting correct AI advice = under-reliance), task learning, and subjective measures including cognitive load (mental demand) and preference.
- Subgroup analyses by participant characteristics (domain familiarity, AI familiarity, education).
Key empirical findings (qualitative summary)
- AACT reduces the rate at which participants follow incorrect AI predictions (reduced over-reliance) compared to the other interfaces.
- AACT users reported higher mental demand and sometimes distrusted AI more (leading to under-reliance).
- Heterogeneous treatment effects suggest targeted benefits for certain user segments.
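
The reliance metrics defined under the outcome measures can be sketched as follows. The exact operationalization used in the paper is not given in this summary, so the code assumes each trial has already been labeled with whether the AI's advice was correct and whether the participant's final decision followed it; for a continuous price task, both labels would need a tolerance rule that is not specified here. The trial data are invented.

```python
def reliance_rates(trials):
    """Compute (over_reliance, under_reliance) from labeled trials.

    trials: iterable of (ai_correct, followed_ai) boolean pairs.
    over_reliance  = share of AI-incorrect trials where the AI was followed.
    under_reliance = share of AI-correct trials where the AI was rejected.
    Returns None for a rate whose denominator is empty.
    """
    followed_when_wrong = [followed for ai_ok, followed in trials if not ai_ok]
    followed_when_right = [followed for ai_ok, followed in trials if ai_ok]
    over = (sum(followed_when_wrong) / len(followed_when_wrong)
            if followed_when_wrong else None)
    under = (sum(not f for f in followed_when_right) / len(followed_when_right)
             if followed_when_right else None)
    return over, under


# Invented example data: (ai_correct, followed_ai)
trials = [
    (True, True), (True, False), (True, True), (True, True),
    (False, True), (False, False), (False, False),
]
over, under = reliance_rates(trials)
print(f"over-reliance={over:.2f}, under-reliance={under:.2f}")
```

Reporting the two rates separately matters for the trade-off noted above: an interface can lower over-reliance while raising under-reliance, and a single accuracy figure would hide that shift.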

Implications for AI Economics

Productivity and complementarity
- AACT-type systems can increase effective human-AI complementarity by improving human judgment calibration (fewer costly errors from over-reliance). This can raise the value of AI deployment in tasks where human oversight is needed.
- However, higher cognitive load and potential under-reliance imply time-cost trade-offs—decision throughput may fall even if accuracy improves. Economic assessments should model both error-cost savings and increased decision-time costs.
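
A back-of-the-envelope version of that trade-off is sketched below, assuming a simple per-decision model in which the benefit is the reduction in error rate times the cost of an error and the cost is the extra decision time valued at an hourly wage. The function name and every number are hypothetical; the paper summary reports no such magnitudes.

```python
def net_value_per_decision(error_rate_baseline, error_rate_aact,
                           cost_per_error, extra_minutes, hourly_wage):
    """Expected net gain per decision: error-cost savings minus time cost."""
    error_savings = (error_rate_baseline - error_rate_aact) * cost_per_error
    time_cost = extra_minutes / 60 * hourly_wage
    return error_savings - time_cost


# All inputs are placeholders, not estimates from the study.
value = net_value_per_decision(
    error_rate_baseline=0.20,  # assumed error rate with standard decision support
    error_rate_aact=0.15,      # assumed error rate with AACT
    cost_per_error=1000.0,     # assumed cost of one wrong decision
    extra_minutes=3.0,         # assumed extra time per decision
    hourly_wage=40.0,          # assumed cost of the decision-maker's time
)
print(f"net value per decision: {value:.2f}")
```

Under these placeholder inputs the net value is positive (roughly 48 per decision), but the sign flips when errors are cheap or the extra time is large, which is why both terms need to be measured before deployment.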
Labor demand and skill-biased effects
- Heterogeneous benefits concentrated among users familiar with AI and domain knowledge suggest AACT could amplify returns to human skills (education, domain expertise, AI literacy). This implies potential skill-biased technological change: more skilled workers gain larger productivity improvements.
- For less experienced workers, AACT may impose cognitive burdens and be less effective; retraining or interface personalization could be necessary.
Deployment and firm-level strategy
- Firms should consider targeted deployment of AACT in roles where decisions are high-stakes and staff have sufficient domain/AI literacy, or invest in training to realize AACT’s benefits.
- Cost-benefit analyses need to include implementation costs (domain-specific model development, rationale elicitation workflows), ongoing maintenance (data-triangulation pipelines), and user training.
Market design and regulation
- In regulated/high-stakes domains (finance, medical, legal), adopting AACT could reduce regulatory risk by producing more auditable human rationales and explicit critique/correction trails—valuable for compliance and liability management.
- Regulators and procurement agents should evaluate both accuracy improvements and cognitive/behavioral effects (e.g., potential for under-trust).
Research & policy directions
- Quantify macro-level impacts: estimate how reductions in over-reliance translate into economic gains across sectors (e.g., reduced diagnostic errors, fewer mispriced assets).
- Study personalization economics: when does tailoring AACT to user skill pay off? What pricing or adoption models make sense for firms vs. individuals?
- Consider distributional effects: design policy or training programs to ensure lower-skill workers are not left behind by systems that principally benefit skilled users.

Limitations to keep in mind
- Evidence is from a single case study (house prices) with an online Prolific sample; external validity to medical, legal, or operational settings requires further testing.
- Paper reports high-level behavioral patterns; deployment economics will hinge on precise magnitudes (time costs, error costs) not provided here.
- AACT requires a reliable domain-specific model and data access for triangulation; feasibility varies by application.


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper provides experimental-style evidence showing AACT improves decision behavior in a controlled case study, which supports causal interpretation, but the result is limited to one domain (house-price prediction), likely a modest sample and lab-style tasks, and faces external validity and demand-characteristic concerns.

Methods Rigor: medium. Design appears to be an experimental comparison with subgroup analysis and multiple outcome measures; however, the summary omits key methodological details (randomization procedure, sample size and representativeness, pre-registration, robustness checks, statistical controls, multiple hypothesis adjustments), so rigor cannot be judged high.

Sample: Human participants performed house-price prediction tasks assisted by a domain-specific AI that generates counterfactual analyses of participants' rationales (AACT) versus a conventional AI decision-support interface; housing feature data and ground-truth prices were used to evaluate accuracy. The summary does not specify the sample size or whether participants had relevant domain expertise; the study description on this page indicates recruitment via Prolific.

Themes: human_ai_collab, productivity, skills_training

Identification: Between-condition comparison. Participants completed house-price prediction tasks under the AACT interface versus a traditional AI decision-support interface; causal claims rest on the experimental manipulation (assignment to interface condition) and comparison of downstream outcomes (decision accuracy, over-reliance, cognitive load). The provided summary does not report whether assignment was fully randomized or pre-registered.

Generalizability
- Single domain (house price prediction); may not transfer to other decision domains or high-stakes settings.
- Likely lab/short-term task with limited ecological validity compared with real-world, repeated decision contexts.
- Participant pool likely non-representative (online workers), which limits applicability to professional decision-makers.
- Relies on a particular AI model and UI design; effects may differ with other model quality, explanation formats, or deployment constraints.
- Increased cognitive load trade-off may reduce adoption in time- or attention-constrained real-world settings.

Claims (5)

Claim	Category	Direction	Outcome	Confidence & Evidence	Details
We introduce the AI-Assisted Critical Thinking (AACT) framework, which leverages a domain-specific AI model’s counterfactual analysis of human decision to help decision-makers identify potential flaws in their decision argument and support the correction of them.	Decision Quality	positive	ability to identify and correct flaws in decision arguments	Reading fidelity: high; Study strength: speculative	not reported 0.08
Through a case study on house price prediction, we find that AACT outperforms traditional AI-based decision-support in reducing over-reliance on AI.	Automation Exposure	positive	over-reliance on AI	Reading fidelity: high; Study strength: medium	not reported 0.48
AACT also triggers higher cognitive load.	Worker Satisfaction	negative	cognitive load	Reading fidelity: high; Study strength: medium	not reported 0.48
Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies.	Decision Quality	positive	decision improvement for users familiar with AI (reduced over-reliance / improved decision performance)	Reading fidelity: medium; Study strength: medium	not reported 0.29
Despite the growing prevalence of human-AI decision making, the human-AI team’s decision performance often remains suboptimal, partially due to insufficient examination of humans’ own reasoning.	Decision Quality	negative	human-AI team decision performance	Reading fidelity: high; Study strength: speculative	not reported 0.08