The Commonplace

Forcing workers to use generative AI together reduced both document output and average quality, while a short training that framed AI as a thought partner improved quality among the best performers. A session-timing confound, differential dropout, and automated-grading quirks mean the results should be interpreted cautiously.

Scaffolding Human-AI Collaboration: A Field Experiment on Behavioral Protocols and Cognitive Reframing
Alex Farach, Alexia Cambon, Lev Tankelevitch, Connie Hsueh, Rebecca Janssen · April 09, 2026
arXiv · RCT · medium evidence · 8/10 relevance
In a field RCT with 388 retail employees, mandating structured joint AI use reduced document quantity and quality, while brief cognitive partnership training modestly raised top-end individual document quality, though results are weakened by session confounds, attrition, and grading sensitivity.

Organizations have widely deployed generative AI tools, yet productivity gains remain uneven, suggesting that how people use AI matters as much as whether they have access. We conducted a field experiment with 388 employees at a Fortune 500 retailer to test two scaffolding interventions for human-AI collaboration. All participants had access to the same AI tool; we varied only the structure surrounding its use. A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use and substantially lower document production. A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. Treatment participants also showed greater positive belief change across the session, though sensitivity analyses suggest this likely reflects recovery from carry-over effects rather than genuine training-induced shifts. Both findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.

Summary

Main Finding

A field experiment with 388 employees at a Fortune 500 retailer shows that the structure around AI use matters. A behavioral scaffold that mandated synchronous, joint AI use (the “Create-Out-Loud” protocol) reduced pair document production and lowered average pair document quality versus naturalistic use. A cognitive scaffold (partnership training that reframed AI as a “thought partner”) improved individual document quality at the top of the distribution, but did not produce a large average uplift. Reported positive belief changes for treatment participants likely reflect recovery from carry-over effects rather than durable training-induced belief shifts. Results are qualified by design and measurement limitations (notably an AM/PM session confound, differential attrition, and sensitivity of LLM grading to document length).

Key Points

  • Sample and design

    • N = 388 full-time employees from Gap Inc., organized into 194 pairs (97 pairs per randomized arm).
    • Two sequential tasks in one day:
      • Task A (pair task, 30 minutes): one-page AI Adoption Action Plan (anti-generic constraint).
      • Task B (individual task): strategic communications responses to three AI-related stakeholder concerns.
    • Both arms had identical access to Microsoft Copilot; randomization assigned pairs to:
      • Control: naturalistic AI use (Task A) + basic Copilot training (Task B).
      • Treatment: structured “Create-Out-Loud” joint protocol (Task A) + partnership (thought-partner) training (Task B).
    • Important design constraint: AM session = Control, PM session = Treatment (session timing confounded with treatment).
  • Primary outcomes

    • Document quality graded with GPT-4o-mini (LLM-as-judge), with rubrics:
      • Task A max = 22 points (four dimensions).
      • Task B max = 20 points (four dimensions).
    • Human validation on a stratified sample; cross-model validation (GPT-4o-mini vs GPT-4o) showed high agreement (Pearson r = 0.92, ICC = 0.92).
  • Main empirical results

    • Behavioral scaffolding (structured joint protocol):
      • Associated with substantially lower document production (pairs less likely to produce a document).
      • Among documents produced, average pair document quality was lower than for naturalistic pairs.
      • Compliance was heterogeneous: many treatment pairs faced logistical/technical barriers; post-hoc compliance groups included “Stranded,” “Parallel Play,” and “True Joint” (only a minority achieved True Joint).
    • Cognitive scaffolding (partnership training):
      • Associated with higher individual document quality at the top of the distribution (improved upper-tail performance), but limited average effect.
    • Belief change:
      • Treatment participants showed greater positive belief movement across the session on several measures, but sensitivity/specification checks suggest this likely reflects recovery from Task A carry-over differences rather than a robust training-induced shift.
  • Robustness and sensitivity

    • Analytic approach: intent-to-treat (ITT) OLS with HC2 robust SEs for pair-level analyses; CR2 clustered SEs for individual-level outcomes; adjustment for covariates.
    • Addressed differential attrition with Lee (2009) trimming bounds and modeled non-production as an outcome.
    • Assessed LLM grading sensitivity to document length via Oster bounds and causal mediation (ACME/ADE) decomposition.
    • Calibrated plausible session/circadian effects against the literature, because treatment is collinear with the AM/PM session.
  • Key limitations (reported by authors)

    • AM/PM session confound (treatment = PM) threatens causal interpretation if time-of-day affects performance.
    • Differential attrition / non-production across arms.
    • LLM grading can be sensitive to document length; grading validity was checked but remains a possible source of bias.
    • Compliance with the behavioral protocol was uneven and classified post hoc (selection after randomization).
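
The trimming idea behind the Lee (2009) bounds mentioned above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it trims a rounded count of observations instead of an exact quantile and ignores covariates. The function name `lee_bounds` and all numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def lee_bounds(treated, control, n_treated, n_control):
    """Return (lower, upper) bounds on the treated-minus-control mean difference.

    treated / control: observed outcomes (e.g., quality scores of produced documents).
    n_treated / n_control: number of units assigned to each arm, including
    units with no observed outcome.
    """
    if not treated or not control:
        raise ValueError("each arm needs at least one observed outcome")
    q_t = len(treated) / n_treated  # share of treated units with an outcome
    q_c = len(control) / n_control  # share of control units with an outcome
    if q_t == q_c:
        diff = mean(treated) - mean(control)
        return diff, diff
    if q_c > q_t:
        # Control has the higher response rate, so control is trimmed.
        share = (q_c - q_t) / q_c
        k = round(share * len(control))  # simplification: rounded trim count
        s = sorted(control)
        lowest_control = mean(s[: len(s) - k])  # drop the k best outcomes
        highest_control = mean(s[k:])           # drop the k worst outcomes
        return mean(treated) - highest_control, mean(treated) - lowest_control
    # Treated has the higher response rate, so treated is trimmed.
    share = (q_t - q_c) / q_t
    k = round(share * len(treated))
    s = sorted(treated)
    return mean(s[: len(s) - k]) - mean(control), mean(s[k:]) - mean(control)


# Made-up example: 10 units per arm, 8 treated and 10 control outcomes observed.
treated_scores = [10, 12, 14, 16, 11, 13, 15, 17]
control_scores = list(range(8, 18))
print(lee_bounds(treated_scores, control_scores, n_treated=10, n_control=10))
```

The width of the interval grows with the gap in response rates between arms, which is why differential non-production makes the quality comparison harder to pin down.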

Data & Methods

  • Participants: 388 full-time employees with Copilot access; pairs were formed by constrained matching on functional area, baseline AI comfort, job level, and participation mode.
  • Randomization: stratified by functional area with covariate balance checks; however, sessions were split (AM control, PM treatment).
  • Tasks:
    • Task A: pair-level one-page action plan, 30-min cap, anti-generic constraint; one shared document per pair.
    • Task B: individual-level strategic communications responses.
  • Interventions:
    • Behavioral scaffold: synchronous Teams meeting, oral transcript, then explicit Copilot prompt to draft (Create-Out-Loud).
    • Cognitive scaffold: partnership training (AI Mindset-style, iterative prompting, “thought partner” framing).
  • Outcomes and measurement:
    • Primary: document quality graded by GPT-4o-mini (three independent grades per document; median used).
    • Validation: cross-model grading (GPT-4o) and human rater sample.
    • Surveys: perceived productivity/flow, Copilot helpfulness, future AI intent; belief inventories measured post-Task A and post-Task B.
  • Analysis:
    • ITT estimands.
    • OLS with HC2 robust SEs for pair-level Task A; OLS with CR2 clustered SEs at pair-level for individual Task B and surveys.
    • Multiple comparisons: Benjamini–Hochberg within outcome families.
    • Sensitivity analyses: Lee bounds for attrition, Oster bounds and mediation for word-count effects, calibration of session-effect magnitude against circadian literature.
    • Compliance: post-hoc classification into Stranded / Parallel Play / True Joint used descriptively (no causal inference from compliance subgroups).
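
The Benjamini–Hochberg adjustment listed under Analysis can be sketched in a few lines. This is a generic illustration of the standard step-up procedure, assuming the p-values passed in belong to a single outcome family; the function name `benjamini_hochberg` and the example p-values are hypothetical and are not results from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for one hypothetical outcome family.
family = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(family)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Applying the correction within each outcome family, as the summary describes, controls the false discovery rate per family and not across every test in the study.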

Implications for AI Economics

  • Structural constraints can backfire. Mandated, synchronous joint-AI protocols introduce coordination costs that may outweigh aggregation benefits in many real-world settings. Firms should be cautious imposing rigid collaborative AI workflows without ensuring reliable coordination infrastructure and clear complementary information structures.
  • Cognitive framing is promising but modest. Training that reframes AI as a “thought partner” can improve upper-tail individual performance (helping some workers realize gains), suggesting investment in mental-model interventions can increase the likelihood that some employees extract more value from GenAI. However, average effects may be small—organizations should set realistic expectations and consider targeted training for users most likely to benefit.
  • Adoption policy should be context-sensitive. Whether behavioral scaffolds help depends on task characteristics (need for integration vs. independent generation), infrastructure reliability, and whether the AI mediator can synthesize cross-person context. The trade-off B − C (knowledge aggregation benefits minus coordination costs) should guide design.
  • Measurement matters. LLM-based grading provides scalable quality measurement but can be sensitive to document length and other artifacts. Cross-validation with human raters and sensitivity analyses are essential when using AI judges in field studies of AI productivity.
  • Research and evaluation priorities for AI economics:
    • Test scaffold designs across different tasks, teams, and reliable infrastructure settings to map where behavioral protocols are beneficial versus harmful.
    • Evaluate heterogeneous treatment effects to identify which worker segments disproportionately benefit from cognitive framing.
    • Reduce confounds (e.g., avoid session timing collinearity) and pre-register designs to strengthen causal claims.
    • Incorporate extensive checks for attrition and protocol compliance; measure both intensive (quality) and extensive (document production/adoption) margins.
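
The "measurement matters" point can be made concrete with a crude first-pass diagnostic. The sketch below is not the authors' method (the summary reports Oster bounds and causal mediation). It only takes the median of three grades per document, as described under Data & Methods, and computes the Pearson correlation between word count and score. The function names and all data are hypothetical.

```python
from math import sqrt
from statistics import median


def document_score(grades):
    """Collapse repeated LLM grades for one document into a single score (median)."""
    return median(grades)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up documents: word count plus three independent grades each.
documents = [
    {"words": 180, "grades": [10, 12, 11]},
    {"words": 320, "grades": [13, 13, 15]},
    {"words": 450, "grades": [16, 14, 15]},
    {"words": 610, "grades": [18, 17, 19]},
]
word_counts = [d["words"] for d in documents]
scores = [document_score(d["grades"]) for d in documents]
print(pearson_r(word_counts, scores))
```

A strong positive correlation does not by itself show grader bias, since longer documents may really be better; it only flags that length should be examined before comparing arms that differ in output length.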

Practical takeaway for organizations: don’t assume that forcing joint AI workflows will automatically improve collaborative outcomes—ensure the coordination costs are low (technology, scheduling, norms) or prioritize cognitive scaffolding and targeted training to raise the capacity of individuals to engage iteratively with AI.

Assessment

Paper Type: RCT

Evidence Strength: medium. A field randomized design with a substantial sample and within-session belief measures provides credible causal leverage on how scaffolding affects human–AI use and short-run productivity; however, important design problems (an AM/PM session confound, differential attrition, and sensitivity of automated LLM grading to document length) weaken internal validity and therefore the strength of the evidence.

Methods Rigor: medium. The study uses randomized experimental variation (pre-registration is implied, not stated) and appropriate comparisons, plus sensitivity checks, but suffers from an avoidable procedural confound (session timing), attrition imbalance, potential within-pair clustering, and outcome measurement issues (LLM grading bias), which reduce the rigor of inference and require caution in interpretation.

Sample: 388 employees from a single Fortune 500 retail firm participated in a field experiment producing documents with a shared generative-AI tool; Task A was completed in assigned pairs and Task B individually; outcomes include document quality (graded by an LLM) and quantity, plus within-session belief measures.

Themes: human_ai_collab, productivity, skills_training, org_design

Identification: Randomized assignment of 194 pairs (388 employees) to a control arm (naturalistic AI use plus basic Copilot training) or a treatment arm (structured joint AI protocol for Task A plus partnership training for Task B), with all participants given access to the same generative AI tool; causal claims rest on this randomization, with analyses comparing outcomes and belief measures across arms.

Generalizability:

  • Single firm (Fortune 500 retailer): may not generalize to other industries or firm sizes.
  • Specific employee population and task (document production): results may differ for other tasks or skill levels.
  • Short-term, within-session effects: persistence over time is unclear.
  • Interventions tied to one AI tool and one implementation protocol: effects may vary with different models or interfaces.
  • Design confounds (AM/PM timing, attrition) limit how far the estimates can be carried to other settings.

Claims (7)

  • A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use. [Output Quality]
    • Direction: negative
    • Confidence: high
    • Outcome: document quality
    • Details: n=388; 0.6
  • The behavioral scaffolding intervention was associated with substantially lower document production. [Developer Productivity]
    • Direction: negative
    • Confidence: high
    • Outcome: document production (quantity of documents produced)
    • Details: n=388; 0.6
  • A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. [Output Quality]
    • Direction: positive
    • Confidence: high
    • Outcome: individual document quality (top of the distribution)
    • Details: n=388; 0.6
  • Participants in the treatment conditions showed greater positive belief change about the AI across the session. [Worker Satisfaction]
    • Direction: positive
    • Confidence: high
    • Outcome: change in participant beliefs about AI (pre/post)
    • Details: n=388; 0.3
  • Sensitivity analyses indicate the observed positive belief changes likely reflect recovery from carry-over effects rather than genuine training-induced shifts. [Worker Satisfaction]
    • Direction: mixed
    • Confidence: high
    • Outcome: validity of belief-change effect (source attribution: training vs. carry-over recovery)
    • Details: n=388; 0.3
  • All participants had access to the same AI tool; the experiment varied only the structure surrounding its use (behavioral vs cognitive scaffolding vs unstructured). [Other]
    • Direction: null_result
    • Confidence: high
    • Outcome: experimental manipulation fidelity (same AI tool across conditions)
    • Details: n=388; 1.0
  • The study's findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length. [Governance And Regulation]
    • Direction: negative
    • Confidence: high
    • Outcome: threats to validity (confounds and measurement sensitivity)
    • Details: n=388; 1.0
