Requiring pairs of workers to use generative AI jointly reduced both document output and average document quality, while a short training that framed AI as a thought partner improved quality among the best performers. However, a session-timing confound, differential attrition and the sensitivity of automated grading to document length mean the results should be interpreted cautiously.
Organizations have widely deployed generative AI tools, yet productivity gains remain uneven, suggesting that how people use AI matters as much as whether they have access. We conducted a field experiment with 388 employees at a Fortune 500 retailer to test two scaffolding interventions for human-AI collaboration. All participants had access to the same AI tool; we varied only the structure surrounding its use. A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use and substantially lower document production. A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. Treatment participants also showed greater positive belief change across the session, though sensitivity analyses suggest this likely reflects recovery from carry-over effects rather than genuine training-induced shifts. Both findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length.
Summary
Main Finding
A field experiment with 388 employees at a Fortune 500 retailer shows that the structure around AI use matters. A behavioral scaffold that mandated synchronous, joint AI use (the “Create-Out-Loud” protocol) reduced pair document production and lowered average pair document quality versus naturalistic use. A cognitive scaffold (partnership training that reframed AI as a “thought partner”) improved individual document quality at the top of the distribution, but did not produce a large average uplift. Reported positive belief changes for treatment participants likely reflect recovery from carry-over effects rather than durable training-induced belief shifts. Results are qualified by design and measurement limitations (notably an AM/PM session confound, differential attrition, and sensitivity of LLM grading to document length).
Key Points
- Sample and design
  - N = 388 full-time employees from Gap Inc., organized into 194 pairs (97 pairs per randomized arm).
  - Two sequential tasks in one day:
    - Task A (pair task, 30 minutes): one-page AI Adoption Action Plan (anti-generic constraint).
    - Task B (individual task): strategic communications responses to three AI-related stakeholder concerns.
  - Both arms had identical access to Microsoft Copilot; randomization assigned pairs to:
    - Control: naturalistic AI use (Task A) + basic Copilot training (Task B).
    - Treatment: structured “Create-Out-Loud” joint protocol (Task A) + partnership (thought-partner) training (Task B).
  - Important design constraint: AM session = Control, PM session = Treatment (session timing confounded with treatment).
- Primary outcomes
  - Document quality graded with GPT-4o-mini (LLM-as-judge), with rubrics:
    - Task A max = 22 points (four dimensions).
    - Task B max = 20 points (four dimensions).
  - Human validation on a stratified sample; cross-model validation (GPT-4o-mini vs GPT-4o) showed high agreement (Pearson r = 0.92, ICC = 0.92).
- Main empirical results
  - Behavioral scaffolding (structured joint protocol):
    - Associated with substantially lower document production (pairs less likely to produce a document).
    - Among documents produced, average pair document quality was lower than for naturalistic pairs.
    - Compliance was heterogeneous: many treatment pairs faced logistical/technical barriers; post-hoc compliance groups included “Stranded,” “Parallel Play,” and “True Joint” (only a minority achieved True Joint).
  - Cognitive scaffolding (partnership training):
    - Associated with higher individual document quality at the top of the distribution (improved upper-tail performance), but limited average effect.
  - Belief change:
    - Treatment participants showed greater positive belief movement across the session on several measures, but sensitivity/specification checks suggest this likely reflects recovery from Task A carry-over differences rather than a robust training-induced shift.
- Robustness and sensitivity
  - Analytic approach: intent-to-treat (ITT) OLS with HC2 robust SEs for pair-level analyses; CR2 clustered SEs for individual-level outcomes; adjustment for covariates.
  - Addressed differential attrition with Lee (2009) trimming bounds and modeled non-production as an outcome.
  - Assessed LLM grading sensitivity to document length via Oster bounds and causal mediation (ACME/ADE) decomposition.
  - Calibrated possible session/circadian effects against the literature, because treatment is collinear with the AM/PM session.
- Key limitations (reported by authors)
  - AM/PM session confound (treatment = PM) threatens causal interpretation if time-of-day affects performance.
  - Differential attrition / non-production across arms.
  - LLM grading can be sensitive to document length; grading validity was checked but remains a possible source of bias.
  - Compliance with the behavioral protocol was uneven and classified post hoc (selection after randomization).
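The Lee (2009) trimming bounds named above for differential attrition can be illustrated with a short sketch. This is not the authors' code: the function name `lee_bounds`, the convention of marking a missing document with `None`, and all scores are invented for the example, and covariates are ignored. The idea is to trim the arm with the higher production rate until both arms have the same share of scored documents, once from the top and once from the bottom, which brackets the effect on quality among pairs that would produce a document under either condition.

```python
from statistics import mean


def lee_bounds(y_treat, y_ctrl):
    """Return (lower, upper) trimming bounds on the treatment effect.

    Each argument is a list of outcomes with None where no document
    was produced. Simplified sketch: no covariates, trimming by count.
    """
    obs_t = sorted(v for v in y_treat if v is not None)
    obs_c = sorted(v for v in y_ctrl if v is not None)
    if not obs_t or not obs_c:
        raise ValueError("each arm needs at least one observed outcome")

    q_t = len(obs_t) / len(y_treat)  # production rate, treatment
    q_c = len(obs_c) / len(y_ctrl)   # production rate, control

    if q_c >= q_t:
        # Control produced more often: trim control observations.
        k = min(round((q_c - q_t) / q_c * len(obs_c)), len(obs_c) - 1)
        lower = mean(obs_t) - mean(obs_c[k:])                # drop lowest k controls
        upper = mean(obs_t) - mean(obs_c[:len(obs_c) - k])   # drop highest k controls
    else:
        # Treatment produced more often: trim treatment observations.
        k = min(round((q_t - q_c) / q_t * len(obs_t)), len(obs_t) - 1)
        lower = mean(obs_t[:len(obs_t) - k]) - mean(obs_c)   # drop highest k treated
        upper = mean(obs_t[k:]) - mean(obs_c)                # drop lowest k treated
    return lower, upper


# Invented scores: every control pair produced a document,
# two treatment pairs did not.
ctrl = [10, 12, 14, 16, 18, 11, 13, 15, 17, 19]
treat = [12, 14, 16, 18, 20, 13, 15, 17, None, None]
low, high = lee_bounds(treat, ctrl)
print(f"bounds on the quality effect: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the sign of the quality effect does not depend on which pairs dropped out; if it spans zero, attrition alone could account for the naive difference.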
Data & Methods
- Participants: 388 full-time employees with Copilot access; pairs created by constrained matching on functional area, baseline AI comfort, job level, participation mode.
- Randomization: stratified by functional area with covariate balance checks; however, sessions were split (AM control, PM treatment).
- Tasks:
  - Task A: pair-level one-page action plan, 30-min cap, anti-generic constraint; one shared document per pair.
  - Task B: individual-level strategic communications responses.
- Interventions:
  - Behavioral scaffold: synchronous Teams meeting, oral transcript, then explicit Copilot prompt to draft (Create-Out-Loud).
  - Cognitive scaffold: partnership training (AI Mindset-style, iterative prompting, “thought partner” framing).
- Outcomes and measurement:
  - Primary: document quality graded by GPT-4o-mini (three independent grades per document; median used).
  - Validation: cross-model grading (GPT-4o) and human rater sample.
  - Surveys: perceived productivity/flow, Copilot helpfulness, future AI intent; belief inventories measured post-Task A and post-Task B.
- Analysis:
  - ITT estimands.
  - OLS with HC2 robust SEs for pair-level Task A; OLS with CR2 SEs clustered at the pair level for individual Task B and surveys.
  - Multiple comparisons: Benjamini–Hochberg within outcome families.
  - Sensitivity analyses: Lee bounds for attrition, Oster bounds and mediation for word-count effects, calibration of session-effect magnitude against circadian literature.
- Compliance: post-hoc classification into Stranded / Parallel Play / True Joint used descriptively (no causal inference from compliance subgroups).
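To make the pair-level estimator concrete, the sketch below writes out OLS with an HC2 sandwich variance by hand for the simplest case: an intercept and a treatment dummy only. That simplification is an assumption of the example (the analysis described above also adjusts for covariates), and the function name `itt_hc2` and the scores are hypothetical. The CR2 clustered estimator used for individual-level outcomes is not shown.

```python
from math import sqrt
from statistics import variance


def itt_hc2(y, d):
    """OLS of y on an intercept and a 0/1 treatment dummy d.

    Returns (effect, hc2_se) for the coefficient on d.
    """
    n, n1 = len(y), sum(d)
    n0 = n - n1
    if n0 < 2 or n1 < 2:
        raise ValueError("need at least two units per arm")

    sum_y = sum(y)
    sum_y1 = sum(yi for yi, di in zip(y, d) if di == 1)

    # (X'X)^{-1} for X = [1, d], stored as [[a, b], [b, c]].
    det = n * n1 - n1 * n1
    a, b, c = n1 / det, -n1 / det, n / det

    b0 = a * sum_y + b * sum_y1   # intercept (control mean)
    b1 = b * sum_y + c * sum_y1   # ITT effect (difference in means)

    # HC2 "meat": sum of x x' e^2 / (1 - h), with h the leverage.
    m00 = m01 = m11 = 0.0
    for yi, di in zip(y, d):
        resid = yi - b0 - b1 * di
        h = a + 2 * b * di + c * di * di
        w = resid ** 2 / (1 - h)
        m00 += w
        m01 += w * di
        m11 += w * di * di

    # Sandwich variance of the slope: second row of inv times meat
    # times second column of inv.
    v11 = b * (m00 * b + m01 * c) + c * (m01 * b + m11 * c)
    return b1, sqrt(v11)


# Invented pair-level scores: four control pairs, four treatment pairs.
scores = [14, 16, 15, 17, 12, 13, 11, 15]
arm = [0, 0, 0, 0, 1, 1, 1, 1]
effect, se = itt_hc2(scores, arm)

# With a single binary regressor, HC2 reduces to the unpooled
# difference-in-means standard error.
se_check = sqrt(variance(scores[:4]) / 4 + variance(scores[4:]) / 4)
print(f"ITT effect = {effect:.2f}, HC2 SE = {se:.4f}, check = {se_check:.4f}")
```

The final check relies on a property of this special case: with only a treatment dummy, the HC2 standard error equals the square root of the sum of each arm's sample variance divided by its size, so it stays valid when the two arms have different variances.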
Implications for AI Economics
- Structural constraints can backfire. Mandated, synchronous joint-AI protocols introduce coordination costs that may outweigh aggregation benefits in many real-world settings. Firms should be cautious imposing rigid collaborative AI workflows without ensuring reliable coordination infrastructure and clear complementary information structures.
- Cognitive framing is promising but modest. Training that reframes AI as a “thought partner” can improve upper-tail individual performance (helping some workers realize gains), suggesting investment in mental-model interventions can increase the likelihood that some employees extract more value from GenAI. However, average effects may be small—organizations should set realistic expectations and consider targeted training for users most likely to benefit.
- Adoption policy should be context-sensitive. Whether behavioral scaffolds help depends on task characteristics (need for integration vs. independent generation), infrastructure reliability, and whether the AI mediator can synthesize cross-person context. The trade-off B − C (knowledge aggregation benefits minus coordination costs) should guide design.
- Measurement matters. LLM-based grading provides scalable quality measurement but can be sensitive to document length and other artifacts. Cross-validation with human raters and sensitivity analyses are essential when using AI judges in field studies of AI productivity.
- Research and evaluation priorities for AI economics:
  - Test scaffold designs across different tasks, teams, and reliable infrastructure settings to map where behavioral protocols are beneficial versus harmful.
  - Evaluate heterogeneous treatment effects to identify which worker segments disproportionately benefit from cognitive framing.
  - Reduce confounds (e.g., avoid session timing collinearity) and pre-register designs to strengthen causal claims.
  - Incorporate extensive checks for attrition and protocol compliance; measure both intensive (quality) and extensive (document production/adoption) margins.
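The measurement point above (LLM grading cross-checked against a second grader) can be illustrated with a small sketch of the aggregation and agreement steps: take the median of repeated grades per document, then correlate the two graders' medians. The helper names and every score below are invented for illustration, and only Pearson's r is shown; the ICC reported in the study is not reproduced here.

```python
from math import sqrt
from statistics import mean, median


def median_grade(runs):
    """Collapse repeated grades for one document to their median."""
    return median(runs)


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented data: three grading runs per document from each of two judges.
judge_a_runs = [[14, 15, 17], [9, 10, 10], [18, 19, 18], [12, 11, 13]]
judge_b_runs = [[15, 16, 16], [9, 11, 10], [19, 19, 20], [13, 12, 12]]

judge_a = [median_grade(r) for r in judge_a_runs]
judge_b = [median_grade(r) for r in judge_b_runs]
r = pearson_r(judge_a, judge_b)
print(f"cross-judge Pearson r = {r:.3f}")
```

A high correlation shows only that the two judges rank documents similarly. It does not rule out a bias both judges share, such as favoring longer documents, which is why the length-sensitivity checks remain necessary.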
Practical takeaway for organizations: don’t assume that forcing joint AI workflows will automatically improve collaborative outcomes—ensure the coordination costs are low (technology, scheduling, norms) or prioritize cognitive scaffolding and targeted training to raise the capacity of individuals to engage iteratively with AI.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| A behavioral scaffolding intervention (a structured protocol requiring joint AI use within pairs) was associated with lower document quality relative to unstructured use. (Output Quality) | negative | high | document quality | n=388; 0.6 |
| The behavioral scaffolding intervention was associated with substantially lower document production. (Developer Productivity) | negative | high | document production (quantity of documents produced) | n=388; 0.6 |
| A cognitive scaffolding intervention (partnership training that reframed AI as a thought partner) was associated with higher individual document quality at the top of the distribution. (Output Quality) | positive | high | individual document quality (top of the distribution) | n=388; 0.6 |
| Participants in the treatment conditions showed greater positive belief change about the AI across the session. (Worker Satisfaction) | positive | high | change in participant beliefs about AI (pre/post) | n=388; 0.3 |
| Sensitivity analyses indicate the observed positive belief changes likely reflect recovery from carry-over effects rather than genuine training-induced shifts. (Worker Satisfaction) | mixed | high | validity of belief-change effect (source attribution: training vs. carry-over recovery) | n=388; 0.3 |
| All participants had access to the same AI tool; the experiment varied only the structure surrounding its use (behavioral vs cognitive scaffolding vs unstructured). (Other) | null_result | high | experimental manipulation fidelity (same AI tool across conditions) | n=388; 1.0 |
| The study's findings are subject to design limitations including an AM/PM session confound, differential attrition, and LLM grading sensitivity to document length. (Governance And Regulation) | negative | high | threats to validity (confounds and measurement sensitivity) | n=388; 1.0 |