Large experiment finds a 'speedup illusion': users expect LLMs to be faster but actual completion times on simple tasks are unchanged, even as subjective effort falls; the bias is specific to AI and not seen when imagining human help.
Large language models (LLMs) have the potential to boost human productivity by speeding up task completion -- provided users know when to offload cognitive work to them. But we do not know if users are well-calibrated in estimating these potential time savings. We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks. While actual completion times between independent completion and AI-assisted completion did not differ, participants predicted AI to be significantly faster. The same bias was not observed when imagining help from another human participant. We identify a speedup illusion where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times. Additionally, time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times. This suggests that completion time itself is not sufficient to characterize efficiency gains.
Summary
Main Finding
People systematically overestimate how much time large language models (LLMs) will save on simple cognitive tasks — a "speedup illusion." In a preregistered behavioral study (N = 1,237), predicted AI-assisted completion times were much shorter than people’s actual AI-assisted times, whereas predictions of independent (no-AI) completion times were well calibrated. Although AI rarely reduced actual completion time for easy tasks, it consistently reduced subjective mental effort.
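The calibration comparison behind this finding can be illustrated with a minimal sketch. Because predictions and completions come from separate samples, the gap is a difference of group means rather than a paired difference. The function name and all numbers below are hypothetical illustrations, not data or code from the study, and plain means stand in for the study's mixed-effects estimates.

```python
from statistics import mean

def calibration_gap(predicted, actual):
    """Mean actual time minus mean predicted time, in seconds.

    Positive values mean the task took longer than people expected.
    """
    return mean(actual) - mean(predicted)

# Hypothetical toy values in seconds (NOT data from the study).
predicted_independent = [120.0, 150.0, 90.0]
actual_independent = [118.0, 155.0, 93.0]
predicted_ai = [60.0, 70.0, 50.0]
actual_ai = [115.0, 130.0, 100.0]

gap_independent = calibration_gap(predicted_independent, actual_independent)
gap_ai = calibration_gap(predicted_ai, actual_ai)
```

In this toy setup the independent gap is small while the AI gap is large, which is the qualitative pattern the study labels the speedup illusion.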
Key Points
- Sample and design
  - Total N = 1,237 (prediction sample n = 401, completion sample n = 836) recruited on Prolific (US-representative); preregistered; IRB-approved; data/code available.
  - Between-subject design: separate prediction vs. completion samples to avoid contamination.
  - Tasks: 24 tasks (4 categories: C1 Information Seeking, C2 Processing & Synthesis, C3 Procedural Guidance & Execution, C4 Content Creation & Transformation), each at two difficulty levels (easy / difficult). Only correct, high-quality responses were analyzed (6.3% of independent and 4.0% of AI-assisted responses were excluded).
- Main quantitative results
  - People expected AI to reduce completion time by ~68.5 seconds relative to independent completion (β = 68.5, SE = 3.37, p < 0.001).
  - Predicted vs. actual calibration:
    - Independent condition: no reliable difference between predicted and actual completion times (β = −2.52, SE = 8.83, p = 0.775), i.e., good calibration.
    - AI condition: actual AI-assisted completion time was significantly longer than predicted AI-assisted time (difference ≈ +57.8 seconds; β = 57.8, SE = 8.79, p < 0.001). This gap is the speedup illusion.
  - Actual time savings: AI assistance sped up only difficult tasks (mean ≈ 26.1 seconds faster; SE = 11.9, p < 0.05) and produced significant time savings on only 3 of the 24 tasks.
  - Subjective effort (NASA-TLX, 5-item version):
    - AI reduced perceived effort across tasks: average NASA-TLX decreased by 0.61 points on a 7-point scale (SE = 0.059, t = −10.38, p < 0.001).
    - 15 out of 24 tasks showed significant reductions in subjective effort with AI assistance.
  - Individual differences:
    - Lower Need for Cognition (NfC), i.e., less enjoyment of effortful thinking, was associated with a larger predicted AI speedup (greater susceptibility to the speedup illusion).
    - Frequency of AI use and general AI assessments did not predict calibration error.
- Prompting and interaction
  - 70% of user–LLM interactions were single-turn; max observed turns = 5.
  - Model generation time was small (≈ 2.89 seconds).
  - Prompt composition time and post-response processing time were similar on average, but varied by task:
    - Prompting took ~33.4 seconds longer than post-processing for C2 tasks.
    - For some tasks (e.g., a logic problem), post-response reading/processing was much longer (≈ 114 seconds longer), driven by verbose or hard-to-parse model outputs.
  - Copy–pasted prompts (18.5% of prompts) reduced NASA-TLX slightly (~0.14) but did not shorten completion time.
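The interaction overhead described above (prompt composition, model generation, post-response processing) can be sketched as a decomposition of AI-assisted completion time. This is a minimal illustration under assumptions: the `Turn` fields and the `decompose` helper are hypothetical, the study's actual logging format is not described in this summary, and the timestamps are made up.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt_start: float   # user starts composing the prompt
    prompt_sent: float    # user submits the prompt
    response_done: float  # model finishes generating
    turn_end: float       # user moves on (next prompt or task submission)

def decompose(turns):
    """Sum prompting, generation, and post-response processing time over all turns."""
    return {
        "prompting": sum(t.prompt_sent - t.prompt_start for t in turns),
        "generation": sum(t.response_done - t.prompt_sent for t in turns),
        "processing": sum(t.turn_end - t.response_done for t in turns),
    }

# Hypothetical single-turn interaction; timestamps are seconds since task start.
turns = [Turn(prompt_start=0.0, prompt_sent=30.0, response_done=33.0, turn_end=75.0)]
parts = decompose(turns)
total = sum(parts.values())
```

Splitting the total this way makes it visible when generation is a small share of the time and the human-side steps dominate, as the figures above suggest.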
Data & Methods
- Two-sample design:
  - Prediction sample: participants estimated how long each task would take independently and with external assistance (AI or another highly intelligent human). They also stated whether they'd choose to offload.
  - Completion sample: participants either completed tasks independently or with AI assistance (embedded GPT‑4o chat); hidden timers recorded completion times; NASA-TLX measured subjective effort after each task.
- Analyses:
  - Linear mixed-effects models with random intercepts for participant and task; main contrasts: (prediction vs. completion) × (independent vs. AI-assisted) controlling for task difficulty and task category.
  - Focused analyses on correct, high-quality responses.
- Key robustness & exclusions:
  - Pre-registration and IRB.
  - Excluded low-effort or incorrect responses; reported proportions excluded.
  - Sample balanced on demographics (reported percentages).
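One way to write down the mixed-effects model described above is sketched here. The exact specification is not given in this summary, so the predictor coding and every symbol are assumptions: Phase is 0 for prediction and 1 for completion, Cond is 0 for independent and 1 for AI-assisted, c_j is a vector of task-category indicators, and u_i and v_j are random intercepts for participant i and task j.

```latex
\begin{aligned}
t_{ij} &= \beta_0 + \beta_1\,\mathrm{Phase}_{ij} + \beta_2\,\mathrm{Cond}_{ij}
        + \beta_3\,(\mathrm{Phase}_{ij}\times\mathrm{Cond}_{ij})
        + \beta_4\,\mathrm{Difficulty}_{j} + \boldsymbol{\gamma}^{\top}\mathbf{c}_{j}
        + u_i + v_j + \varepsilon_{ij},\\
u_i &\sim \mathcal{N}(0,\sigma_u^2),\quad
v_j \sim \mathcal{N}(0,\sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0,\sigma^2).
\end{aligned}
```

Under this assumed coding, the calibration gap (actual minus predicted time) is β₁ in the independent condition and β₁ + β₃ in the AI-assisted condition, so the interaction term β₃ captures how much worse calibration is for AI-assisted work.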
Implications for AI Economics
- Productivity measurement: time-to-completion alone can misrepresent AI-driven productivity. LLMs may not reduce objective time on many simple tasks but do reduce subjective effort. Economic assessments of AI productivity gains should incorporate both objective time and subjective/psychic costs (effort, cognitive load).
- Adoption dynamics and demand:
  - Miscalibrated user beliefs (speedup illusion) can inflate adoption of LLMs for tasks where they do not actually save time. This may create a self-reinforcing feedback loop: people use AI because it feels less effortful and expect time savings, which further normalizes offloading even absent objective efficiency gains.
  - Models of diffusion and firm adoption should allow for demand driven by perceived (not realized) gains and by reductions in subjective effort that alter labor allocation or task willingness.
- Labor-market effects:
  - Substitution vs. augmentation is heterogeneous by task complexity. AI produced measurable time savings primarily on harder tasks, suggesting complementarity for complex tasks and limited substitution for simple ones.
  - Lower subjective effort with unchanged completion times could change workers' willingness to perform certain tasks (increasing labor supply for some activities) or enable higher throughput in longer workflows where subjective effort constraints were binding, even if per-task timing is unchanged.
- Measurement & policy recommendations:
  - When estimating macro or firm-level productivity impacts, include measures of interaction overhead (prompting, processing verbose outputs), post-response cognitive processing, and user calibration errors.
  - Interface and transparency interventions (e.g., showing expected vs. empirical time-to-complete, concise model outputs, better summarization) may reduce miscalibration and improve resource-rational offloading decisions.
  - Training and nudges targeted at users with low Need for Cognition may be especially important, since they are more prone to overestimating AI time savings.
- Research and modeling suggestions:
  - Incorporate subjective effort reductions as a separate utility term in economic models of AI adoption and labor supply.
  - Allow adoption to be influenced by beliefs that may diverge from realized productivity gains and study dynamic belief-updating from repeated use.
  - Account for heterogeneity in task-level returns: estimate distributions of independent and AI-assisted completion times, tH(τ) and tA(τ), by task complexity, and include interaction costs (prompting + post-processing).
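The modeling suggestions above can be sketched as a toy offloading decision in which adoption depends on believed costs while welfare depends on realized costs, with effort entering as its own term. Everything here is a hypothetical illustration: the `TaskCosts` fields, the linear cost form, the `effort_weight` value (seconds a user would trade per effort point), and all numbers are assumptions, not estimates from the study.

```python
from dataclasses import dataclass

@dataclass
class TaskCosts:
    time_independent: float    # seconds without AI
    time_ai: float             # seconds with AI, including prompting and post-processing
    effort_independent: float  # subjective effort without AI (e.g., NASA-TLX points)
    effort_ai: float           # subjective effort with AI

def net_benefit_of_ai(costs, time_weight=1.0, effort_weight=0.0):
    """Weighted time saved plus weighted effort saved by using AI."""
    time_saved = costs.time_independent - costs.time_ai
    effort_saved = costs.effort_independent - costs.effort_ai
    return time_weight * time_saved + effort_weight * effort_saved

def chooses_ai(believed, time_weight=1.0, effort_weight=0.0):
    """The user offloads when the believed net benefit is positive."""
    return net_benefit_of_ai(believed, time_weight, effort_weight) > 0

# Hypothetical numbers (NOT from the study): the user believes AI halves the time,
# but realized AI-assisted time is slightly longer than working alone.
believed = TaskCosts(120.0, 60.0, 4.0, 3.4)
realized = TaskCosts(120.0, 121.0, 4.0, 3.4)

adopts = chooses_ai(believed)
realized_time_only = net_benefit_of_ai(realized)
realized_with_effort = net_benefit_of_ai(realized, effort_weight=20.0)
```

In this toy case the user adopts on the strength of a believed speedup, the realized time-only benefit is negative, and the realized benefit turns positive only once the effort term is counted. A time-only model would misread that as an adoption mistake.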
Caveats
- Tasks were short, discrete cognitive tasks in an experimental setting using GPT‑4o; generalization to complex, multi-step real-world workflows, other LLMs, or integrated systems may be limited.
- Between-subject design avoids practice effects but does not measure within-person calibration change after feedback.
- The study filtered for correct, high-quality responses, so results concern tasks where outcomes were achieved, not cases with incorrect AI output or error correction costs.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Large language models (LLMs) have the potential to boost human productivity by speeding up task completion -- provided users know when to offload cognitive work to them. [Task Completion Time] | positive | high | task completion speed (potential) | 0.1 |
| We conducted a preregistered large-scale behavioral study (N = 1237) to characterize mismatches between expectations and reality, with a focus on simple cognitive tasks. [Other] | null_result | high | study design / sample size (methodological claim) | n=1237; 1.0 |
| Actual completion times between independent completion and AI-assisted completion did not differ. [Task Completion Time] | null_result | high | actual completion time | n=1237; 1.0 |
| Participants predicted AI to be significantly faster. [Task Completion Time] | positive | high | predicted completion time | n=1237; 1.0 |
| The same bias was not observed when imagining help from another human participant. [Task Completion Time] | null_result | high | predicted completion time when imagining help from another human | n=1237; 1.0 |
| There is a 'speedup illusion' where people have accurate forecasts of independent completion times but significantly underestimate AI-assisted times. [Task Completion Time] | negative | high | calibration of predicted vs actual completion time | n=1237; 1.0 |
| Time and effort dissociate: participants reported lower subjective effort with AI despite equivalent completion times. [Worker Satisfaction] | positive | high | subjective effort (self-reported); actual completion time also measured | n=1237; 1.0 |
| Completion time itself is not sufficient to characterize efficiency gains. [Organizational Efficiency] | mixed | high | adequacy of completion time as a measure of efficiency | n=1237; 0.6 |