Large language models subtly but systematically change what people mean when they write: in a user study, extensive LLM use was associated with a nearly 70% increase in neutral-stance essays, and heavy users reported prose that was less creative and less "in their voice". When asked to revise human essays, LLMs frequently alter the semantics; LLM-generated peer reviews weight clarity and significance less and score papers roughly one point higher on average.
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.
Summary
Main Finding
LLMs used as writing assistants systematically change not just surface style but the intended semantic content of human writing. In experiments and large-scale analyses the authors show that (a) heavy LLM use drives essays toward neutral/altered argumentative stances and reduced perceived creativity/voice, (b) LLM edits — even under “grammar” or minimal-edit instructions — induce large, consistent semantic shifts away from human revisions, and (c) LLM-generated peer reviews shift evaluative criteria and give substantially higher scores (≈1 point on average), indicating institutional-level effects.
Code / project page: https://github.com/abdulhaim/llm_writing_distortion and https://sites.google.com/view/llmwritingdistortion/home
Key Points
- Human randomized trial (N=100): participants writing on “Does money lead to happiness?”; control vs. AI-assisted (gpt-4o-mini).
  - Extensive LLM use produced a ~70% increase in essays that became neutral in stance (vs. for/against).
  - Heavy users reported less creativity and that writing was “not in their voice.”
- Counterfactual editing on ArgRewrite-v2 (86 essays, pre-ChatGPT 2021):
  - Prompting three production LLMs (gpt-5-mini, gemini-2.5-flash, claude-haiku) to revise human drafts using the same expert feedback produced large semantic shifts.
  - Even “minimal” or “grammar” edit prompts caused substantial, aligned shifts in semantic embedding space; LLM revisions clustered in a region not occupied by human edits.
  - LLM edits increased argumentative/analytical language and roughly doubled both positive and negative sentiment.
  - Lexical and POS distributions, stylistic features, and emotional tone diverged from human edits.
- Peer-review analysis (ICLR 2026):
  - ~21% of peer reviews were LLM-generated or heavily edited with LLMs.
  - LLM reviews emphasized reproducibility/scalability/practical application more and clarity/significance less.
  - LLM reviews assigned higher scores on average (about +1 point).
- Overall conclusion: LLM assistance introduces an implicit, consistent bias in semantics and evaluation, producing homogenization and potential institutional shifts.
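The "~70% increase" in neutral essays can be read in more than one way, so a small worked example may help. The sketch below assumes the figure is a relative increase in the share of neutral essays; the summary does not state the baseline, and the stance labels and counts here are hypothetical, not the study's data.

```python
def neutral_share(stances):
    """Fraction of essays whose stance label is 'neutral'."""
    return sum(s == "neutral" for s in stances) / len(stances)


def relative_increase(baseline, treated):
    """Relative change of `treated` over `baseline` (0.67 means +67%)."""
    return (treated - baseline) / baseline


# Hypothetical stance labels, for illustration only (not the paper's data).
control = ["for"] * 4 + ["against"] * 3 + ["neutral"] * 3
heavy_llm = ["for"] * 3 + ["against"] * 2 + ["neutral"] * 5

base = neutral_share(control)
treated = neutral_share(heavy_llm)

# The same pair of shares gives two different headline numbers:
# a percentage-point gap and a relative increase.
print(f"neutral share, control:   {base:.0%}")
print(f"neutral share, heavy LLM: {treated:.0%}")
print(f"percentage-point gap:     {(treated - base) * 100:.0f} pp")
print(f"relative increase:        {relative_increase(base, treated):.0%}")
```

With these made-up counts, a 20-point gap in neutral share corresponds to a relative increase of roughly two thirds, which is why it matters which of the two quantities a headline figure refers to.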
Data & Methods
- Human study:
  - N=100 participants recruited via Prolific; native English speakers in the U.S.; IRB-approved.
  - Random assignment: control (no LLM, n=45) vs. AI-assisted (n=55) with interaction transcripts recorded.
  - Pre- and post-study questionnaires measured self-reported creativity, voice alignment, and attitudes.
  - AI-assisted participants were classified as LLM-Influenced vs. LLM (extensive generative use) based on transcripts and self-report (threshold: <40% generated = LLM-Influenced).
- ArgRewrite-v2 counterfactual edits:
  - 86 argumentative essays written in 2021 with expert feedback (human D1 → human D2).
  - Prompted three LLMs to produce D2 variants under five revision types: general, minimal, grammar, completion, expansion.
  - Measured differences across dimensions: semantic distance (sentence embeddings from MiniLM-L6-v2, PCA visualizations), lexical frequency/unique words, POS distributions, sentiment/emotion metrics, stylistic features.
- Peer-review analysis:
  - ICLR 2026 reviews; identified LLM-generated/edited reviews (prior work reporting ≈21%).
  - Comparative content analysis of strengths/weaknesses and score distributions between human and LLM reviews.
- Models tested: gpt-5-mini, gemini-2.5-flash, claude-haiku, gpt-4o-mini (user study assistant).
- Quantitative highlights reported by authors: ~70% shift to neutrality among heavy users; ≈1 point higher scores from LLM reviews; doubling of sentiment intensity in edits.
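The semantic-distance comparison for the ArgRewrite-v2 edits can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes cosine distance between a first draft (D1) and its revision (D2) as the metric, and it uses toy 3-dimensional vectors in place of real MiniLM-L6-v2 sentence embeddings. All function and variable names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_shift(drafts, revisions):
    """Average cosine distance between each draft and its paired revision."""
    distances = [cosine_distance(d, r) for d, r in zip(drafts, revisions)]
    return sum(distances) / len(distances)


# Toy stand-ins for essay embeddings (assumption: real vectors would come
# from a sentence-embedding model and have far more dimensions).
d1_drafts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
human_d2 = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]   # small, varied changes
llm_d2 = [[0.6, 0.2, 0.6], [0.2, 0.6, 0.6]]     # both pulled along one shared axis

human_shift = mean_shift(d1_drafts, human_d2)
llm_shift = mean_shift(d1_drafts, llm_d2)

print(f"mean D1 -> human D2 distance: {human_shift:.3f}")
print(f"mean D1 -> LLM D2 distance:   {llm_shift:.3f}")
```

In the toy data, both LLM revisions move toward the same third axis, which mirrors the summary's description of LLM revisions shifting in an aligned direction and clustering apart from human edits. To run this on real text, the vectors could be produced with the sentence-transformers library (for example `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`; the exact checkpoint name is an assumption, since the summary only says MiniLM-L6-v2).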
Implications for AI Economics
- Externalities and collective-welfare losses
  - Homogenization of expression and argumentation represents a negative externality: individual users gain productivity but collective cultural and informational diversity erodes, potentially reducing innovation and robustness of deliberation.
  - In scientific markets, systematic upward scoring and changed evaluation criteria (favoring reproducibility/scalability over clarity/significance) can misallocate attention, funding, and career rewards, altering research agendas and creating path dependencies.
- Market and incentive effects
  - Firms, researchers, and students face a social dilemma: individual adoption increases measurable productivity/citations (short-term private gain) but contributes to a collective contraction in topic diversity and idea variety (long-term public loss). This can produce equilibrium over-adoption of LLM co-writing despite social harms.
  - Platform and model providers may inadvertently amplify these distortions via training/feedback loops (LLM outputs retrained into future data), generating cultural lock-in and economically reinforcing the “mono-voice.”
- Information quality and allocation of scarce resources
  - If LLM-influenced reviews systematically inflate scores, peer-review and grant-evaluation markets may become noisier or biased, affecting allocation efficiency across fields and projects.
  - Convergence toward certain evaluative criteria changes selection pressures in research ecosystems (e.g., favoring engineering/practicality over foundational novelty).
- Labor and human capital
  - Writers, editors, and evaluators risk deskilling and loss of unique human signals (voice, judgment), affecting labor demand for high-skill writing/editing and possibly compressing wage premia tied to distinctive analytic judgment.
- Policy and design responses (economic levers)
  - Internalize externalities: consider disclosure/labeling requirements for AI-assisted text in domains where decision quality matters (academia, policy, legal, medical).
  - Incentive alignment: fund or reward diversity-preserving behaviors (grants for human-authored work, meta-research on semantic drift), or adjust evaluation metrics to penalize detectable homogenization.
  - Platform design: encourage tools that constrain semantic drift (editing modes that prioritize semantic fidelity, explainable edit-deltas), or expose edit provenance to downstream consumers.
  - Market regulation: provenance and auditability standards, plus public investment in detection/mitigation R&D to prevent feedback-loop reinforcement in training corpora.
- Research needs for economic modeling
  - Quantify welfare trade-offs between productivity gains and diversity losses; model dynamic feedback loops where LLM outputs re-enter training data (endogenous cultural evolution).
  - Empirical work on long-run effects on innovation rates, topic breadth, and allocation efficiency in research and media markets.
  - Design and test mechanism interventions (subsidies, taxes, disclosure rules, platform incentives) to correct for collective-welfare harms.
- Short practical takeaway for economists and policymakers
  - Treat widespread LLM-assisted writing as an innovation with substantial non-rival negative externalities on cultural and informational ecosystems. Measuring and internalizing those externalities (via disclosure rules, evaluation redesign, and incentives for diversity) should be a priority to avoid inefficient institutional drift.
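The social-dilemma argument above (private gain from adoption, shared cost from lost diversity) can be made concrete with a toy payoff model. This is an illustrative sketch only: the payoff function, the population size, and the gain and harm values are all assumptions chosen for the example, not quantities estimated in the paper.

```python
def payoff(adopts, n_adopters, n, private_gain, harm_per_adopter):
    """Payoff to one writer in a population of n writers.

    An adopter earns `private_gain`. Each adopter also causes
    `harm_per_adopter` of diversity loss, which is spread evenly
    over all n writers whether or not they adopt.
    """
    shared_harm = harm_per_adopter * n_adopters / n
    return (private_gain if adopts else 0.0) - shared_harm


# Hypothetical parameters: the harm an adopter causes exceeds their private
# gain, but each individual bears only 1/N of the harm they cause.
N = 100
GAIN = 1.0
HARM = 3.0

nobody_adopts = payoff(False, 0, N, GAIN, HARM)
lone_adopter = payoff(True, 1, N, GAIN, HARM)
everyone_adopts = payoff(True, N, N, GAIN, HARM)
lone_holdout = payoff(False, N - 1, N, GAIN, HARM)

print(f"nobody adopts:            {nobody_adopts:+.2f} per writer")
print(f"first adopter:            {lone_adopter:+.2f}")
print(f"everyone adopts:          {everyone_adopts:+.2f} per writer")
print(f"lone holdout among users: {lone_holdout:+.2f}")
```

Under these assumed numbers, adopting is individually better whatever others do (the first adopter beats the no-adoption baseline, and a lone holdout does worse than adopters), yet universal adoption leaves every writer worse off than no adoption. That is the over-adoption equilibrium the bullet on market and incentive effects describes.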
Assessment
Claims (11)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. [Adoption Rate] | positive | high | LLM adoption and primary use case (writing assistance) | over a billion people; 0.48 |
| LLMs alter the voice and tone of human writing. [Output Quality] | positive | high | change in voice and tone of writing | 0.48 |
| LLMs consistently alter the intended meaning of human writing. [Output Quality] | positive | high | degree of semantic change / alteration of intended meaning | 0.48 |
| In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. [Output Quality] | positive | high | proportion of essays judged as neutral in answering the topic question | nearly 70% increase; 0.48 |
| Significantly more heavy LLM users reported that the writing was less creative and not in their voice. [Creativity] | negative | high | self-reported creativity and 'in-your-voice' authenticity of writing | 0.48 |
| Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning. [Output Quality] | positive | high | magnitude of content and semantic changes after LLM revision | large changes; 0.48 |
| Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. [Output Quality] | positive | high | semantic alteration of text despite constrained grammar-only prompt | 0.48 |
| About 21% of scientific peer reviews at a recent top AI conference were AI-generated (LLM-generated) in the wild. [Adoption Rate] | positive | high | share/proportion of peer reviews that were AI-generated | 21%; 0.48 |
| LLM-generated peer reviews place significantly less weight on clarity and significance of the research. [Decision Quality] | negative | high | importance/weight given to clarity and significance in peer review content | 0.48 |
| LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews. [Decision Quality] | positive | high | assigned review scores | a full point higher; 0.48 |
| These findings indicate a misalignment between the perceived benefit of AI writing and an implicit, consistent effect on the semantics of human writing, with potential implications for cultural and scientific institutions. [Governance And Regulation] | mixed | high | alignment between perceived benefits and actual semantic effects of AI writing; institutional impact | 0.08 |