Large language models subtly but systematically change what people mean when they write: heavy LLM users produce nearly 70% more neutral answers and report less creative, less ‘in‑their‑voice’ prose. When asked to revise human essays or write peer reviews, LLMs frequently alter semantics and give reviews that weight clarity/significance less and score papers roughly one point higher on average.

How LLMs Distort Our Written Language

Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques · March 18, 2026

arXiv · quasi-experimental · medium evidence · 7/10 relevance

LLM use systematically alters writers’ voice and semantic meaning—heavy users produce far more neutral answers and edits by LLMs change essay meaning even under ‘grammar-only’ prompts—while AI-generated peer reviews tend to emphasize different criteria and give on average one point higher scores.

Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

Summary

Main Finding

LLMs used as writing assistants systematically change not just surface style but the intended semantic content of human writing. In experiments and large-scale analyses the authors show that (a) heavy LLM use drives essays toward neutral/altered argumentative stances and reduced perceived creativity/voice, (b) LLM edits — even under “grammar” or minimal-edit instructions — induce large, consistent semantic shifts away from human revisions, and (c) LLM-generated peer reviews shift evaluative criteria and give substantially higher scores (≈1 point on average), indicating institutional-level effects.

Code / project page: https://github.com/abdulhaim/llm_writing_distortion and https://sites.google.com/view/llmwritingdistortion/home

Key Points

Human randomized trial (N=100): participants writing on “Does money lead to happiness?”; control vs. AI-assisted (gpt-4o-mini).
- Extensive LLM use produced a nearly 70% increase in essays that remained neutral in stance (rather than arguing for or against).
- Heavy users reported less creativity and that writing was “not in their voice.”
Counterfactual editing on ArgRewrite-v2 (86 essays, pre-ChatGPT 2021):
- Prompting three production LLMs (gpt-5-mini, gemini-2.5-flash, claude-haiku) to revise human drafts using the same expert feedback produced large semantic shifts.
- Even “minimal” or “grammar” edit prompts caused substantial, aligned shifts in semantic embedding space; LLM revisions clustered in a region not occupied by human edits.
- LLM edits increased argumentative/analytical language and roughly doubled both positive and negative sentiment.
- Lexical and POS distributions, stylistic features, and emotional tone diverged from human edits.
Peer-review analysis (ICLR 2026):
- ~21% of peer reviews were LLM-generated or heavily edited with LLMs.
- LLM reviews emphasized reproducibility/scalability/practical application more and clarity/significance less.
- LLM reviews assigned higher scores on average (about +1 point).
Overall conclusion: LLM assistance introduces an implicit, consistent bias in semantics and evaluation, producing homogenization and potential institutional shifts.

Data & Methods

Human study:
- N=100 participants recruited via Prolific; native English speakers in the U.S.; IRB-approved.
- Random assignment: control (no LLM, n=45) vs. AI-assisted (n=55) with interaction transcripts recorded.
- Pre- and post-study questionnaires measured self-reported creativity, voice alignment, and attitudes.
- AI-assisted participants were classified as LLM-influenced or LLM (extensive generative use) from transcripts and self-report (threshold: <40% generated = LLM-influenced).
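The classification rule above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the summary does not say how the generated share was measured, so the verbatim sentence-overlap proxy, the function names, and the treatment of exactly 40% as "LLM" are all assumptions, and the self-report signal is omitted.

```python
import re

# From the summary: <40% generated = LLM-influenced. Everything else here is assumed.
GENERATED_SHARE_THRESHOLD = 0.40


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; lowercases and collapses whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def generated_share(essay: str, llm_outputs: list[str]) -> float:
    """Hypothetical proxy: fraction of essay sentences found verbatim in LLM outputs."""
    essay_sentences = split_sentences(essay)
    if not essay_sentences:
        return 0.0
    llm_sentences = {s for out in llm_outputs for s in split_sentences(out)}
    copied = sum(1 for s in essay_sentences if s in llm_sentences)
    return copied / len(essay_sentences)


def classify_participant(essay: str, llm_outputs: list[str]) -> str:
    """Label an AI-assisted participant using the 40% threshold."""
    share = generated_share(essay, llm_outputs)
    return "LLM-influenced" if share < GENERATED_SHARE_THRESHOLD else "LLM"
```

A verbatim-overlap proxy would miss lightly paraphrased LLM text, which is one reason a real pipeline would likely combine transcripts with self-report, as the summary describes.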
ArgRewrite-v2 counterfactual edits:
- 86 argumentative essays written in 2021 with expert feedback (human D1 → human D2).
- Prompted three LLMs to produce D2 variants under five revision types: general, minimal, grammar, completion, expansion.
- Measured differences across dimensions: semantic distance (sentence embeddings MiniLM-L6-v2, PCA visualizations), lexical frequency/unique words, POS distributions, sentiment/emotion metrics, stylistic features.
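To make the semantic-distance measurement concrete, here is a minimal sketch of cosine distance between a draft and its revisions. It assumes a toy bag-of-words vector as a stand-in for the MiniLM-L6-v2 sentence embeddings named above, so the numbers only illustrate the comparison, not the paper's results; the example sentences and function names are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (NOT MiniLM-L6-v2)."""
    return Counter(text.lower().split())


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def drift(draft: str, revision: str) -> float:
    """Distance between a first draft (D1) and a revised draft (D2)."""
    return cosine_distance(toy_embed(draft), toy_embed(revision))


# Hypothetical example texts, invented for illustration only.
draft = "money can buy comfort but not happiness"
human_revision = "money can buy comfort but not lasting happiness"
llm_revision = "financial resources may enhance wellbeing in nuanced ways"

print(drift(draft, human_revision) < drift(draft, llm_revision))
```

Swapping `toy_embed` for a real sentence-embedding model would keep the rest of the comparison unchanged, since cosine distance only needs two vectors of the same kind.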
Peer-review analysis:
- ICLR 2026 reviews; identified LLM-generated/edited reviews (prior work reporting ≈21%).
- Comparative content analysis of strengths/weaknesses and score distributions between human and LLM reviews.
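The two comparisons in this analysis (what reviews talk about, and what scores they give) can be sketched with a keyword mention rate and a difference in mean scores. The reviews below are invented placeholders, keyword matching is a crude assumed proxy for the authors' content analysis, and the function names are hypothetical.

```python
from statistics import mean


def mention_rate(reviews: list[dict], keyword: str) -> float:
    """Share of reviews whose text mentions the keyword (case-insensitive)."""
    if not reviews:
        return 0.0
    hits = sum(1 for r in reviews if keyword.lower() in r["text"].lower())
    return hits / len(reviews)


def mean_score_gap(llm_reviews: list[dict], human_reviews: list[dict]) -> float:
    """Mean LLM-review score minus mean human-review score."""
    llm_mean = mean([r["score"] for r in llm_reviews])
    human_mean = mean([r["score"] for r in human_reviews])
    return llm_mean - human_mean


# Hypothetical reviews, invented for illustration only (not ICLR data).
human_reviews = [
    {"text": "The clarity of the writing is poor", "score": 4.0},
    {"text": "Significance is limited but clarity is good", "score": 6.0},
]
llm_reviews = [
    {"text": "Reproducibility and scalability are strong", "score": 6.0},
    {"text": "Practical application is well motivated", "score": 6.0},
]

print(mention_rate(human_reviews, "clarity"), mention_rate(llm_reviews, "clarity"))
print(mean_score_gap(llm_reviews, human_reviews))
```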
Models tested: gpt-5-mini, gemini-2.5-flash, claude-haiku, gpt-4o-mini (user study assistant).
Quantitative highlights reported by authors: nearly 70% increase in neutral-stance essays among heavy users; ≈1 point higher scores from LLM reviews; roughly doubled positive and negative sentiment in LLM edits.
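A "nearly 70% increase" is a relative change, which is easy to confuse with a change in percentage points. The sketch below shows the arithmetic with made-up rates; the underlying neutral-essay rates are not given in this summary, so the 20% and 34% figures are purely hypothetical.

```python
def relative_increase(baseline_rate: float, treated_rate: float) -> float:
    """Relative change of treated_rate over baseline_rate (0.7 means +70%)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (treated_rate - baseline_rate) / baseline_rate


def percentage_point_change(baseline_rate: float, treated_rate: float) -> float:
    """Absolute change expressed in percentage points."""
    return (treated_rate - baseline_rate) * 100


# Hypothetical rates: 20% neutral essays at baseline, 34% among heavy users.
baseline, heavy = 0.20, 0.34
print(round(relative_increase(baseline, heavy), 2))        # relative change
print(round(percentage_point_change(baseline, heavy), 1))  # percentage points
```

Under these assumed rates the same shift reads as roughly +70% in relative terms but only about 14 percentage points in absolute terms, so the baseline rate matters when interpreting the headline figure.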

Implications for AI Economics

Externalities and collective-welfare losses
- Homogenization of expression and argumentation represents a negative externality: individual users gain productivity but collective cultural and informational diversity erodes, potentially reducing innovation and robustness of deliberation.
- In scientific markets, systematic upward scoring and changed evaluation criteria (favoring reproducibility/scalability over clarity/significance) can misallocate attention, funding, and career rewards — altering research agendas and creating path-dependencies.
Market and incentive effects
- Firms, researchers, and students face a social dilemma: individual adoption increases measurable productivity/citations (short-term private gain) but contributes to a collective contraction in topic diversity and idea variety (long-term public loss). This can produce equilibrium over-adoption of LLM co-writing despite social harms.
- Platform and model providers may inadvertently amplify these distortions via training/feedback loops (LLM outputs retrained into future data), generating cultural lock-in and reinforcing the “mono-voice” economically.
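The social-dilemma logic above can be illustrated with a toy payoff function. This is not a model from the paper: the payoff form (a private gain for each adopter, plus a harm per adopter spread evenly over the whole population) and all parameter values are assumptions chosen only to show why over-adoption can be an equilibrium.

```python
def payoff(adopts: bool, n_adopters: int, population: int,
           private_gain: float, social_cost: float) -> float:
    """Toy payoff: own gain from adopting minus an equal share of the total harm.

    Each adopter creates `social_cost` of harm, spread evenly over `population`.
    All parameters are hypothetical.
    """
    shared_harm = social_cost * n_adopters / population
    return (private_gain if adopts else 0.0) - shared_harm


# Hypothetical parameters: small private gain, larger total harm per adopter.
N, GAIN, COST = 100, 1.0, 5.0

# With 50 others already adopting, adopting is still individually better...
print(payoff(True, 51, N, GAIN, COST), payoff(False, 50, N, GAIN, COST))
# ...yet everyone adopting leaves each person worse off than no one adopting.
print(payoff(True, N, N, GAIN, COST), payoff(False, 0, N, GAIN, COST))
```

In this sketch adoption is individually attractive whenever `private_gain > social_cost / population`, but collectively harmful whenever `private_gain < social_cost`, which is the gap that disclosure rules or incentives would aim to close.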
Information quality and allocation of scarce resources
- If LLM-influenced reviews systematically inflate scores, peer-review and grant-evaluation markets may become noisier or biased, affecting allocation efficiency across fields and projects.
- Convergence toward certain evaluative criteria changes selection pressures in research ecosystems (e.g., favoring engineering/practicality over foundational novelty).
Labor and human capital
- Writers, editors, and evaluators risk deskilling and loss of unique human signals (voice, judgment), affecting labor demand for high-skill writing/editing and possibly compressing wage premia tied to distinctive analytic judgement.
Policy and design responses (economic levers)
- Internalize externalities: consider disclosure/labeling requirements for AI-assisted text in domains where decision quality matters (academia, policy, legal, medical).
- Incentive alignment: fund or reward diversity-preserving behaviors (grants for human-authored work, meta-research on semantic drift), or adjust evaluation metrics to penalize detectable homogenization.
- Platform design: encourage tools that constrain semantic drift (editing modes that prioritize semantic fidelity, explainable edit-deltas), or expose edit provenance to downstream consumers.
- Market regulation: provenance and auditability standards, plus public investment in detection/mitigation R&D to prevent feedback-loop reinforcement in training corpora.
Research needs for economic modeling
- Quantify welfare trade-offs between productivity gains and diversity losses; model dynamic feedback loops where LLM outputs re-enter training data (endogenous cultural evolution).
- Empirical work on long-run effects on innovation rates, topic breadth, and allocation efficiency in research and media markets.
- Design and test mechanism interventions (subsidies, taxes, disclosure rules, platform incentives) to correct for collective-welfare harms.
Short practical takeaway for economists and policymakers
- Treat widespread LLM-assisted writing as an innovation with substantial non-rival negative externalities on cultural and informational ecosystems. Measuring and internalizing those externalities — via disclosure rules, evaluation redesign, and incentives for diversity — should be a priority to avoid inefficient institutional drift.


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper triangulates three complementary data sources (user study, controlled LLM re-writing of pre-LLM essays, and analysis of conference reviews) producing consistent patterns, which strengthens inference; however, causal claims are limited by self-selection in the user survey, potential model- and prompt-specific effects in the writing experiments, and possible measurement/classification error in identifying AI-generated reviews.

Methods Rigor: medium. Methods combine human-subject survey work, systematic prompt-based LLM interventions on an out-of-time essay corpus, and naturalistic review comparisons, which is methodologically thoughtful; but rigor is reduced by absence of randomization, limited detail on sampling and robustness checks (as reported here), potential confounds in observational comparisons, and dependence on semantic-change metrics whose sensitivity and validity are not fully described.

Sample: Three empirical components: (1) a human user study surveying LLM users (composition and sample size not specified here) comparing heavy vs light users’ perceptions and writing outcomes; (2) a dataset of human-written essays collected in 2021 (pre-widespread LLM usage) used to run controlled experiments where LLMs revise essays given human feedback or grammar-only prompts; (3) an analysis of peer reviews from a recent top AI conference in which ~21% of reviews were identified as AI-generated, used to compare content focus and scores between human and AI reviews.

Themes: human_ai_collab, governance

Identification: Combination of observational user survey comparing heavy vs light LLM users, controlled text-intervention experiments where pre-LLM essays are revised by LLMs under specific prompts (e.g., ‘only fix grammar’ or apply provided expert feedback), and comparative analysis of naturally occurring peer reviews (human vs LLM-identified) at a recent conference; identification relies on cross-group comparisons and within-text pre/post changes rather than randomized assignment or instrumental variation.

Generalizability:
- Self-selection and unobserved differences between heavy and light LLM users limit causal generalization to all writers.
- Findings may depend on the specific LLM(s), prompts, and prompt-engineering choices used and may not generalize across models or future versions.
- Essay dataset from 2021 may reflect specific genres (e.g., student essays) and English-language conventions, limiting cross-linguistic or cross-genre applicability.
- Peer-review findings come from a single (top) AI conference and may not generalize to other fields, venues, or reviewer populations.
- Measures of ‘semantic change’, ‘creativity’, and ‘voice’ are partly subjective and could vary with annotation protocol and metrics.

Claims (11)

Claim	Category	Direction	Confidence	Outcome	Details
Large language models (LLMs) are used by over a billion people globally, most often to assist with writing.	Adoption Rate	positive	high	LLM adoption and primary use case (writing assistance)	over a billion people 0.48
LLMs alter the voice and tone of human writing.	Output Quality	positive	high	change in voice and tone of writing	0.48
LLMs consistently alter the intended meaning of human writing.	Output Quality	positive	high	degree of semantic change / alteration of intended meaning	0.48
In a human user study, extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question.	Output Quality	positive	high	proportion of essays judged as neutral in answering the topic question	nearly 70% increase 0.48
Significantly more heavy LLM users reported that the writing was less creative and not in their voice.	Creativity	negative	high	self-reported creativity and 'in-your-voice' authenticity of writing	0.48
Using a dataset of human-written essays (collected in 2021 before widespread LLM release), asking an LLM to revise essays based on human-written feedback induces large changes in the resulting content and meaning.	Output Quality	positive	high	magnitude of content and semantic changes after LLM revision	large changes 0.48
Even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning.	Output Quality	positive	high	semantic alteration of text despite constrained grammar-only prompt	0.48
About 21% of scientific peer reviews at a recent top AI conference were AI-generated (LLM-generated) in the wild.	Adoption Rate	positive	high	share/proportion of peer reviews that were AI-generated	21% 0.48
LLM-generated peer reviews place significantly less weight on clarity and significance of the research.	Decision Quality	negative	high	importance/weight given to clarity and significance in peer review content	0.48
LLM-generated peer reviews assign scores that, on average, are a full point higher than human reviews.	Decision Quality	positive	high	assigned review scores	a full point higher 0.48
These findings indicate a misalignment between the perceived benefit of AI writing and an implicit, consistent effect on the semantics of human writing, with potential implications for cultural and scientific institutions.	Governance And Regulation	mixed	high	alignment between perceived benefits and actual semantic effects of AI writing; institutional impact	0.08