Allowing ChatGPT use on knowledge-based coursework improved short-term student performance in several classes, with more interactive use linked to modestly higher scores; students valued its speed and organizational support but warned of inaccuracies and overreliance, prompting calls for explicit AI-literacy instruction.
Abstract

This study investigates how university students engage with generative artificial intelligence (GenAI), specifically ChatGPT, when completing knowledge-based academic tasks across six courses and two institutions. By comparing performance and perceptions in engineering and non-engineering subjects, the study examines whether students can use GenAI effectively without prior training and to what extent such tools meaningfully support learning. The work also explores how these findings may inform future research on accessible and inclusive learning design. A multi-method design was employed with 254 undergraduate and graduate students assigned to experimental groups (allowed to use ChatGPT) or control groups (restricted to traditional, non-GenAI resources). Quantitative analyses included descriptive statistics, a general linear model, and non-parametric comparisons, complemented by a topic-based analysis of open-ended survey responses addressing students’ perceptions, usage patterns, and desired functionalities. Students in the experimental groups generally obtained higher scores, with significant improvements in several subjects (e.g., computer systems administration, informatics, childhood disorders). A weak but significant positive correlation emerged between iterative engagement with ChatGPT (edits) and academic performance. Qualitative analysis showed that students valued ChatGPT for fast information access, clarification of concepts, and organizational support, while also expressing concerns about inaccuracies, overreliance, and limitations of free versions. GenAI can enhance student performance when used actively and reflectively, although its effectiveness varies by disciplinary context. The findings highlight the need for explicit AI-literacy instruction to ensure critical and responsible use.
While the study does not directly address disability or accessibility outcomes, the qualitative patterns suggest potential intersections with inclusive and multimodal learning design, pointing to promising avenues for future research.
Summary
Main Finding
Students given unrestricted access to ChatGPT generally scored higher on short, knowledge-based academic tasks than students restricted to traditional resources; gains were significant in several courses (e.g., Computer Systems Administration, Informatics, Childhood Disorders). Iterative, active engagement with the model (measured by edits/prompts) showed a weak but significant positive correlation with performance. Qualitative responses indicate students value speed, clarification, and organization support from ChatGPT, but worry about inaccuracies, over-reliance, and limits of free versions. The paper concludes GenAI can boost learning outcomes when used reflectively, but effectiveness varies by discipline and requires explicit AI-literacy instruction.
Key Points
- Design: Multi-method, multi-institutional experimental study comparing experimental (allowed ChatGPT) vs control (no GenAI) conditions on identical knowledge questions.
- Sample: 254 students (133 experimental, 121 control) across six courses at two Spanish universities; predominantly engineering students (219), 24% female in engineering courses.
- No formal GenAI training was given; under 7% of participants reported prior instruction, so the study captures naturalistic, untrained use.
- Assessment: identical question sets and rubrics for both groups to isolate effect of GenAI access; additional logging of interaction metrics (prompts/edits).
- Quantitative analysis: descriptive stats, general linear model (GLM), and non-parametric tests. Main quantitative findings:
- Experimental group generally outperformed control.
- Significant improvements in specific subjects (e.g., CSA, INF, CD).
- Weak but statistically significant positive correlation between iterative engagement with ChatGPT and higher scores.
- Qualitative analysis: topic-based coding of open-ended responses revealed:
- Perceived benefits: fast access to information, concept clarification, help with organization and synthesis.
- Perceived drawbacks: factual errors, risk of over-reliance, limited domain specificity, constraints of free model versions.
- Practical notes: a few participants did not follow assigned conditions and were reclassified according to actual behavior; the activity was optional in some courses, creating group size imbalances.
- Limitations acknowledged: disciplinary variation in GenAI utility (less effective for high-order analytical engineering tasks), gender and sample composition imbalances, no direct evidence on outcomes for students with disabilities (though implications for inclusive design are discussed).
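The edit–score relationship described above can be illustrated with a small, self-contained sketch. The data below are hypothetical (not the study's), and a Spearman rank correlation stands in for whatever exact statistic the authors used; the point is only to show how a weak positive association between edit counts and scores is computed.

```python
# Illustrative sketch only: hypothetical data, not the study's, and the
# statistic (Spearman's rho) is an assumption about the analysis.

def ranks(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-student data: edits made to ChatGPT output, task score (0-10)
edits  = [0, 1, 1, 2, 3, 3, 4, 5, 6, 8]
scores = [6.0, 5.0, 7.0, 6.5, 4.5, 8.0, 6.0, 7.5, 5.5, 7.0]
rho = spearman(edits, scores)  # weak positive for this toy data (~0.19)
```

A rho of this magnitude mirrors the paper's "weak but significant" pattern: detectable in a sample of 254, but far from deterministic at the individual level.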
Data & Methods
- Participants: 254 undergraduate and graduate students from University of Salamanca and University of León across six courses (four engineering-related; two education/psychology-related).
- Assignment: systematic sampling into experimental (n=133) or control (n=121) groups; a few participants who did not follow their assigned condition were reclassified to reflect actual tool use.
- Intervention: Experimental group could use ChatGPT (and other GenAI); control group restricted to non-GenAI materials (notes, textbooks, web resources without GenAI).
- Tasks: Same set of short, knowledge-based questions per course; scoring rubric and conditions identical across arms.
- Quantitative analyses:
- Descriptive statistics of scores by course and group.
- General Linear Model to assess effect of condition controlling for covariates.
- Non-parametric comparisons where appropriate.
- Correlation analysis between interaction metrics (number of edits/prompts) and task scores.
- Qualitative analyses:
- Topic-based coding of open-ended survey items about perceived usefulness, reliability, usage patterns, and desired features.
- Key robustness/validity choices: identical tests to isolate tool effect; naturalistic setting (no training) to observe unaided appropriation.
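As a concrete illustration of the non-parametric group comparison listed above, here is a minimal stdlib-only sketch. The scores are hypothetical (not the study's data), the test shown (Mann-Whitney U with a normal approximation) is an assumption about which non-parametric test was used, and values are taken as distinct so no tie correction is needed.

```python
# Minimal sketch, not the authors' code: Mann-Whitney U comparison of
# hypothetical experimental (ChatGPT-allowed) vs control scores.
import math

def mann_whitney_u(a, b):
    """Return (U for group a, two-sided p via normal approximation).
    Assumes all values are distinct, so no tie correction is applied."""
    n1, n2 = len(a), len(b)
    rank = {v: i + 1 for i, v in enumerate(sorted(a + b))}
    r1 = sum(rank[v] for v in a)               # rank sum of group a
    u1 = r1 - n1 * (n1 + 1) / 2                # U statistic
    mu = n1 * n2 / 2                           # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided p = 2*(1 - Phi(|z|))
    return u1, p

# Hypothetical task scores on a 0-10 scale
experimental = [7.1, 8.2, 6.9, 7.8, 8.5, 7.4, 6.6, 8.0]  # ChatGPT allowed
control      = [6.1, 5.8, 7.0, 6.4, 5.5, 6.8, 6.2, 7.2]  # no GenAI
u, p = mann_whitney_u(experimental, control)
```

With real data of n=254 one would use an exact or tie-corrected implementation; the normal approximation here only keeps the sketch dependency-free.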
Implications for AI Economics
- Human capital and productivity:
- Short-run productivity gains: Access to GenAI (ChatGPT) improves performance on certain knowledge tasks, implying potential near-term gains in student productivity and learning efficiency.
- Complementarity vs substitution: Gains depended on active, iterative use—suggesting GenAI acts as a complement to student effort and skill (those who use it reflectively benefit most) rather than a pure substitute for learning.
- Heterogeneous returns: Effect sizes vary by discipline; GenAI may raise returns to skills emphasizing synthesis and conceptual recall more than high-order analytical/problem-solving skills in STEM—affecting how human capital investment returns differ across fields.
- Labor market and skill demand:
- Increased demand for AI-literacy: Educational institutions and employers will benefit from investing in AI-literacy training to realize GenAI complementarities; lack of instruction reduces potential gains and raises risks (misinformation, dependency).
- Credentialing and assessment redesign: If GenAI materially aids routine knowledge tasks, credentialing systems may need to shift toward assessments of higher-order, domain-specific reasoning and human-AI collaboration skills.
- Access, equity, and market implications:
- Access matters: Widespread, often free access to GenAI can reduce time and search costs for students, but differential effective use (due to prior skills, training, gendered participation rates, or device access) could exacerbate inequalities unless AI-literacy and supervised deployment are scaled equitably.
- Market for complementary services: Positive outcomes without training highlight a baseline utility of GenAI, but the demonstrated benefits of iterative, skillful use suggest commercial opportunities for education providers offering structured AI-usage curricula, scaffolding tools, or domain-tuned models.
- Public policy and institutional investment:
- Cost-effectiveness: GenAI can be a low-cost lever to improve learning outcomes in some domains, but net welfare gains depend on investment in instruction, oversight, and assessment redesign to mitigate misuse and inaccuracies.
- Regulation and standards: Findings strengthen the case for institutional policies on acceptable use, transparency of AI-assisted work, and standards for AI integration into graded activities.
- Research and macro implications:
- Aggregate productivity: If similar improvements generalize beyond this setting, adoption of GenAI in education could accelerate skill acquisition at scale, potentially affecting the future labor supply quality and the pace of technological diffusion.
- Need for field-specific evaluation: Economic models of AI’s impact should account for heterogeneity across domains and the role of user skill and training in converting access into productivity gains.
Limitations for economic interpretation: effects are task- and discipline-specific, sample is university students (mostly engineering in Spain), and the study captures short-run assessment outcomes rather than long-run learning or labor-market impacts. Future economic work should estimate long-term returns to AI-augmented education, distributional effects, and cost–benefit of training programs.
Assessment
Claims (13)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Allowing students to use ChatGPT on knowledge-based academic tasks led to generally higher scores compared with control groups restricted to non-GenAI resources. | Output Quality | positive | medium | student task/course scores (short-term performance on knowledge-based tasks) | n=254; statistically significant (overall); 0.36 |
| The improvement from allowing ChatGPT use was statistically significant in specific courses (examples named: computer systems administration, informatics, childhood disorders). | Output Quality | positive | medium | course/task scores within specified courses | n=254; statistically significant in some courses; 0.36 |
| There is a weak but statistically significant positive relationship between iterative engagement with ChatGPT (measured by number of edits to the tool's outputs) and better academic performance. | Output Quality | positive | medium | student task/course scores (correlated with number of edits) | n=254; weak but statistically significant positive correlation; 0.36 |
| Students reported that ChatGPT provided faster access to information, helped clarify concepts, and aided organization (e.g., outlining and summarizing). | Worker Satisfaction | positive | medium | student-reported perceived usefulness/benefits | n=254; 0.36 |
| Students raised concerns about ChatGPT producing factual errors, the risk of overreliance that could reduce independent thinking, and functional constraints of free ChatGPT versions. | AI Safety and Ethics | negative | medium | student-reported concerns and perceived risks | n=254; 0.36 |
| Effectiveness of ChatGPT varied by discipline; not all course contexts showed significant gains from allowing its use. | Output Quality | mixed | medium | course/task scores (heterogeneous effects across disciplines) | n=254; heterogeneous by discipline; 0.36 |
| The study focused on short-term, knowledge-based tasks and did not measure long-term learning or retention. | Skill Acquisition | null result | high | long-term learning/retention (not measured) | n=254; 0.6 |
| The study did not directly measure accessibility or impacts on students with disabilities, though qualitative results suggest possible intersections with inclusive and multimodal learning design. | Other | null result | high | accessibility/disability-related educational outcomes (not measured) | n=254; 0.6 |
| Based on findings and student-reported concerns, the authors recommend integrating explicit AI-literacy instruction to support critical and reflective use of GenAI tools in education. | Training Effectiveness | positive | medium | recommendation for AI-literacy instruction (policy/educational intervention) | 0.36 |
| The study employed a multi-method approach combining experimental quantitative analysis (descriptives, GLM, non-parametric robustness checks) with qualitative topic-based coding of open-ended survey responses. | Research Productivity | null result | high | study methodology (mixed-methods design) | n=254; 0.6 |
| Observed higher short-term performance and the positive correlation with iterative engagement imply that GenAI can augment short-term academic productivity and that benefits depend partly on active, skillful user interaction (complementarity). | Output Quality | positive | speculative | short-term academic productivity (inferred complementarity interpretation) | n=254; interpretive inference of complementarity; 0.06 |
| Differential access to higher-quality (paid) versus free GenAI tools and differing ability to engage with the tool could widen inequality among students and institutions. | Inequality | negative | speculative | equity/inequality in access and learning outcomes (not directly measured) | 0.06 |
| The study has potential selection and ecological-validity constraints because it was conducted at two institutions across six courses, limiting generalizability. | Research Productivity | null result | high | external validity/generalizability (limitation) | n=254; 0.6 |