The Commonplace

Only about half of the CHI papers that publish their data and code are fully reproducible; missing or poorly documented data, non-runnable analysis code, and environment mismatches are the main culprits. The authors urge mandatory executable artifacts, environment capture (containers), clear documentation, and stronger incentives to raise reproducibility and strengthen empirical credibility.

On the Computational Reproducibility of Human-Computer Interaction
Olga Iarygina, Kasper Anders Søren Hornbæk, Aske Mottelson · April 13, 2026 · IT University of Copenhagen
OpenAlex · review_meta · medium evidence · 7/10 relevance · DOI · Source · PDF
Attempting to rerun the shared data and analysis code from CHI papers, the authors fully reproduced the reported results in 49% of cases. They identify missing data, non-runnable or undocumented code, and preprocessing and environment issues as common barriers, and recommend standards, executable artifacts, and supporting infrastructure to improve reproducibility.

An increasing number of HCI researchers have embraced open science ideas, such as sharing data and analysis code. However, such practices are only meaningful when the shared data and code enable other researchers to reproduce and reuse the reported findings. To investigate the reproducibility of HCI research, we identified all CHI papers that have shared study data and analysis code, and attempted to reproduce the results. We were able to fully reproduce 49% of the papers. We surveyed and interviewed authors, asking them to assess the reproducibility of their own work and to reflect on their motivations and obstacles in doing open science. We discuss what improves and hinders reproducibility and provide recommendations on how to increase reproducibility rates in HCI. While the value of replicability remains contested in HCI, we argue that the more modest goal of reproducibility is desirable.

Summary

Main Finding

The authors attempted computational reproduction of 76 CHI papers (2007–2024) that openly provided both study data and analysis code. They fully reproduced 36 papers (≈49%), partially reproduced 23 (≈30%), and could not reproduce 15 (≈20%). Only about 1% of all CHI papers had repositories that met the inclusion criteria (data + analysis code). The study finds that while HCI increasingly adopts open-science rhetoric, the quality and usability of shared materials are often insufficient to guarantee reproducibility.

Key Points

  • Scope and outcome
    • Sample: 76 CHI papers with publicly linked repositories (GitHub, OSF, Zenodo) that contained both data and analysis code.
    • Results: 36 fully reproducible, 23 partially reproducible, 15 not reproducible.
    • Sharing rate: roughly 1 in 100 CHI papers published code+data in a way that enabled a reproduction attempt.
  • Common obstacles to reproducibility
    • Broken or private links; missing or incomplete data files.
    • Lack of README or minimal/poor documentation; no codebook for variables.
    • Multiple scripts with no indicated execution order; hard-coded paths.
    • Undocumented custom functions and sparse code comments; large outputs that hide reported values.
    • OS- or platform-specific dependencies, and reliance on obscure or nonstandard data formats.
    • Discrepancies between code output and reported paper values.
  • Helpful practices identified
    • Detailed README/Wiki: repository structure, software requirements, execution order.
    • Use of RStudio project files (.Rproj) or other project scaffolding to keep file paths relative to the project root (a minimal path-handling sketch follows this list).
    • Clear variable names or codebooks, standard data formats (.csv/.json), annotated scripts.
    • Authors asking others to test-run their code before release; hosting repositories on FAIR-compliant platforms (OSF/Zenodo).
  • Author survey & interviews
    • Authors cited mixed motivations; barriers include a lack of incentives, collaborator resistance, and the effort required to prepare reproducible artifacts.
    • Recommendations from interviewed authors: clearer instructions, better documentation, consistent naming, and asking peers to execute the code pre-release.
  • Technical context
    • Most analysis code written in Python or R; smaller use of SPSS, MATLAB, JASP.
    • Reproduction attempts were performed on a MacBook Pro (Apple M3, macOS 15.4.1); if troubleshooting a blocking error took more than ~45 minutes, the attempt was treated as unsuccessful.
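
Hard-coded paths appear among the obstacles above, and project scaffolding among the helpful practices. A minimal Python sketch of the underlying idea, resolving locations relative to the repository rather than to one author's machine (the data/ and output/ directory names and the responses.csv file are illustrative, not taken from the paper):

```python
from pathlib import Path
import csv

# Resolve every location relative to the script itself, so the analysis runs
# the same way regardless of machine or current working directory.
PROJECT_ROOT = Path(__file__).resolve().parent
DATA_DIR = PROJECT_ROOT / "data"       # e.g. data/responses.csv (illustrative name)
OUTPUT_DIR = PROJECT_ROOT / "output"   # tables and figures are written here
OUTPUT_DIR.mkdir(exist_ok=True)

def load_responses(filename: str = "responses.csv") -> list[dict]:
    """Load a study data file from the repository's data directory."""
    with open(DATA_DIR / filename, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

R users get a similar effect from .Rproj files or packages such as here, which anchor paths to the project root.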

Data & Methods

  • Repository discovery
    • Searched full texts of CHI papers (2007–2024) for occurrences of “git*”, “osf”, and “zenodo”.
    • Extracted 836 occurrences → 654 unique URLs → manual coding → 76 repositories that contained both analysis code and relevant data (a sketch of the URL-extraction step follows this list).
  • Inclusion criteria
    • Repositories had to publicly include the data and the code used to generate the reported study findings.
    • Platforms considered: GitHub, OSF, Zenodo (chosen for prevalence and FAIR considerations).
  • Reproduction workflow
    • For each repo: locate repository → inspect README/Wiki → identify data files → run code → compare outputs (numerics/figures) to paper.
    • Execution environment: local machine (MacBook Pro as above). Languages: mainly Python and R.
    • Time policy: followed prior literature practice—if >45 minutes of troubleshooting didn’t resolve critical blocking errors, the attempt was considered unsuccessful.
  • Outcome coding
    • Fully reproducible: reproduced the exact reported numerical/figure outputs (allowed minor fixes like path updates or inferring variable names).
    • Partially reproducible: some but not all outputs matched.
    • Not reproducible: could not run the code to produce matching outputs or required materials were missing.
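
The repository-discovery step above (searching paper full texts for "git*", "osf", and "zenodo", then reducing the hits to unique URLs) can be sketched in a few lines. A minimal version, assuming the full texts are available as plain-text files; the directory name and regular expression are illustrative, not the authors' actual pipeline:

```python
import re
from pathlib import Path

# Match links to the three platforms considered in the study.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(?:github\.com|osf\.io|zenodo\.org)/\S+",
    re.IGNORECASE,
)

def find_repo_urls(fulltext_dir: str) -> set[str]:
    """Return the unique repository URLs mentioned in a folder of plain-text papers."""
    urls: set[str] = set()
    for paper in Path(fulltext_dir).glob("*.txt"):
        text = paper.read_text(encoding="utf-8", errors="ignore")
        # Strip trailing punctuation that often clings to URLs in extracted text.
        urls.update(m.group(0).rstrip(".,;)]") for m in REPO_URL.finditer(text))
    return urls

# Example: len(find_repo_urls("chi_fulltexts/")) would give the unique-URL count.
```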

Implications for AI Economics

The paper’s findings have several direct implications for research and policy in AI economics, a subfield that frequently depends on computational analyses, large datasets, and complex modeling.

  • Reproducibility as a credibility and policy-risk issue

    • Empirical AI economics informs regulatory policy, market design, investment decisions, and firm strategy. Low computational reproducibility weakens confidence in results used to shape high-stakes economic decisions.
    • Nonreproducible results increase the risk of misguided policies or misallocation of capital in AI-driven markets.
  • Economic costs and inefficiencies

    • Time and effort wasted attempting to rerun results (missing data, broken pipelines) represent inefficiencies. Reproducibility failures add friction to cumulative science, raising the social cost of research verification.
    • For funders and institutions, poor reproducibility reduces the return on R&D investments because outputs are less reusable or verifiable.
  • Incentives and public-good nature of shared artifacts

    • Code and cleaned datasets are non-rival public goods with positive externalities (facilitating replication, extension, meta-analysis). Market/private incentives often underprovide them.
    • Policy levers: funders and journals can internalize these externalities via mandates, artifact badges, or crediting systems (citations, data DOIs, promotion criteria) to encourage sharing and adequate documentation.
  • Practical recommendations for AI economics researchers

    • Platform & persistence: prefer stable, FAIR-compliant archives (OSF, Zenodo) or provide archived releases (e.g., GitHub + Zenodo DOI) to ensure long-term access.
    • Minimum reproducibility checklist: README with execution order and environment; environment specification (package versions, Docker/Conda/renv or container images); data schema/codebook; automated run script that produces all reported tables/figures (see the run-all sketch after this list).
    • Sensitive and proprietary data: when underlying firm/platform data cannot be shared, researchers should (a) share code and synthetic or example datasets, (b) provide precise queries and data generation scripts, (c) use data enclaves or audited reproducibility procedures, or (d) include an independent auditor or replicability review as part of publication where feasible.
    • Use of containers and CI: distribute Docker/OCI images or reproducible environments, and use continuous-integration tests to ensure scripts run after release.
    • Documentation & testing norms: require minimal unit tests or smoke tests that verify key summary statistics and enable rapid triage of execution issues (see the smoke-test sketch after this list).
  • Institutional and policy levers tailored to AI economics

    • Journals, conferences, and funders in AI economics should set clear reproducibility standards (not just “share if possible”) and require (or strongly incentivize) executable artifacts at submission or acceptance.
    • Support for data stewardship: funders could finance the extra labor of preparing reproducible artifacts (data curation, containerization) as eligible project costs.
    • Badge systems and curated artifact reviews: adopt or adapt artifact-review/badging schemes that attest to accessibility and computational reproducibility.
    • Handling proprietary/private data: create standard contractual templates and technical workflows (e.g., synthetic datasets, privacy-preserving releases, audited enclaves) so reproducibility is achievable even for sensitive datasets.
  • Research agenda and measurement

    • AI economics should track reproducibility rates over time (by venue/subfield) to measure progress and guide interventions.
    • Economic research could model the cost–benefit tradeoffs of reproducibility requirements (time to prepare vs. increased reuse and error-detection benefits) to inform policy on mandating reproducibility.
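
Several items in the checklist above (an environment specification plus an automated run script that regenerates every reported table and figure) can sit behind a single entry point. A minimal run-all sketch, assuming a hypothetical repository with numbered scripts under scripts/; the file names are placeholders, not from the paper:

```python
"""run_all.py: regenerate every reported table and figure from the raw data."""
import subprocess
import sys
from pathlib import Path

# Scripts listed in the order they must run, removing any guesswork about
# execution order, one of the obstacles the study identifies.
PIPELINE = [
    "scripts/01_clean_data.py",    # raw data -> analysis-ready dataset
    "scripts/02_fit_models.py",    # statistical models reported in the paper
    "scripts/03_make_figures.py",  # all tables and figures
]

def main() -> None:
    Path("output").mkdir(exist_ok=True)
    # Record the exact package versions used for this run: a lightweight form
    # of environment capture when a container image is not provided.
    freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],
        capture_output=True, text=True, check=True,
    )
    Path("output/environment.txt").write_text(freeze.stdout, encoding="utf-8")
    for script in PIPELINE:
        print(f"Running {script} ...")
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    main()
```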
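
The testing norm above (smoke tests that verify key summary statistics) needs only a few lines per headline number. A minimal pytest-style sketch; the file name, column, and expected value are placeholders rather than figures from any paper:

```python
import csv
import math
from pathlib import Path

# Placeholders: point these at the cleaned dataset and the headline statistic
# actually reported in the manuscript.
DATA_FILE = Path("output/cleaned_responses.csv")
EXPECTED_MEAN_RATING = 4.2
TOLERANCE = 0.01  # allow for rounding in the published value

def test_headline_mean_matches_paper() -> None:
    with open(DATA_FILE, newline="", encoding="utf-8") as f:
        ratings = [float(row["rating"]) for row in csv.DictReader(f)]
    observed = sum(ratings) / len(ratings)
    assert math.isclose(observed, EXPECTED_MEAN_RATING, abs_tol=TOLERANCE), (
        f"mean rating {observed:.3f} does not match reported {EXPECTED_MEAN_RATING}"
    )
```

Run in continuous integration after each release, such a check flags broken pipelines or drifting dependencies before a reader encounters them.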

Overall, the HCI study highlights a familiar pattern likely relevant to AI economics: the existence of shared materials does not guarantee reproducibility. For AI economics to deliver reliable, policy-relevant findings, the field should adopt stronger norms, tooling (containers, environment specs), and incentives (funding, badges, credit) to ensure that shared code and data are actually usable.

Assessment

Paper Type: review_meta
Evidence Strength: medium. Direct, empirical reproduction attempts across an entire set of CHI papers with shared artifacts provide concrete evidence about reproducibility rates and failure modes, strengthened by triangulation with surveys and interviews; however, the sample is self-selected (only papers that already shared data/code), reproduction effort and protocols can vary, and findings are specific to HCI/CHI, limiting external validity.
Methods Rigor: medium. The study systematically attempted to run shared artifacts, classified outcomes, and combined this with author surveys and interviews for qualitative insight (good triangulation and transparency); yet details on reproduction protocols, effort allocation, and date range are unspecified, and selection bias (only papers that shared artifacts) plus potential variability across reproduction attempts reduce methodological rigor.
Sample: All CHI conference papers (date range unspecified) that publicly shared study data and analysis code, where the authors attempted to re-run the available artifacts and classified outcomes (fully/partially/not reproducible), supplemented by a survey of paper authors and follow-up interviews collecting motivations, obstacles, and practices.
Themes: governance, adoption
Generalizability:
  • Restricted to CHI (HCI) conference papers and may not generalize to other fields (e.g., economics, computer science, industry studies).
  • Sample limited to papers that already shared data/code (self-selection bias); likely overestimates reproducibility among all papers.
  • Reproducibility outcomes depend on the effort, expertise, and environment of the reproducing team; results may vary with different protocols.
  • Findings may not extend to studies using proprietary, confidential, or large-scale production datasets common in AI economics.
  • Time-bound practices (tooling, dependency management) evolve, so rates may change over time.

Claims (14)

Each claim is listed with its category, direction, confidence, the outcome measure it concerns, and any reported values.

  • The authors were able to fully reproduce the reported results for 49% of CHI papers that had publicly shared study data and analysis code.
    Research Productivity · mixed · high · Outcome: proportion of papers whose reported results could be fully reproduced from the shared data and analysis code · 49% · 0.24
  • Reproducibility (as used in this study) is defined as producing the reported results from the shared data and analysis code, distinct from replicability, which involves independent re-collection of data.
    Research Productivity · null_result · high · Outcome: operational definition of 'reproducibility' (ability to re-run provided data+code to obtain reported results) · 0.24
  • A common barrier to reproducing results is missing or incomplete data, or data not accessible in the exact form used in the paper.
    Research Productivity · negative · medium · Outcome: frequency/prevalence of missing or incomplete data as a cause of irreproducibility · 0.14
  • Incomplete, non-runnable, or poorly documented analysis code is a frequent obstacle to reproducibility.
    Research Productivity · negative · medium · Outcome: incidence of code-related failures preventing reproduction (non-runnable or poorly documented code) · 0.14
  • Unspecified preprocessing steps, parameter settings, or random seeds often prevent exact reproduction of reported results.
    Research Productivity · negative · medium · Outcome: occurrence of undocumented preprocessing/parameter choices as a barrier to reproducing results · 0.14
  • Environment and dependency issues (library versions, platform differences) are common reproducibility problems.
    Research Productivity · negative · medium · Outcome: frequency of environment/dependency issues causing irreproducibility · 0.14
  • Time/resource costs for re-running analyses and lack of computational environment capture (e.g., Docker/conda containers) increase the difficulty of reproducing results.
    Research Productivity · negative · medium · Outcome: reported burden (time/compute) and absence of environment capture as barriers to reproduction · 0.14
  • Ethical, privacy, and legal restrictions sometimes limit the ability to share data and thereby hamper reproducibility.
    Research Productivity · negative · high · Outcome: incidence of data-sharing restrictions affecting reproducibility · 0.24
  • Authors who shared artifacts cited motivations such as transparency, community norms, potential re-use, and perceived credit for sharing.
    Adoption Rate · positive · medium · Outcome: self-reported motivations for artifact sharing among CHI paper authors · 0.14
  • Practical enablers of reproducibility include clear documentation (README, data dictionaries), executable artifacts (notebooks, runnable scripts), explicit environment specification (Docker/conda), provenance of preprocessing steps, and persistent hosting (DOIs).
    Research Productivity · positive · medium · Outcome: presence of documentation/executable/environment artifacts associated with successful reproduction · 0.14
  • The authors recommend adopting standards and checklists, encouraging or requiring executable artifacts, training researchers in reproducible workflows, improving incentives (credit/badges), and providing infrastructure and reviewer guidelines to evaluate artifacts.
    Governance And Regulation · positive · medium · Outcome: recommended policy/practice changes intended to increase reproducibility (not directly measured) · 0.14
  • The study population was restricted to CHI conference papers that had publicly shared study data and analysis code (a self-selected subset), which introduces a self-selection bias that may overestimate reproducibility rates for the broader set of CHI papers.
    Research Productivity · negative · high · Outcome: generalizability of the measured reproducibility rate (bias due to sampling) · 0.24
  • The authors elicited additional insights via a survey of paper authors plus follow-up interviews to collect self-assessments of reproducibility and qualitative explanations for obstacles and motivations.
    Research Productivity · null_result · high · Outcome: use of surveys and interviews as data sources for qualitative corroboration and explanation · 0.24
  • Reproducibility is a practical and valuable goal for the HCI field even where full independent replication remains contested.
    Research Productivity · positive · medium · Outcome: assessment of reproducibility's attainability and value (conceptual/argumentative claim rather than an empirically measured outcome) · 0.14

Notes