The Commonplace

Only about half of CHI papers that share data and code are fully reproducible; missing or poorly documented data, non-runnable analysis code, and environment mismatches are the main culprits. The authors urge mandatory executable artifacts, environment capture (containers), clear documentation, and better incentives to raise reproducibility and strengthen empirical credibility.

On the Computational Reproducibility of Human-Computer Interaction
Olga Iarygina, Kasper Anders Søren Hornbæk, Aske Mottelson (ORCID 0000-0003-1827-8513) · Fetched March 10, 2026 · IT University of Copenhagen
openalex · review_meta · medium evidence · 7/10 relevance · DOI · Source PDF
Attempting to rerun shared data and code from CHI papers, the authors fully reproduced the reported results in 49% of cases. They identify missing data, non-runnable or undocumented code, and preprocessing and environment issues as common barriers, and recommend standards, executable artifacts, and infrastructure to improve reproducibility.

An increasing number of HCI researchers have embraced open science ideas, such as sharing data and analysis code. However, such practices are only meaningful when the shared data and code enable other researchers to reproduce and reuse the reported findings. To investigate the reproducibility of HCI research, we identified all CHI papers that have shared study data and analysis code, and attempted to reproduce the results. We were able to fully reproduce 49% of the papers. We surveyed and interviewed authors, asking them to assess the reproducibility of their own work and to reflect on their motivations and obstacles in doing open science. We discuss what improves and hinders reproducibility and provide recommendations on how to increase reproducibility rates in HCI. While the value of replicability remains contested in HCI, we argue that the more modest goal of reproducibility is desirable.

Summary

Main Finding

The authors attempted to reproduce results from all CHI papers that had publicly shared study data and analysis code. They were able to fully reproduce the reported results for 49% of those papers. Through surveys and interviews with paper authors, they identified common motivations, obstacles, and practices that affect reproducibility, and they propose concrete recommendations to raise reproducibility rates in HCI. They argue that reproducibility (re-running the same data+code to obtain reported results) is a practical and valuable goal for the field even where full replicability (independent repetition) remains contested.

Key Points

  • Reproducibility rate: 49% of CHI papers with shared data and code were fully reproducible.
  • Definition: Reproducibility here means producing the reported results from the shared data and analysis code; distinct from replicability (new data collection).
  • Common barriers to reproducibility:
    • Missing or incomplete data, or data not accessible in the form used in the paper.
    • Incomplete, non-runnable, or poorly documented analysis code.
    • Unspecified preprocessing steps, parameter settings, or random seeds.
    • Environment and dependency issues (library versions, platform differences).
    • Time/resource costs for re-running analyses; lack of computational environment capture (containers/notebooks).
    • Ethical/privacy/legal restrictions limiting data sharing.
  • Motivations cited by authors for sharing: transparency, community norms, potential re-use, and perceived credit.
  • Practical enablers: clear documentation (readme, data dictionaries), executable artifacts (notebooks, scripts), environment specification (Docker/conda, container images), provenance of preprocessing steps, and persistent hosting (DOIs).
  • Recommendations (high level): adopt standards and checklists, require or encourage executable artifacts, train researchers in reproducible workflows, improve incentives (credit/badges), and provide infrastructure and reviewer guidelines to evaluate artifacts.
  • Positioning: Paper argues reproducibility is a modest, attainable goal that yields meaningful gains in credibility and reuse even when full replication is difficult or contentious.
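The enablers listed above (explicit seeds, environment capture, a single runnable entry point) can be sketched as a minimal analysis script. This is an illustrative sketch, not code from the paper: the file layout, seed value, and simulated "study data" are all invented for the example.

```python
# Sketch of a reproducible analysis entry point: a fixed random seed,
# recorded environment details, and one script that regenerates a
# reported statistic deterministically. All names/values are hypothetical.
import json
import platform
import random
import statistics

SEED = 42  # documenting the seed makes reruns bit-identical

def run_analysis() -> dict:
    random.seed(SEED)
    # Stand-in for loading shared study data; here, simulated measurements.
    data = [random.gauss(100, 15) for _ in range(500)]
    return {
        "mean": round(statistics.mean(data), 2),
        "sd": round(statistics.stdev(data), 2),
        "python": platform.python_version(),  # capture the environment
    }

if __name__ == "__main__":
    # Emitting results as structured output lets reviewers diff reruns
    # against the numbers reported in the paper.
    print(json.dumps(run_analysis(), indent=2))
```

Because the seed is fixed, two runs on the same interpreter produce identical numbers, which is exactly the property the reproduction attempts in the paper were testing for.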

Data & Methods

  • Population: All CHI conference papers that had publicly shared study data and analysis code (authors identified this set; exact date range unspecified in the summary).
  • Reproduction attempts: The authors tried to run the shared artifacts to obtain the reported results. Outcomes were classified (e.g., fully reproducible, partially reproducible, not reproducible).
  • Author elicitation: A survey of paper authors plus follow-up interviews to collect self-assessments of reproducibility and qualitative insights about motivations and obstacles.
  • Analysis: Triangulation of empirical reproduction outcomes with survey/interview responses to identify common failure modes and effective practices.
  • Limitations noted (implicit from methods): sample restricted to CHI papers that already shared artifacts (self-selection bias — results may overestimate reproducibility among all papers); reproducibility attempts depend on effort allocation and may be sensitive to undocumented context; findings are specific to HCI/CHI and may not generalize without adaptation.

Implications for AI Economics

  • Credibility of empirical work: Reproducibility is foundational for trusting empirical claims about AI’s economic impacts (productivity effects, labor market outcomes, welfare analysis). If near-half of shared artifacts in a well-resourced field are not reproducible, empirical AI-economics studies that rely on code/data sharing may face similar fragility.
  • Measurement quality and policy advice: Irreproducible analyses can lead to incorrect estimates (e.g., of adoption effects, complementarities, or distributional outcomes). Policy decisions and economic models that depend on such results risk being misinformed.
  • Incentives and institutions: AI economics should adopt similar reproducibility norms—artifact badges, mandatory artifact submission, registered reports, and reproducibility checks—so that published findings can be audited and built upon reliably.
  • Methodological practice: Economists working on AI should provide full analysis pipelines, containerized execution environments (Docker/virtual environments), data dictionaries, and synthetic or privacy-preserving versions of datasets when legal/ethical constraints prevent sharing raw data.
  • Cost and funding implications: Ensuring reproducibility requires researcher time and infrastructure. Funding agencies and journals should fund reproducibility support (compute credits, curation staff), and explicitly reward reproducible outputs in hiring/promotion and grant assessment.
  • Dealing with proprietary/confidential data: AI economics often uses firm or platform data that cannot be openly released. The paper's lessons motivate alternatives: reproducible synthetic datasets, safe data enclaves with audit logs, standardized code release with data access protocols, or third-party reproducibility audits.
  • Meta‑analysis and cumulative science: Higher reproducibility enables reliable meta-analyses and the accumulation of evidence about AI’s economic effects. Without it, combining studies or estimating general equilibrium effects becomes riskier.
  • Practical checklist for AI economics (derived from paper’s recommendations):
    • Release analysis code and (where possible) data, or provide clear data access procedures.
    • Include a runnable script/notebook that reproduces key tables/figures.
    • Provide environment capture (container, requirements.txt) and specify versions.
    • Document preprocessing, random seeds, and parameter choices.
    • Use persistent hosting (DOI) and standard metadata.
    • When data cannot be shared, provide code plus simulated/synthetic data and a detailed data schema.
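The last checklist item, releasing code plus synthetic data and a documented schema when raw data cannot be shared, can be sketched as follows. The field names, distributions, and values here are invented for illustration and are not from the paper.

```python
# Hypothetical sketch: publish the data schema plus a seeded synthetic
# generator so shared analysis code still runs end to end, even when the
# real dataset cannot be released.
import csv
import io
import random

SCHEMA = [
    ("participant_id", "int"),
    ("condition", "str"),      # e.g., "control" / "treatment"
    ("task_time_s", "float"),  # task completion time in seconds
]

def synthesize(n_rows: int, seed: int = 0) -> str:
    """Return CSV text matching the documented schema, with fabricated values."""
    rng = random.Random(seed)  # seeded so the synthetic data is itself reproducible
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([name for name, _ in SCHEMA])
    for i in range(n_rows):
        writer.writerow([
            i,
            rng.choice(["control", "treatment"]),
            round(rng.uniform(5.0, 60.0), 2),
        ])
    return buf.getvalue()
```

Pairing such a generator with the real analysis pipeline lets reviewers verify that the code runs and produces well-formed output, even when the published numbers can only be checked inside a data enclave.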

Overall, the study underlines that improving reproducibility is a tractable and high-value intervention for AI economics: it strengthens credibility, supports reuse and meta‑analysis, and reduces the risk that policy and economic conclusions rest on unreproducible results.

Assessment

Paper Type: review_meta

Evidence Strength: medium — Direct, empirical reproduction attempts across an entire set of CHI papers with shared artifacts provide concrete evidence about reproducibility rates and failure modes, strengthened by triangulation with surveys and interviews. However, the sample is self-selected (only papers that already shared data/code), effort and protocols for reproduction can vary, and findings are specific to HCI/CHI, limiting external validity.

Methods Rigor: medium — The study systematically attempted to run shared artifacts, classified outcomes, and combined this with author surveys and interviews for qualitative insight (good triangulation and transparency). Yet details on reproduction protocols, effort allocation, and date range are unspecified, and selection bias (only papers that shared artifacts) plus potential variability in reproduction attempts reduce methodological rigor.

Sample: All CHI conference papers (date range unspecified) that publicly shared study data and analysis code; the authors attempted to re-run available artifacts and classified outcomes (fully/partially/not reproducible), supplemented by a survey of paper authors and follow-up interviews collecting motivations, obstacles, and practices.

Themes: governance, adoption

Generalizability:
  • Restricted to CHI (HCI) conference papers; may not generalize to other fields (e.g., economics, computer science, industry studies).
  • Sample limited to papers that already shared data/code (self-selection bias); likely overestimates reproducibility among all papers.
  • Reproducibility outcomes depend on the effort, expertise, and environment of the reproducing team; results may vary with different protocols.
  • Findings may not extend to studies using proprietary, confidential, or large-scale production datasets common in AI economics.
  • Time-bound practices (tooling, dependency management) evolve, so rates may change over time.

Claims (14)

Each claim is listed with its direction, confidence, measured outcome, and details.

  • The authors were able to fully reproduce the reported results for 49% of CHI papers that had publicly shared study data and analysis code. — Research Productivity (mixed), high confidence. Outcome: proportion of papers whose reported results could be fully reproduced from the shared data and analysis code. Details: 49%; 0.24.
  • Reproducibility (as used in this study) is defined as producing the reported results from the shared data and analysis code, distinct from replicability, which involves independent recollection of data. — Research Productivity (null result), high confidence. Outcome: operational definition of "reproducibility" (ability to re-run provided data+code to obtain reported results). Details: 0.24.
  • A common barrier to reproducing results is missing or incomplete data, or data not accessible in the exact form used in the paper. — Research Productivity (negative), medium confidence. Outcome: frequency/prevalence of missing or incomplete data as a cause of irreproducibility. Details: 0.14.
  • Incomplete, non-runnable, or poorly documented analysis code is a frequent obstacle to reproducibility. — Research Productivity (negative), medium confidence. Outcome: incidence of code-related failures preventing reproduction (non-runnable or poorly documented code). Details: 0.14.
  • Unspecified preprocessing steps, parameter settings, or random seeds often prevent exact reproduction of reported results. — Research Productivity (negative), medium confidence. Outcome: occurrence of undocumented preprocessing/parameter choices as a barrier to reproducing results. Details: 0.14.
  • Environment and dependency issues (library versions, platform differences) are common reproducibility problems. — Research Productivity (negative), medium confidence. Outcome: frequency of environment/dependency issues causing irreproducibility. Details: 0.14.
  • Time/resource costs for re-running analyses and lack of computational environment capture (e.g., Docker/conda containers) increase the difficulty of reproducing results. — Research Productivity (negative), medium confidence. Outcome: reported burden (time/compute) and absence of environment capture as barriers to reproduction. Details: 0.14.
  • Ethical, privacy, and legal restrictions sometimes limit the ability to share data and thereby hamper reproducibility. — Research Productivity (negative), high confidence. Outcome: incidence of data-sharing restrictions affecting reproducibility. Details: 0.24.
  • Authors who shared artifacts cited motivations such as transparency, community norms, potential re-use, and perceived credit for sharing. — Adoption Rate (positive), medium confidence. Outcome: self-reported motivations for artifact sharing among CHI paper authors. Details: 0.14.
  • Practical enablers of reproducibility include clear documentation (readme, data dictionaries), executable artifacts (notebooks, runnable scripts), explicit environment specification (Docker/conda), provenance of preprocessing steps, and persistent hosting (DOIs). — Research Productivity (positive), medium confidence. Outcome: presence of documentation/executable/environment artifacts associated with successful reproduction. Details: 0.14.
  • The authors recommend adopting standards and checklists, encouraging or requiring executable artifacts, training researchers in reproducible workflows, improving incentives (credit/badges), and providing infrastructure and reviewer guidelines to evaluate artifacts. — Governance And Regulation (positive), medium confidence. Outcome: recommended policy/practice changes intended to increase reproducibility (not directly measured). Details: 0.14.
  • The study population was restricted to CHI conference papers that had publicly shared study data and analysis code (a self-selected subset), which introduces a self-selection bias that may overestimate reproducibility rates for the broader set of CHI papers. — Research Productivity (negative), high confidence. Outcome: generalizability of the measured reproducibility rate (bias due to sampling). Details: 0.24.
  • The authors elicited additional insights via a survey of paper authors plus follow-up interviews to collect self-assessments of reproducibility and qualitative explanations for obstacles and motivations. — Research Productivity (null result), high confidence. Outcome: use of surveys and interviews as data sources for qualitative corroboration and explanation. Details: 0.24.
  • Reproducibility is a practical and valuable goal for the HCI field even where full independent replication remains contested. — Research Productivity (positive), medium confidence. Outcome: assessment of reproducibility's attainability and value (a conceptual/argumentative claim rather than an empirically measured outcome). Details: 0.14.

Notes