Only about half of CHI papers that publish data and code are fully reproducible: missing or poorly documented data, non-runnable analysis code, and environment mismatches are the main culprits. The authors urge mandatory executable artifacts, environment capture (containers), clear documentation, and incentives to raise reproducibility and strengthen empirical credibility.
An increasing number of HCI researchers have embraced open science ideas, such as sharing data and analysis code. However, such practices are only meaningful when the shared data and code enable other researchers to reproduce and reuse the reported findings. To investigate the reproducibility of HCI research, we identified all CHI papers that have shared study data and analysis code, and attempted to reproduce the results. We were able to fully reproduce 49% of the papers. We surveyed and interviewed authors, asking them to assess the reproducibility of their own work and to reflect on their motivations and obstacles in doing open science. We discuss what improves and hinders reproducibility and provide recommendations on how to increase reproducibility rates in HCI. While the value of replicability remains contested in HCI, we argue that the more modest goal of reproducibility is desirable.
Summary
Main Finding
The authors attempted computational reproduction of 76 CHI papers (2007–2024) that openly provided both study data and analysis code. They fully reproduced 36 papers (≈49%), partially reproduced 23 (≈30%), and could not reproduce 15 (≈20%). Only about 1% of all CHI papers had repositories that met the inclusion criteria (data + analysis code). The study finds that while HCI increasingly adopts open-science rhetoric, the quality and usability of shared materials are often insufficient to guarantee reproducibility.
Key Points
- Scope and outcome
- Sample: 76 CHI papers with publicly linked repositories (GitHub, OSF, Zenodo) that contained both data and analysis code.
- Results: 36 fully reproducible, 23 partially reproducible, 15 not reproducible.
- Sharing rate: roughly 1 in 100 CHI papers published code+data in a way that enabled a reproduction attempt.
- Common obstacles to reproducibility
- Broken or private links; missing or incomplete data files.
- Lack of README or minimal/poor documentation; no codebook for variables.
- Multiple scripts with no indicated execution order; hard-coded paths.
- Undocumented custom functions and sparse code comments; large outputs that hide reported values.
- OS- or platform-specific dependencies, and reliance on obscure or nonstandard data formats.
- Discrepancies between code output and reported paper values.
- Helpful practices identified
- Detailed README/Wiki: repository structure, software requirements, execution order.
- Use of RStudio project files (.Rproj) or other project scaffolding to simplify paths.
- Clear variable names or codebooks, standard data formats (.csv/.json), annotated scripts.
- Authors asking colleagues to test-run their code before release; keeping repositories on FAIR-compliant platforms (OSF, Zenodo).
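The path- and ordering-related practices above can be sketched in a few lines. This is a hypothetical illustration (the file names, directory layout, and `data_path` helper are my own, not from any repository the paper examined): paths are built relative to the project root instead of being hard-coded, and the script execution order is stated explicitly rather than left implicit.

```python
from pathlib import Path

# Resolve everything relative to the project root instead of a hard-coded
# absolute path like "/Users/alice/study/data.csv". The globals() guard
# keeps this working even when the snippet is run interactively.
PROJECT_ROOT = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

def data_path(*parts: str) -> Path:
    """Build a path under the project's data/ directory (hypothetical layout)."""
    return PROJECT_ROOT.joinpath("data", *parts)

# Documented execution order, as a README might list it (illustrative names):
EXECUTION_ORDER = ["01_clean.py", "02_analyze.py", "03_figures.py"]

print(data_path("raw", "responses.csv"))
```

An RStudio `.Rproj` file achieves the same effect for R projects: opening the project sets the working directory to the repository root, so relative paths resolve consistently on any machine.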
- Author survey & interviews
- Authors cited mixed motivation; barriers include lack of incentives, collaborator resistance, and effort required to prepare reproducible artifacts.
- Recommendations from interviewed authors: clearer instructions, better documentation, consistent naming, and asking peers to execute the code pre-release.
- Technical context
- Most analysis code written in Python or R; smaller use of SPSS, MATLAB, JASP.
- Reproduction attempts were performed on a MacBook Pro (Apple M3, macOS 15.4.1); troubleshooting beyond ~45 minutes per blocking error was treated as unsuccessful.
Data & Methods
- Repository discovery
- Searched full texts of CHI papers (2007–2024) for occurrences of “git*”, “osf”, and “zenodo”.
- Extracted 836 occurrences → 654 unique URLs → manual coding → 76 repositories that contained both analysis code and relevant data.
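The discovery step described above can be sketched as a pattern search followed by deduplication. The regular expression and toy corpus below are my own assumptions, not the authors' actual pipeline; they only illustrate the "occurrences → unique URLs" funnel.

```python
import re

# Toy corpus standing in for CHI full texts; the real study scanned
# papers from 2007-2024 for "git*", "osf", and "zenodo" mentions.
texts = [
    "Materials: https://github.com/lab/study-code and https://osf.io/abc12/",
    "Archived at https://zenodo.org/record/123456. See https://osf.io/abc12/",
]

# Hypothetical pattern covering the three platforms named in the paper.
URL_RE = re.compile(r"https?://(?:[\w.-]*git[\w.-]*|osf\.io|zenodo\.org)/\S+")

# Strip trailing punctuation that the regex greedily captures, then dedupe.
occurrences = [m.group(0).rstrip(".,)") for t in texts for m in URL_RE.finditer(t)]
unique_urls = sorted(set(occurrences))
print(len(occurrences), len(unique_urls))
```

In the study, this kind of automated extraction was only the first pass; the 654 unique URLs were then manually coded to identify the 76 repositories containing both data and analysis code.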
- Inclusion criteria
- Repositories had to publicly include the data and the code used to generate the reported study findings.
- Platforms considered: GitHub, OSF, Zenodo (chosen for prevalence and FAIR considerations).
- Reproduction workflow
- For each repo: locate repository → inspect README/Wiki → identify data files → run code → compare outputs (numerics/figures) to paper.
- Execution environment: local machine (MacBook Pro as above). Languages: mainly Python and R.
- Time policy: following prior literature, attempts in which more than 45 minutes of troubleshooting did not resolve a critical blocking error were coded as unsuccessful.
- Outcome coding
- Fully reproducible: reproduced the exact reported numerical/figure outputs (allowed minor fixes like path updates or inferring variable names).
- Partially reproducible: some but not all outputs matched.
- Not reproducible: could not run the code to produce matching outputs or required materials were missing.
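The outcome coding above can be sketched as a comparison of reproduced values against reported ones. The function, tolerance, and example statistics below are my own assumptions, not the paper's exact comparison procedure:

```python
import math

def code_outcome(reported: dict, reproduced: dict, rel_tol: float = 1e-3) -> str:
    """Classify a reproduction attempt by comparing reproduced values
    against the values reported in the paper (hypothetical scheme)."""
    matched = sum(
        1 for key, value in reported.items()
        if key in reproduced and math.isclose(reproduced[key], value, rel_tol=rel_tol)
    )
    if matched == len(reported):
        return "fully reproducible"
    if matched > 0:
        return "partially reproducible"
    return "not reproducible"

# Illustrative reported statistics and two reproduction attempts.
reported = {"mean_rt": 412.3, "t_stat": 2.41, "p_value": 0.018}
print(code_outcome(reported, {"mean_rt": 412.3, "t_stat": 2.41, "p_value": 0.018}))
print(code_outcome(reported, {"mean_rt": 412.3, "t_stat": 2.52, "p_value": 0.031}))
```

In practice the authors also compared figures visually, and allowed minor fixes (path updates, inferred variable names) before coding a paper as fully reproducible.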
Implications for AI Economics
The paper’s findings have several direct implications for research and policy in AI economics, a subfield that frequently depends on computational analyses, large datasets, and complex modeling.
- Reproducibility as a credibility and policy-risk issue
- Empirical AI economics informs regulatory policy, market design, investment decisions, and firm strategy. Low computational reproducibility weakens confidence in results used to shape high-stakes economic decisions.
- Nonreproducible results increase the risk of misguided policies or misallocation of capital in AI-driven markets.
- Economic costs and inefficiencies
- Time and effort wasted attempting to rerun results (missing data, broken pipelines) represent inefficiencies. Reproducibility failures add friction to cumulative science, raising the social cost of research verification.
- For funders and institutions, poor reproducibility reduces the return on R&D investments because outputs are less reusable or verifiable.
- Incentives and public-good nature of shared artifacts
- Code and cleaned datasets are non-rival public goods with positive externalities (facilitating replication, extension, meta-analysis). Market/private incentives often underprovide them.
- Policy levers: funders and journals can internalize these externalities via mandates, artifacts badges, or crediting systems (citations, data DOIs, promotion criteria) to encourage sharing and adequate documentation.
- Practical recommendations for AI economics researchers
- Platform & persistence: prefer stable, FAIR-compliant archives (OSF, Zenodo) or provide archived releases (e.g., GitHub + Zenodo DOI) to ensure long-term access.
- Minimum reproducibility checklist: README with execution order and environment; environment specification (package versions, Docker/Conda/renv or container images); data schema/codebook; automated run script that produces all reported tables/figures.
- Sensitive and proprietary data: when underlying firm/platform data cannot be shared, researchers should (a) share code and synthetic or example datasets, (b) provide precise queries and data generation scripts, (c) use data enclaves or audited reproducibility procedures, or (d) include an independent auditor or replicability review as part of publication where feasible.
- Use of containers and CI: distribute Docker/OCI images or reproducible environments, and use continuous-integration tests to ensure scripts run after release.
- Documentation & testing norms: require minimal unit tests or smoke tests that verify key summary statistics and enable rapid triage of execution issues.
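The smoke-test norm in the checklist above can be made concrete with a few lines of standard-library Python. Everything here is hypothetical (the inline CSV, the column names, and the reported value stand in for a real cleaned dataset and a real paper statistic): the test recomputes a key summary statistic and checks it against the published value before release.

```python
"""A minimal smoke test of the kind recommended above: recompute a key
summary statistic from a (hypothetical) cleaned dataset and check it
against the value reported in the paper."""
import csv
import io
import statistics

# Stand-in for data/cleaned.csv; a real repo would read this from disk.
CLEANED_CSV = """participant,condition,score
p1,A,4.0
p2,A,5.0
p3,B,3.0
p4,B,4.0
"""

REPORTED_MEAN_SCORE = 4.0  # value claimed in the paper (hypothetical)

def test_mean_score_matches_paper() -> None:
    rows = list(csv.DictReader(io.StringIO(CLEANED_CSV)))
    mean_score = statistics.fmean(float(r["score"]) for r in rows)
    assert abs(mean_score - REPORTED_MEAN_SCORE) < 1e-9, mean_score

test_mean_score_matches_paper()
print("smoke test passed")
```

Run in CI on every release, a test like this catches exactly the failure mode the study observed most often: code that no longer runs, or runs but no longer produces the reported values.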
- Institutional and policy levers tailored to AI economics
- Journals, conferences, and funders in AI economics should set clear reproducibility standards (not just “share if possible”) and require (or strongly incentivize) executable artifacts at submission or acceptance.
- Support for data stewardship: funders could finance the extra labor of preparing reproducible artifacts (data curation, containerization) as eligible project costs.
- Badge systems and curated artifact reviews: adopt or adapt artifact-review/badging schemes that attest to accessibility and computational reproducibility.
- Handling proprietary/private data: create standard contractual templates and technical workflows (e.g., synthetic datasets, privacy-preserving releases, audited enclaves) so reproducibility is achievable even for sensitive datasets.
- Research agenda and measurement
- AI economics should track reproducibility rates over time (by venue/subfield) to measure progress and guide interventions.
- Economic research could model the cost–benefit tradeoffs of reproducibility requirements (time to prepare vs. increased reuse and error-detection benefits) to inform policy on mandating reproducibility.
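One stylized way to write down the tradeoff named above (all notation is mine, not from the paper, and the model is deliberately minimal):

```latex
% Stylized cost-benefit model (illustrative, not from the paper):
%   C_p : one-time cost of preparing a reproducible artifact
%   n   : expected number of follow-up attempts (reuse, verification)
%   p   : probability a given attempt actually reuses the artifact
%   B_r : expected benefit per successful reuse (error detection, extension)
\mathbb{E}[\text{net benefit}] = p \, n \, B_r - C_p
% A reproducibility requirement pays off in expectation when p n B_r > C_p,
% so policy can work on either side: lowering C_p (tooling, funded curation)
% or raising p n B_r (badges, credit, discoverable archives).
```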
Overall, the HCI study highlights a familiar pattern likely relevant to AI economics: the existence of shared materials does not guarantee reproducibility. For AI economics to deliver reliable, policy-relevant findings, the field should adopt stronger norms, tooling (containers, environment specs), and incentives (funding, badges, credit) to ensure that shared code and data are actually usable.
Assessment
Claims (14)
| Claim | Category | Direction | Confidence | Outcome measured | Details |
|---|---|---|---|---|---|
| The authors were able to fully reproduce the reported results for 49% of CHI papers that had publicly shared study data and analysis code. | Research Productivity | mixed | high | proportion of papers whose reported results could be fully reproduced from the shared data and analysis code | 49%; 0.24 |
| Reproducibility (as used in this study) is defined as producing the reported results from the shared data and analysis code, distinct from replicability, which involves independent recollection of data. | Research Productivity | null_result | high | operational definition of 'reproducibility' (ability to re-run provided data+code to obtain reported results) | 0.24 |
| A common barrier to reproducing results is missing or incomplete data, or data not accessible in the exact form used in the paper. | Research Productivity | negative | medium | frequency/prevalence of missing or incomplete data as a cause of irreproducibility | 0.14 |
| Incomplete, non-runnable, or poorly documented analysis code is a frequent obstacle to reproducibility. | Research Productivity | negative | medium | incidence of code-related failures preventing reproduction (non-runnable or poorly documented code) | 0.14 |
| Unspecified preprocessing steps, parameter settings, or random seeds often prevent exact reproduction of reported results. | Research Productivity | negative | medium | occurrence of undocumented preprocessing/parameter choices as a barrier to reproducing results | 0.14 |
| Environment and dependency issues (library versions, platform differences) are common reproducibility problems. | Research Productivity | negative | medium | frequency of environment/dependency issues causing irreproducibility | 0.14 |
| Time/resource costs for re-running analyses and lack of computational environment capture (e.g., Docker/conda containers) increase the difficulty of reproducing results. | Research Productivity | negative | medium | reported burden (time/compute) and absence of environment capture as barriers to reproduction | 0.14 |
| Ethical, privacy, and legal restrictions sometimes limit the ability to share data and thereby hamper reproducibility. | Research Productivity | negative | high | incidence of data-sharing restrictions affecting reproducibility | 0.24 |
| Authors who shared artifacts cited motivations such as transparency, community norms, potential re-use, and perceived credit for sharing. | Adoption Rate | positive | medium | self-reported motivations for artifact sharing among CHI paper authors | 0.14 |
| Practical enablers of reproducibility include clear documentation (README, data dictionaries), executable artifacts (notebooks, runnable scripts), explicit environment specification (Docker/conda), provenance of preprocessing steps, and persistent hosting (DOIs). | Research Productivity | positive | medium | presence of documentation/executable/environment artifacts associated with successful reproduction | 0.14 |
| The authors recommend adopting standards and checklists, encouraging or requiring executable artifacts, training researchers in reproducible workflows, improving incentives (credit/badges), and providing infrastructure and reviewer guidelines to evaluate artifacts. | Governance And Regulation | positive | medium | recommended policy/practice changes intended to increase reproducibility (not directly measured) | 0.14 |
| The study population was restricted to CHI conference papers that had publicly shared study data and analysis code (a self-selected subset), which introduces a self-selection bias that may overestimate reproducibility rates for the broader set of CHI papers. | Research Productivity | negative | high | generalizability of the measured reproducibility rate (bias due to sampling) | 0.24 |
| The authors elicited additional insights via a survey of paper authors plus follow-up interviews to collect self-assessments of reproducibility and qualitative explanations for obstacles and motivations. | Research Productivity | null_result | high | use of surveys and interviews as data sources for qualitative corroboration and explanation | 0.24 |
| Reproducibility is a practical and valuable goal for the HCI field even where full independent replication remains contested. | Research Productivity | positive | medium | assessment of reproducibility's attainability and value (conceptual/argumentative claim rather than an empirically measured outcome) | 0.14 |