Fine-tuned language models can mimic journals' editorial "taste", predicting publication outcomes far better than zero-shot frontier models and better than panels of editors; the best model trained on management records reached 59% accuracy on a four-tier benchmark, and models trained on economics records reached about 70%, though they risk entrenching historical biases if deployed without safeguards.

Machines acquire scientific taste from institutional traces

Ziqin Gong, Ning Li, Huaikang Zhou · March 17, 2026

arXiv · descriptive · medium evidence · relevance 7/10

Fine-tuning language models on historical journal accept/reject records produces models that predict which research pitches will be published far better than zero-shot frontier models and outperform majority-vote panels of editors, with good calibration and transfer across related tasks and fields.

Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI's reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.

Summary

Main Finding

Fine-tuning language models on historical journal publication decisions recovers an evaluative "scientific taste" that both frontier (zero-shot) models and expert editor panels cannot reliably reproduce. Fine-tuned models trained on publication records outperform state-of-the-art models and human editors at predicting which research pitches will be published, generalize across related evaluative tasks, and can be trained in other fields (economics) with even higher accuracy.

Key Points

"Taste" here means the ability to judge which untested research ideas deserve pursuit, a judgment exercised by editors and funders but previously not articulated or automated.
Eleven frontier language models (proprietary and open) averaged 31% accuracy on a held-out four-tier benchmark of management research pitches (chance ≈25%); this is only marginally above chance.
Panels of journal editors and editorial board members reach 42% accuracy by majority vote on the same benchmark.
Fine-tuned models trained on years of journal publication decisions (i.e., institutional accept/reject records) each outperform every frontier model and the expert panel; the best single model achieves 59% accuracy.
Models show well-calibrated confidence: their highest-confidence predictions are 100% accurate.
The learned evaluative signal transfers to untrained tasks such as pairwise comparisons and one-sentence summaries.
The mechanism generalizes to another field: models trained on economics publication records reach 70% accuracy on a similar benchmark.
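
To make the accuracy figures above concrete, here is a minimal Python sketch of how majority-vote accuracy and the chance baseline on a four-tier benchmark could be computed. The toy votes, the tie-breaking rule, and the function names are illustrative assumptions, not the paper's data or procedure; the 25% chance level assumes the four tiers are equally represented.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of items whose predicted tier equals the true tier."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_vote(votes):
    """Most common tier among one panel's votes.

    Ties are broken toward the numerically lowest tier; this rule is an
    assumption made for the sketch, not something the source specifies.
    """
    counts = Counter(votes)
    top = max(counts.values())
    return min(tier for tier, count in counts.items() if count == top)


# Hypothetical toy data: four pitches with true tiers 1-4 and three panelists each.
labels = [1, 2, 3, 4]
panel_votes = [[1, 1, 2], [3, 2, 3], [3, 3, 4], [1, 4, 2]]

panel_preds = [majority_vote(v) for v in panel_votes]
panel_acc = accuracy(panel_preds, labels)

# Chance level for four tiers, assuming balanced classes.
chance = 1 / 4

print(f"panel accuracy = {panel_acc:.2f}, chance = {chance:.2f}")
```

The same `accuracy` function would apply unchanged to a model's tier predictions, which is what makes the 31%, 42%, and 59% figures directly comparable.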

Data & Methods

Benchmark: held-out set of research pitches in management labeled into four quality tiers (used to evaluate predictive performance).
Human baseline: panels of journal editors and editorial board members whose majority-vote accuracy was measured on the benchmark.
Frontier baseline: eleven state-of-the-art language models across major architectures evaluated zero-shot (or as provided) on the benchmark; averaged 31% accuracy.
Training signal: historical publication records (journal decisions/accept-reject outcomes) used to fine-tune language models to predict editorial decisions.
Evaluation metrics: tier prediction accuracy on held-out benchmark; calibration of confidence (e.g., accuracy among highest-confidence predictions); transfer performance on related tasks (pairwise comparisons, one-sentence summaries).
Cross-field test: analogous fine-tuning on economics publication records to test generality (yielding ~70% accuracy).
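
The "accuracy among highest-confidence predictions" metric listed above can be sketched as follows. This is an illustrative Python example under stated assumptions: the function name, the toy confidences, and the choice to rank by confidence and keep a top fraction are hypothetical, and the source does not say how the paper defines its highest-confidence subset.

```python
import math


def accuracy_at_top_confidence(confidences, correct, fraction):
    """Accuracy over the `fraction` of predictions with the highest confidence.

    confidences: model confidence for each prediction (higher = more sure)
    correct:     booleans, True where the prediction matched the true tier
    fraction:    share of predictions to keep, in (0, 1]
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(confidences)))
    # Rank predictions from most to least confident and keep the top k.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    return sum(is_correct for _, is_correct in ranked[:k]) / k


# Hypothetical toy predictions (not the paper's results).
confidences = [0.95, 0.90, 0.70, 0.60, 0.55, 0.40, 0.35, 0.30]
correct = [True, True, True, False, True, False, False, True]

top_quarter = accuracy_at_top_confidence(confidences, correct, 0.25)
top_half = accuracy_at_top_confidence(confidences, correct, 0.50)
overall = accuracy_at_top_confidence(confidences, correct, 1.0)
```

In this toy data accuracy rises as the kept fraction shrinks, which is the pattern a usefully calibrated model would show; a perfect score on a small top slice alone does not establish calibration across the full confidence range.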

Implications for AI Economics

Practical applications
- Scalable triage for growing volumes of submissions (journals, conferences, working papers, grant proposals): prioritize promising work for human review.
- Screening and routing: assign reviewers/editors or highlight high-value ideas for funding agencies, labs, and research programs.
- Discovery and curation: help identify under-the-radar high-potential projects and reduce noise in information flows.
Opportunities for economic research
- Study how institutional records encode preferences and how extracting them changes incentives and the evolution of research topics.
- Analyze labor effects: potential shifts in demand for editorial/review labor and changes to career dynamics for researchers.
- Model-based policy experiments: simulate how automated triage would affect diversity, innovation, and disciplinary concentration.
Risks and cautions
- Historical bias capture: models learn past editorial/funder preferences, which may encode conservatism, status biases, or other distortions; deploying them can entrench those biases.
- Feedback loops: automated triage changes submission and funding behavior, potentially amplifying certain topics or methods.
- Concentration of influence: institutions or firms controlling such models could shape research agendas at scale.
- Over-reliance and automation bias: high-performing models should augment, not replace, human judgment; require human oversight and appeals.
- Need for transparency and auditing: training data provenance, fairness audits, and ongoing evaluation across subfields are essential.
Implementation suggestions
- Use as decision support (prioritization/triage), not definitive gatekeeping.
- Maintain diverse training sources and regular retraining to mitigate path dependence.
- Combine model signals with structured human review, and monitor downstream effects on research diversity and innovation.
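
One way to read the suggestions above is as a routing rule that reorders human attention but never rejects anything automatically. The Python sketch below is hypothetical: `Pitch`, `triage`, the thresholds, the random audit sample, and the convention that tier 1 is the top tier are all assumptions introduced for illustration, not part of the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Pitch:
    pitch_id: str
    predicted_tier: int  # assumed convention: 1 = top tier, 4 = bottom tier
    confidence: float    # model confidence in the predicted tier


def triage(pitches, priority_tier=1, min_confidence=0.8, audit_rate=0.1, seed=0):
    """Route pitches to human review queues; no pitch is auto-rejected.

    Confident top-tier predictions go to a priority queue, everything else
    to the standard queue. A random audit sample is drawn from all pitches
    so reviewers can monitor where the model and humans disagree.
    """
    rng = random.Random(seed)
    queues = {"priority": [], "standard": [], "audit": []}
    for p in pitches:
        if p.predicted_tier <= priority_tier and p.confidence >= min_confidence:
            queues["priority"].append(p.pitch_id)
        else:
            queues["standard"].append(p.pitch_id)
        if rng.random() < audit_rate:
            queues["audit"].append(p.pitch_id)
    return queues


# Hypothetical example inputs.
pitches = [Pitch("a", 1, 0.9), Pitch("b", 1, 0.6), Pitch("c", 3, 0.95)]
queues = triage(pitches, audit_rate=0.0)
```

Every pitch still reaches a human queue, so the model only changes ordering; the audit queue is one simple way to collect the data needed for the monitoring and fairness checks mentioned above.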

Summary takeaway: scientific "taste" is extractable from institutional publication records via fine-tuning; this offers powerful, scalable tools for filtering and prioritizing research—but deployment in economics (and other fields) requires careful design to avoid reinforcing historical biases and unintended incentive effects.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides strong predictive evidence: held-out benchmark evaluation, comparisons to multiple frontier models and to human editor panels, calibration analyses, and cross-field transfer (management → economics). However, the claim that models recover a meaningful notion of "scientific taste" depends on using historical accept/reject decisions as a proxy for quality; publication decisions can encode biases, noise, and strategic/institutional constraints. The results are compelling for predictive performance but do not establish causal claims about downstream economic effects or the normative quality of the recovered signal.

Methods Rigor: medium. The study uses reasonable and standard evaluation methods (held-out test set, human baselines, multiple model baselines, calibration metrics, transfer learning tests). Rigor is strengthened by cross-field replication and human-panel comparisons. Missing or unclear elements that lower the rating include limited detail on dataset curation (which journals, time periods, selection of pitches), potential label and selection biases, possible leakage between training and test data, and absence of randomized or prospective validation in live editorial workflows.

Sample: Held-out benchmark of management research "pitches" labeled into four quality tiers; training data consist of years of journal publication records (accept/reject outcomes) used to fine-tune LMs; human baseline comprised panels of journal editors/editorial board members whose majority-vote accuracy was measured on the same benchmark; frontier baseline comprised eleven state-of-the-art language models evaluated zero-shot; additional experiments fine-tuned models on economics publication records for cross-field testing.

Themes: human_ai_collab, org_design

Generalizability:
- Depends on which journals and time periods were included; may not generalize beyond the sampled journals (e.g., top-tier vs. broad-field outlets).
- Publication decisions are an imperfect proxy for intrinsic research quality and may encode conservatism, status, or methodological preferences that vary across disciplines and cultures.
- Performance may vary for different pitch formats, lengths, languages, or for full papers versus short pitches.
- Models trained on historical records may not generalize to future policy or institutional changes (temporal drift).
- Results focus on management and economics; other fields with different review norms may show different outcomes.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Fine-tuning language models on historical journal publication decisions recovers an evaluative "scientific taste" that frontier (zero-shot) models and expert editor panels cannot reliably reproduce.	Decision Quality	positive	high	Ability to predict publication-worthiness as measured by tier prediction accuracy on a held-out four-tier benchmark of research pitches	fine-tuning on historical publication decisions improves prediction of publication-worthiness (tier prediction accuracy) 0.18
Eleven frontier language models (proprietary and open) averaged 31% accuracy on a held-out four-tier benchmark of management research pitches (chance ≈25%); this is only marginally above chance.	Decision Quality	negative	high	Accuracy on the four-tier management research-pitch benchmark	average accuracy of eleven frontier LMs = 31% (chance ≈25%) 0.18
Panels of journal editors and editorial board members reach 42% accuracy by majority vote on the same four-tier benchmark.	Decision Quality	positive	high	Majority-vote accuracy on the four-tier management research-pitch benchmark	journal-editor panels (majority vote) accuracy = 42% 0.18
Fine-tuned models trained on publication records each outperform every frontier model and the expert panel; the best single model achieves 59% accuracy on the benchmark.	Decision Quality	positive	high	Accuracy on the four-tier management research-pitch benchmark	fine-tuned models outperform frontier models and human panel; best single model = 59% accuracy 0.18
The learned evaluative signal transfers to untrained tasks such as pairwise comparisons and one-sentence summaries.	Decision Quality	positive	medium	Performance (transfer) on pairwise-comparison and one-sentence-summary evaluative tasks	learned evaluative signal transfers to pairwise comparisons and one-sentence summaries (positive transfer reported) 0.11
Models show well-calibrated confidence: their highest-confidence predictions are 100% accurate.	Decision Quality	positive	medium	Calibration accuracy (accuracy among highest-confidence predictions)	models' highest-confidence predictions are 100% accurate 0.11
The mechanism generalizes to another field: models trained on economics publication records reach ~70% accuracy on a similar benchmark.	Decision Quality	positive	high	Accuracy on an economics research-pitch benchmark	economics-domain fine-tuned models reach ≈70% accuracy on similar benchmark 0.18
Frontier language models and human editors do not reliably reproduce the evaluative signal contained in institutional publication records.	Decision Quality	negative	high	Relative prediction accuracy on held-out benchmark(s) of research-pitch quality	frontier LMs and human editors do not match fine-tuned models' predictive performance on held-out benchmarks 0.18
Historical institutional publication records encode an extractable evaluative signal ("taste") that can be learned by models and used for scalable triage, screening, and curation of submissions.	Decision Quality	positive	medium	Extractability of evaluative signal as operationalized by improved predictive accuracy after fine-tuning on publication records	historical publication records encode extractable evaluative signal as evidenced by improved predictive accuracy after fine-tuning 0.11