Fine-tuned language models can mimic journals' editorial "taste", predicting publication outcomes far better than zero-shot AIs and even panels of editors; models trained on management records reached up to 59% accuracy and those trained on economics records about 70%, though they risk entrenching historical biases if deployed without safeguards.
Artificial intelligence matches or exceeds human performance on tasks with verifiable answers, from protein folding to Olympiad mathematics. Yet the capacity that most governs scientific advance is not reasoning but taste: the ability to judge which untested ideas deserve pursuit, exercised daily by editors and funders but never successfully articulated, taught, or automated. Here we show that fine-tuning language models on journal publication decisions recovers evaluative judgment inaccessible to both frontier models and human expertise. Using a held-out benchmark of research pitches in management spanning four quality tiers, we find that eleven frontier models, spanning major proprietary and open architectures, barely exceed chance, averaging 31% accuracy. Panels of journal editors and editorial board members reach 42% by majority vote. Fine-tuned models trained on years of publication records each surpass every frontier model and expert panel, with the best single model achieving 59%. These models exhibit calibrated confidence, reaching 100% accuracy on their highest-confidence predictions, and transfer this evaluative signal to untrained pairwise comparisons and one-sentence summaries. The mechanism generalizes: models trained on economics publication records achieve 70% accuracy. Scientific taste was not missing from AI's reach; it was deposited in the institutional record, waiting to be extracted. These results provide a scalable mechanism to triage the expanding volume of scientific production across disciplines where quality resists formal verification.
Summary
Main Finding
Fine-tuning language models on historical journal publication decisions recovers an evaluative "scientific taste" that both frontier (zero-shot) models and expert editor panels cannot reliably reproduce. Fine-tuned models trained on publication records outperform state-of-the-art models and human editors at predicting which research pitches will be published, generalize across related evaluative tasks, and can be trained in other fields (economics) with even higher accuracy.
Key Points
- "Taste" here means the ability to judge which untested research ideas deserve pursuit, a judgment exercised by editors and funders but not previously articulated or automated.
- Eleven frontier language models (proprietary and open) averaged 31% accuracy on a held-out four-tier benchmark of management research pitches (chance ≈25%); this is only marginally above chance.
- Panels of journal editors and editorial board members reach 42% accuracy by majority vote on the same benchmark.
- Fine-tuned models trained on years of journal publication decisions (i.e., institutional accept/reject records) each outperform every frontier model and the expert panel; the best single model achieves 59% accuracy.
- Models show well-calibrated confidence: their highest-confidence predictions are 100% accurate.
- The learned evaluative signal transfers to untrained tasks such as pairwise comparisons and one-sentence summaries.
- The mechanism generalizes to another field: models trained on economics publication records reach 70% accuracy on a similar benchmark.
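To put the management-benchmark figures on a common footing, the short sketch below restates each reported accuracy as percentage points above the four-tier chance level. The accuracies come from the points above; the dictionary keys and the helper name `points_above_chance` are illustrative choices, not taken from the paper. The economics result is left out because the summary does not say how many tiers that benchmark uses.

```python
# Accuracies reported above for the four-tier management benchmark.
reported_accuracy = {
    "frontier models (mean of 11)": 0.31,
    "editor panels (majority vote)": 0.42,
    "best fine-tuned model": 0.59,
}

NUM_TIERS = 4
CHANCE = 1 / NUM_TIERS  # expected accuracy of uniform random guessing over four tiers


def points_above_chance(accuracy, chance_level=CHANCE):
    """Gap between an accuracy and the chance level, in percentage points."""
    return round((accuracy - chance_level) * 100, 1)


lift = {name: points_above_chance(acc) for name, acc in reported_accuracy.items()}

for name, pts in lift.items():
    print(f"{name}: {pts:+.1f} points above chance")
```

Read this way, the frontier models sit 6 points above chance, the editor panels 17 points, and the best fine-tuned model 34 points.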
Data & Methods
- Benchmark: held-out set of research pitches in management labeled into four quality tiers (used to evaluate predictive performance).
- Human baseline: panels of journal editors and editorial board members whose majority-vote accuracy was measured on the benchmark.
- Frontier baseline: eleven state-of-the-art language models across major architectures evaluated zero-shot (or as provided) on the benchmark; averaged 31% accuracy.
- Training signal: historical publication records (journal decisions/accept-reject outcomes) used to fine-tune language models to predict editorial decisions.
- Evaluation metrics: tier prediction accuracy on held-out benchmark; calibration of confidence (e.g., accuracy among highest-confidence predictions); transfer performance on related tasks (pairwise comparisons, one-sentence summaries).
- Cross-field test: analogous fine-tuning on economics publication records to test generality (yielding ~70% accuracy).
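The three evaluation quantities named above (tier accuracy, majority-vote accuracy for a panel, and accuracy among the highest-confidence predictions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names, the tie-breaking rule, the `fraction` cutoff, and all toy data are hypothetical.

```python
from collections import Counter


def tier_accuracy(predicted, actual):
    """Fraction of pitches whose predicted tier equals the true tier."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_vote(votes):
    """Most common tier among a panel's votes.

    Assumption: ties go to the tier that appears first in the vote list
    (Counter.most_common keeps first-encountered order for equal counts).
    """
    return Counter(votes).most_common(1)[0][0]


def top_confidence_accuracy(predicted, actual, confidence, fraction=0.1):
    """Accuracy restricted to the most confident `fraction` of predictions."""
    k = max(1, int(len(predicted) * fraction))
    ranked = sorted(range(len(predicted)),
                    key=lambda i: confidence[i], reverse=True)[:k]
    return sum(predicted[i] == actual[i] for i in ranked) / k


# Toy data, invented for illustration only (tiers encoded 1-4).
actual = [1, 2, 3, 4, 1, 2, 3, 4]
model_pred = [1, 2, 3, 1, 1, 3, 3, 4]
model_conf = [0.9, 0.8, 0.6, 0.3, 0.95, 0.4, 0.7, 0.5]
panel_votes = [[1, 1, 2], [2, 3, 3], [3, 3, 4], [4, 4, 1],
               [2, 1, 2], [2, 2, 2], [1, 3, 3], [4, 2, 4]]

panel_pred = [majority_vote(v) for v in panel_votes]

print("model accuracy:", tier_accuracy(model_pred, actual))
print("panel accuracy:", tier_accuracy(panel_pred, actual))
print("model accuracy, top 25% by confidence:",
      top_confidence_accuracy(model_pred, actual, model_conf, fraction=0.25))
```

A model is well calibrated in the sense used here when the last number rises as `fraction` shrinks, that is, when the predictions it is most sure of are also the ones most often correct.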
Implications for AI Economics
- Practical applications
  - Scalable triage for growing volumes of submissions (journals, conferences, working papers, grant proposals): prioritize promising work for human review.
  - Screening and routing: assign reviewers/editors or highlight high-value ideas for funding agencies, labs, and research programs.
  - Discovery and curation: help identify under-the-radar high-potential projects and reduce noise in information flows.
- Opportunities for economic research
  - Study how institutional records encode preferences and how extracting them changes incentives and the evolution of research topics.
  - Analyze labor effects: potential shifts in demand for editorial/review labor and changes to career dynamics for researchers.
  - Model-based policy experiments: simulate how automated triage would affect diversity, innovation, and disciplinary concentration.
- Risks and cautions
  - Historical bias capture: models learn past editorial/funder preferences, which may encode conservatism, status biases, or other distortions; deploying them can entrench those biases.
  - Feedback loops: automated triage changes submission and funding behavior, potentially amplifying certain topics or methods.
  - Concentration of influence: institutions or firms controlling such models could shape research agendas at scale.
  - Over-reliance and automation bias: high-performing models should augment, not replace, human judgment; require human oversight and appeals.
  - Need for transparency and auditing: training data provenance, fairness audits, and ongoing evaluation across subfields are essential.
- Implementation suggestions
  - Use as decision support (prioritization/triage), not definitive gatekeeping.
  - Maintain diverse training sources and regular retraining to mitigate path dependence.
  - Combine model signals with structured human review, and monitor downstream effects on research diversity and innovation.
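One way to read the "decision support, not gatekeeping" suggestion is as a routing rule that only reorders the human review queue and never rejects a submission on its own. The sketch below is hypothetical throughout: the `Pitch` fields, the tier encoding, the queue names, and the 0.8 threshold are assumptions made for illustration, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class Pitch:
    pitch_id: str
    predicted_tier: int  # hypothetical encoding: 1 = top tier ... 4 = lowest tier
    confidence: float    # model probability assigned to predicted_tier


def triage(pitches, fast_track_confidence=0.8):
    """Split pitches into review queues; every pitch still reaches a human."""
    queues = {"priority_review": [], "standard_review": []}
    for p in pitches:
        # Only confident top-tier predictions are moved up the queue.
        if p.predicted_tier == 1 and p.confidence >= fast_track_confidence:
            queues["priority_review"].append(p.pitch_id)
        else:
            # Low-tier or uncertain predictions are not rejected, only not prioritized.
            queues["standard_review"].append(p.pitch_id)
    return queues


example = [
    Pitch("a", predicted_tier=1, confidence=0.90),
    Pitch("b", predicted_tier=1, confidence=0.50),
    Pitch("c", predicted_tier=4, confidence=0.99),
]
print(triage(example))
```

Keeping a confident low-tier prediction in the standard queue, rather than dropping it, is the design choice that separates prioritization from gatekeeping and leaves room for the human oversight and appeals mentioned under the risks.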
Summary takeaway: scientific "taste" is extractable from institutional publication records via fine-tuning; this offers powerful, scalable tools for filtering and prioritizing research—but deployment in economics (and other fields) requires careful design to avoid reinforcing historical biases and unintended incentive effects.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Fine-tuning language models on historical journal publication decisions recovers an evaluative "scientific taste" that frontier (zero-shot) models and expert editor panels cannot reliably reproduce. [Decision Quality] | positive | high | Ability to predict publication-worthiness as measured by tier prediction accuracy on a held-out four-tier benchmark of research pitches | fine-tuning on historical publication decisions improves prediction of publication-worthiness (tier prediction accuracy); 0.18 |
| Eleven frontier language models (proprietary and open) averaged 31% accuracy on a held-out four-tier benchmark of management research pitches (chance ≈25%); this is only marginally above chance. [Decision Quality] | negative | high | Accuracy on the four-tier management research-pitch benchmark | average accuracy of eleven frontier LMs = 31% (chance ≈25%); 0.18 |
| Panels of journal editors and editorial board members reach 42% accuracy by majority vote on the same four-tier benchmark. [Decision Quality] | positive | high | Majority-vote accuracy on the four-tier management research-pitch benchmark | journal-editor panels (majority vote) accuracy = 42%; 0.18 |
| Fine-tuned models trained on publication records each outperform every frontier model and the expert panel; the best single model achieves 59% accuracy on the benchmark. [Decision Quality] | positive | high | Accuracy on the four-tier management research-pitch benchmark | fine-tuned models outperform frontier models and human panel; best single model = 59% accuracy; 0.18 |
| The learned evaluative signal transfers to untrained tasks such as pairwise comparisons and one-sentence summaries. [Decision Quality] | positive | medium | Performance (transfer) on pairwise-comparison and one-sentence-summary evaluative tasks | learned evaluative signal transfers to pairwise comparisons and one-sentence summaries (positive transfer reported); 0.11 |
| Models show well-calibrated confidence: their highest-confidence predictions are 100% accurate. [Decision Quality] | positive | medium | Calibration accuracy (accuracy among highest-confidence predictions) | models' highest-confidence predictions are 100% accurate; 0.11 |
| The mechanism generalizes to another field: models trained on economics publication records reach ~70% accuracy on a similar benchmark. [Decision Quality] | positive | high | Accuracy on an economics research-pitch benchmark | economics-domain fine-tuned models reach ≈70% accuracy on similar benchmark; 0.18 |
| Frontier language models and human editors do not reliably reproduce the evaluative signal contained in institutional publication records. [Decision Quality] | negative | high | Relative prediction accuracy on held-out benchmark(s) of research-pitch quality | frontier LMs and human editors do not match fine-tuned models' predictive performance on held-out benchmarks; 0.18 |
| Historical institutional publication records encode an extractable evaluative signal ("taste") that can be learned by models and used for scalable triage, screening, and curation of submissions. [Decision Quality] | positive | medium | Extractability of evaluative signal as operationalized by improved predictive accuracy after fine-tuning on publication records | historical publication records encode extractable evaluative signal as evidenced by improved predictive accuracy after fine-tuning; 0.11 |