
Jointly training production and transport agents boosts scheduling performance in many simulated factories, but offers little benefit when severe transport or processing bottlenecks make one task dominant. In those bottlenecked settings, cheaper modular training or rule-based approaches can be a practical alternative.

An Analysis of the Coordination Gap between Joint and Modular Learning for Job Shop Scheduling with Transportation Resources

Moritz Link, Jonathan Hoss, Noah Klarmann · April 27, 2026

arXiv · descriptive · medium evidence · relevance 7/10

Joint training of production and transport agents via multi-agent reinforcement learning typically outperforms modular training and dispatching rules, but its advantage fades in bottlenecked environments where one scheduling task (transport or processing) dominates.

Efficient job-shop scheduling with transportation resources is critical for high-performance manufacturing. With the rise of "decentralized factories", multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks. Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary. Joint training denotes the simultaneous training of job and automatic guided vehicle scheduling agents, whereas modular training involves independently training each agent followed by post-hoc integration. In this study, we systematically investigate the conditions under which joint training is essential for optimal performance in the job-shop scheduling problem with transportation resources. Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap -- the performance difference between these two training modalities. In our evaluation, the joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training. However, the coordination gap advantage diminishes in bottleneck environments, particularly under severe transport and processing constraints. These findings indicate that modular training represents a viable alternative in environments where a single scheduling task dominates. Overall, our work provides practical guidance for selecting between training modalities based on environmental conditions, enabling decision-makers to optimize reinforcement learning-based scheduling performance.

Summary

Main Finding

Joint (simultaneous) multi-agent reinforcement learning of the job and AGV schedulers can outperform the best modular combinations (an independently trained agent paired with a dispatching rule) on the job-shop scheduling problem with transportation (JSSPT). The size of this advantage, the "coordination gap", depends strongly on the environment: it is largest in intermediate settings where neither transportation nor processing strictly dominates, and it shrinks, often substantially, in bottlenecked environments with severe transport or processing constraints. Modular training is therefore often a viable, lower-cost alternative when a single scheduling task dominates.

Key Points

Problem: job-shop scheduling with transportation resources (JSSPT), where the goal is to minimize makespan while jointly scheduling machine operations and AGV transport tasks.
Two training modalities compared:
- Joint training: both job-scheduler and AGV-scheduler learned concurrently (MAPPO).
- Modular training: train one RL agent while the other decision is provided by a dispatching rule (DR); then combine trained agents post-hoc.
Coordination gap: performance difference between joint and modular/DR-based solvers; quantified via Relative Percentage Increase (RPI) and win rate (WR).
Environment sensitivity:
- Resource scarcity (ρ = k/n, AGVs per job) strongly affects need for joint learning.
- Temporal-dominance index (τ) captures whether processing or transport dominates schedule time; coordination value is non-linear in τ.
Practical outcome: joint training gives the greatest benefit when transport and processing durations are balanced and AGV resources are neither extremely scarce nor abundant; when a single task (transport or processing) dominates, modular approaches often match joint performance.
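The coordination gap described above can be made concrete with a small sketch. It assumes the common definition of RPI as the percentage by which one solver's makespan exceeds a reference makespan, and it counts a win whenever the joint solver is strictly better on an instance. The paper's exact definitions (for example, how ties are handled) may differ. The function names and sample makespans are hypothetical.

```python
from statistics import mean

def rpi(makespan, reference):
    """Relative percentage increase of `makespan` over `reference` (assumed definition)."""
    return 100.0 * (makespan - reference) / reference

def coordination_gap(joint, modular):
    """Mean RPI of the modular solver relative to the joint solver, per instance.

    Positive values mean the joint solver produced shorter schedules on average.
    """
    return mean(rpi(m, j) for j, m in zip(joint, modular))

def win_rate(joint, modular):
    """Fraction of instances where the joint makespan is strictly smaller."""
    wins = sum(1 for j, m in zip(joint, modular) if j < m)
    return wins / len(joint)

# Hypothetical makespans on four instances (not taken from the paper).
joint_cmax = [100, 200, 150, 400]
modular_cmax = [110, 200, 165, 380]

gap = coordination_gap(joint_cmax, modular_cmax)
wr = win_rate(joint_cmax, modular_cmax)
```

Reporting both numbers is useful because they can disagree: a solver may win on most instances by small margins while losing badly on a few, which a mean RPI alone would hide.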

Data & Methods

MDP formulation:
- Agents: job scheduler and AGV scheduler (N=2 for multi-agent; single-agent variants for modular training).
- State: disjunctive graph for operations and machines (G = {V, E}) for job scheduler; per-AGV feature vector (EPUT, EST, ERT, TTS, EAT, EFT) for AGV scheduler. Features normalized/scaled.
- Actions: select an unscheduled operation and assign an AGV (joint action = operation × AGV).
- Reward: sparse end-of-episode reward equal to −Cmax (negative makespan), scaled by a lower bound on the makespan and a factor s = 5.
Model architectures:
- Job scheduler: Graph Isomorphism Network (GIN) encoder (L = 2, hidden dim 64), MLP decoder producing logits over operations.
- AGV scheduler: three-layer MLP (first two layers dim 16), outputs logits over AGVs.
- Critic: GIN encoder + MLP on global graph embedding for value estimation.
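As a rough illustration of what the GIN encoder computes, the sketch below implements the standard GIN update, in which each node adds its own features (weighted by 1 + eps) to the sum of its neighbors' features and passes the result through an MLP. It is not the authors' implementation: the MLP is reduced to a single dense layer with ReLU, the feature dimension is 2 instead of 64, and the toy graph, its undirected edges, and all names are assumptions made for readability.

```python
def linear_relu(x, weights, bias):
    """One dense layer followed by ReLU; `weights` is a list of output rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def gin_layer(features, neighbors, weights, bias, eps=0.0):
    """Single GIN update: h_v' = MLP((1 + eps) * h_v + sum of neighbor features).

    features:  {node: [float, ...]}
    neighbors: {node: [neighbor nodes]}
    The MLP is reduced to one dense+ReLU layer to keep the sketch short.
    """
    updated = {}
    for v, h_v in features.items():
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors.get(v, []):
            agg = [a + x for a, x in zip(agg, features[u])]
        updated[v] = linear_relu(agg, weights, bias)
    return updated

# Toy graph: three operations of one job chained by precedence (hypothetical).
features = {"o1": [1.0, 0.0], "o2": [0.0, 1.0], "o3": [1.0, 1.0]}
neighbors = {"o1": ["o2"], "o2": ["o1", "o3"], "o3": ["o2"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gin_layer(features, neighbors, identity, [0.0, 0.0])
h2 = gin_layer(h1, neighbors, identity, [0.0, 0.0])  # L = 2 stacked layers
```

With two stacked layers, each operation's embedding reflects nodes up to two hops away in the graph, which is what lets the decoder score an operation in the context of its neighbors.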
Baselines:
- 10 operation dispatching rules (SPT, LPT, MWR, LWR, FDD/MWR, MOR, LOR, SMPT, random, FCFS).
- 4 AGV rules (random, SPUT, SCTA, SCPT).
- Modular solvers built by combining DRs with the trained agent in both roles (job or AGV).
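The baseline construction above can be sketched as a pair of interchangeable policies, one per decision. The example shows two of the listed operation rules (SPT and LPT) and the random AGV rule. The data layout and function names are hypothetical, and in a modular solver a trained agent would simply replace one of the two callables.

```python
import random

def spt(candidates):
    """Shortest Processing Time: pick the eligible operation with the smallest duration."""
    return min(candidates, key=lambda op: op["duration"])

def lpt(candidates):
    """Longest Processing Time: pick the eligible operation with the largest duration."""
    return max(candidates, key=lambda op: op["duration"])

def random_agv(agvs, rng):
    """Random AGV rule: pick any vehicle uniformly."""
    return rng.choice(agvs)

def modular_step(candidates, agvs, job_policy, agv_policy):
    """One decision of a modular solver: the two choices are made independently."""
    return job_policy(candidates), agv_policy(agvs)

# Hypothetical set of currently eligible operations, one per job.
candidates = [
    {"job": 0, "duration": 42},
    {"job": 1, "duration": 7},
    {"job": 2, "duration": 95},
]
rng = random.Random(0)
op, agv = modular_step(candidates, ["agv0", "agv1", "agv2"], spt,
                       lambda agvs: random_agv(agvs, rng))
```

Because neither policy sees the other's choice, any coordination between operation selection and vehicle assignment has to come from the shared environment state, which is the limitation that joint training is meant to address.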
Experimental instance generation:
- Operation processing and transport times sampled DU(1,100).
- Instance sizes used for training: 6×6, 10×10, 15×10, 20×5, 30×10 (jobs × machines).
- Number of AGVs k sampled from DU(3, n).
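A minimal sketch of how such instances might be sampled is shown below, assuming DU(a, b) denotes a discrete uniform distribution over the integers a through b inclusive. The routing model (every job visits every machine once in random order) and the machine-to-machine transport matrix are assumptions, since this summary does not specify them. All names are hypothetical.

```python
import random

def sample_instance(n_jobs, n_machines, rng):
    """Sample one synthetic instance with DU(1, 100) times and k ~ DU(3, n_jobs)."""
    # Each job visits every machine once in a random order (assumed routing).
    routes = [rng.sample(range(n_machines), n_machines) for _ in range(n_jobs)]
    proc = [[rng.randint(1, 100) for _ in range(n_machines)] for _ in range(n_jobs)]
    # Transport time between every ordered pair of distinct machines (assumed layout model).
    transport = [[0 if a == b else rng.randint(1, 100) for b in range(n_machines)]
                 for a in range(n_machines)]
    n_agvs = rng.randint(3, n_jobs)  # randint includes both endpoints
    return {"routes": routes, "proc": proc, "transport": transport, "n_agvs": n_agvs}

instance = sample_instance(6, 6, random.Random(42))
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which matters when comparing solvers on the same instance set.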
Evaluation metrics:
- Relative Percentage Increase (RPI) vs baselines and Win Rate (WR) over instance sets.
- Sensitivity analysis across resource scarcity ρ (k/n) and temporal-dominance τ* (derived from normalized average processing vs transport times).
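The two sensitivity axes can be illustrated as follows. Resource scarcity follows the definition given in this summary (ρ = k/n). The exact formula for τ* is not reproduced here, so `temporal_dominance_proxy` is an illustrative stand-in, not the paper's index: it only captures the idea of comparing average processing time with average transport time.

```python
from statistics import mean

def resource_scarcity(n_agvs, n_jobs):
    """rho = k / n: AGVs per job."""
    return n_agvs / n_jobs

def temporal_dominance_proxy(proc_times, transport_times):
    """Illustrative stand-in for tau*, NOT the paper's formula.

    Returns the share of mean processing time in the sum of mean processing
    and mean transport time: near 1 when processing dominates, near 0 when
    transport dominates, 0.5 when balanced.
    """
    p, t = mean(proc_times), mean(transport_times)
    return p / (p + t)

# Hypothetical settings used only to show the range of the two indices.
rho = resource_scarcity(3, 6)
balanced = temporal_dominance_proxy([40, 60], [50, 50])
processing_heavy = temporal_dominance_proxy([90, 90], [10, 10])
```

Under this proxy, the balanced setting sits in the middle of the range, where the summary reports the largest benefit from joint training, while the processing-heavy setting approaches one of the bottlenecked extremes.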
Training details / hyperparameters:
- PPO / MAPPO backbone, total frames 4×10^6, Adam lr 3e-4 with linear decay.
- γ = 0.999, GAE λ = 1.0, clipping ϵ = 0.2, entropy coeff 0.01, critic coeff 0.5.
- Rollout design: batches with 4 completed episodes; sparse terminal rewards necessitate high γ.
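The link between sparse terminal rewards and a high discount factor can be checked with a short calculation. The sketch assumes an episode of 300 decisions (for example, one per operation on a 30×10 instance), which is an assumption about episode length rather than a figure from the paper. It also shows that with GAE λ = 1.0 the advantage reduces to the discounted return minus the value estimate.

```python
def discounted_terminal_weight(gamma, steps_to_end):
    """Weight gamma**k that a terminal-only reward carries k steps before the end."""
    return gamma ** steps_to_end

def monte_carlo_returns(rewards, gamma):
    """Discounted returns computed backwards; with GAE lambda = 1.0 the advantage
    equals this return minus the value estimate at each step."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

# Assumed episode length: 300 decisions (e.g. one per operation on a 30x10 instance).
steps = 300
w_high = discounted_terminal_weight(0.999, steps)
w_low = discounted_terminal_weight(0.99, steps)

# Sparse episode: zero reward everywhere except a terminal reward of -1.0.
returns = monte_carlo_returns([0.0, 0.0, -1.0], gamma=0.999)
```

At γ = 0.999 the first decision still sees roughly 74% of the terminal reward, whereas at γ = 0.99 it would see only about 5%, leaving early decisions with almost no learning signal.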
Main experimental finding (qualitative):
- Joint solver (learned GNN + MLP jointly) often yields better makespan than modular combinations and DR baselines.
- The joint advantage shrinks in bottlenecked regimes (extreme ρ or extreme τ*), making modular/DR-based solutions competitive.

Implications for AI Economics

Cost–benefit trade-off for training modality:
- Joint training requires more coordination in development, centralized training resources, and potentially more expensive retraining when components change — but can yield measurable performance gains in settings where coordination matters.
- Modular training lowers integration and replacement costs, fits decentralized vendor stacks, and can achieve near-joint performance in bottleneck-dominated environments.
Procurement and investment decisions:
- Firms should assess their operational regime (AGV availability and whether transport or processing dominates). If the factory operates in the balanced/intermediate regime identified by the paper, investing in joint MARL development is more likely to pay off.
- In environments with clear bottlenecks (severe transport scarcity or processing-dominant workflows), cheaper modular approaches or rule-based AGV dispatchers may be economically preferable.
Market and vendor implications:
- Modular solutions support vendor specialization and easier component replacement, fostering competitive modular markets for schedulers and physical assets.
- Joint MARL solutions, while potentially higher-performing in some regimes, create stronger lock-in and raise the switching cost for vendor or component replacement.
Policy and operational planning:
- When expanding AGV fleets or changing machine layouts, evaluate impact on ρ and τ* — small changes in resource balance can alter the value of investing in coordinated RL systems.
Future economic opportunities:
- Hybrid approaches (pretraining modular agents, then fine-tuning them jointly) or transfer learning across factories could capture most of the coordination gains while limiting retraining costs and improving robustness.
Limitations to consider for economic interpretation:
- Results are from simulated instances with specific time-distributions and instance sizes; real-world heterogeneity, stochastic failures, and overheads (communication, latency) may change the coordination value.
- Quantitative payback (i.e., how much money you save per unit of makespan improvement) must be computed case-by-case to inform investment decisions.


Assessment

Paper Type: descriptive
Evidence Strength: medium. The study uses controlled, systematic simulation experiments that allow clear comparison of training modalities and isolation of key environment dimensions (transport vs processing constraints), providing internally valid evidence about algorithmic performance; however, results are limited to simulated settings, particular task formulations, agent architectures, and reward specifications and lack field validation or real-world operational data.
Methods Rigor: medium. The paper conducts a rigorous sensitivity analysis across core environmental factors and compares multiple baselines (modular training and dispatching rules), which indicates solid experimental design; nevertheless, methodological shortcomings include reliance on simulation benchmarks without reported statistical uncertainty measures, potential sensitivity to implementation details (architecture, hyperparameters, training regime), and unclear coverage of scale/stochasticity that affect external validity.
Sample: Synthetic job-shop scheduling environments with explicit transportation resources (automatic guided vehicles); experiments span a range of resource-scarcity and temporal-dominance settings to create bottleneck and non-bottleneck scenarios; evaluated agents include jointly trained multi-agent RL schedulers, modularly trained agents integrated post-hoc, and standard dispatching-rule baselines (exact instance counts, seed variability, and hyperparameter ranges not specified in the summary).
Themes: productivity, adoption
Identification: Controlled simulation experiments and systematic sensitivity analysis: the authors vary environmental parameters (resource scarcity and temporal dominance) in synthetic job-shop scheduling environments to compare performance of joint multi-agent RL training versus modular training and baseline dispatching rules, measuring the 'coordination gap' across scenario sweeps.
Generalizability:
- Results are from simulated job-shop benchmarks and may not transfer to real factories with unmodeled idiosyncrasies (machine failures, human operators, maintenance schedules).
- Findings depend on specific agent architectures, reward functions, and training hyperparameters which may change relative performance.
- Scalability to larger or more heterogeneous production systems is unclear.
- Economic costs of training, deployment, and integration (compute, downtime, retraining) are not evaluated.
- Stochasticity and adversarial disruptions common in real operations may alter coordination requirements and benefits of joint training.

Claims (7)

Claim	Category	Direction	Confidence	Outcome	Details
Joint training can produce superior performance compared to the best-performing combinations of dispatching rules and modular training.	Task Completion Time	positive	high	scheduling performance (e.g., makespan / throughput / overall schedule quality)	0.18
The coordination gap advantage (between joint and modular training) diminishes in bottleneck environments, particularly under severe transport and processing constraints.	Task Allocation	negative	high	coordination gap (performance difference between training modalities)	0.18
Modular training represents a viable alternative in environments where a single scheduling task dominates.	Task Allocation	positive	high	relative scheduling performance (modular vs joint training)	0.18
Through a rigorous sensitivity analysis of resource scarcity and temporal dominance, we quantify the coordination gap.	Task Allocation	null_result	high	coordination gap	0.09
Multi-agent reinforcement learning has emerged as a promising approach for the combined scheduling of production and transportation tasks in decentralized factories.	Organizational Efficiency	positive	high	potential improvement in scheduling/operational efficiency	0.09
Prior work has largely focused on developing novel cooperative architectures while overlooking the question of when joint training is necessary.	Research Productivity	negative	high	research focus (coverage of training-modality necessity in prior literature)	0.09
The paper's findings provide practical guidance for selecting between joint and modular training modalities based on environmental conditions to optimize reinforcement learning-based scheduling performance.	Organizational Efficiency	positive	medium	guidance effectiveness for selecting training modality to optimize performance	0.02