OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Summary

Main Finding

OR-Space is a new, public benchmark that evaluates LLM-based agents on full-lifecycle industrial optimization work inside persistent, multi-artifact workspaces. By moving beyond one-shot NL→model tests to three lifecycle modes—Build, Revise, Explain—OR-Space reveals reliability gaps and realistic failure modes (data grounding, cross-file consistency, preserving prior logic, solver-feedback interpretation, grounded explanations) that are hidden by conventional self-contained benchmarks. Experiments on 20 models (closed- and open-source) across 100 problem instances per mode show (1) agents often do better at targeted revisions than building models from scratch, and (2) explanation performance requires both solver grounding and evidence-faithful reasoning.

Key Points

Workspace formalization: OR-Space represents each instance as an executable workspace W = ⟨D, P, S, E, M⟩:
- D: documents (requirements, change requests).
- P: parameter artifacts (CSV/JSON).
- S: code artifacts (templates/legacy code).
- E: runtime environment (Docker sandbox).
- M: hidden evaluators.
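As a rough illustration of the tuple above, a workspace could be represented as a small record type. This is only a sketch: the class name `Workspace`, its field names, the `agent_visible` helper, and the example file names are all hypothetical and are not taken from the OR-Space release.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Workspace:
    """Hypothetical container mirroring W = <D, P, S, E, M>."""

    documents: tuple[Path, ...]   # D: requirements, change requests
    parameters: tuple[Path, ...]  # P: CSV/JSON parameter artifacts
    code: tuple[Path, ...]        # S: templates or legacy code (may be empty)
    environment: str              # E: e.g. a sandbox image tag (assumed to be a string)
    evaluators: tuple[Path, ...]  # M: hidden from the agent

    def agent_visible(self) -> tuple[Path, ...]:
        """Files an agent may read; the hidden evaluators M are excluded."""
        return self.documents + self.parameters + self.code


# Illustrative instance; every path below is made up.
example = Workspace(
    documents=(Path("docs/requirements.md"),),
    parameters=(Path("data/demand.csv"), Path("data/costs.json")),
    code=(),
    environment="sandbox-image",
    evaluators=(Path("hidden/evaluator.py"),),
)
```

Keeping M out of `agent_visible` reflects the summary's statement that evaluators are hidden; how the actual benchmark enforces this is not described here.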
Three lifecycle task modes:
- Build: construct a solver-ready model from business docs + parameter files + empty scaffold.
- Revise: modify an existing model/code to meet changed requirements while preserving unaffected logic (variants: Revise-code, Revise-model, Revise-all).
- Explain: produce grounded, evidence-linked explanations using solver outputs (duals, slacks, logs).
Benchmark scale and construction:
- Extends 100 IndustryOR base problems into 300 instances (100 per mode).
- Two-stage generation: clean mathematical spec → rewrite into realistic business artifacts (noisy language, inconsistent schemas). OR researchers review instances.
Evaluation protocol:
- Build/Revise are solver-scored: submission must run (pulp.LpProblem interface), produce Optimal status, and match oracle objective within relative error ε = 1e-2.
- Default solver: Gurobi (with cross-solver validation variants); runtime sandbox, 120s execution cap.
- Explain scored via a grounded rubric (Exact Coverage, Reasoning, Grounding, Answer Quality, Hallucination penalty) with an LLM judge for rubric items.
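The solver-scored pass criterion for Build/Revise can be sketched in a few lines. The summary does not say how the relative error is normalized, so the denominator (the oracle's absolute value, with a small floor to avoid division by zero), the inclusive `<=` comparison, and the function names are assumptions.

```python
def relative_error(objective: float, oracle: float, floor: float = 1e-9) -> float:
    """Relative gap to the oracle objective (normalization is an assumption)."""
    return abs(objective - oracle) / max(abs(oracle), floor)


def passes(executed: bool, status: str, objective, oracle: float,
           eps: float = 1e-2) -> bool:
    """A submission passes if it ran, reported Optimal, and matches the oracle."""
    if not executed or status != "Optimal" or objective is None:
        return False
    return relative_error(float(objective), oracle) <= eps


# An objective of 100.5 against an oracle of 100.0 is within 1%, so it passes.
within_tolerance = passes(True, "Optimal", 100.5, 100.0)
# An objective of 102.0 is 2% off, so it fails.
outside_tolerance = passes(True, "Optimal", 102.0, 100.0)
```

Under this reading, an infeasible or crashed submission fails regardless of any objective value it reports.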
Empirical findings (high-level):
- Many models achieve substantially higher pass rates on Revise than on Build (the Revise-minus-Build gap ΔR−B is often positive), indicating that targeted revision is easier for agents than constructing models from scattered artifacts.
- Top performers (examples): gemini-3.1-pro (Build 72%, Revise 81%, Explain 73/100), gpt-5.4 (Build 59%, Revise 79%, Explain 86.5/100).
- Failure modes logged: runtime errors, infeasible or suboptimal formulations, missing constraints, incorrect data grounding, incomplete revisions, and hallucinated explanations.
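The Revise-minus-Build gap ΔR−B can be computed directly from the pass rates quoted above for the two example models. The dictionary layout and function name are illustrative only.

```python
# Pass rates (percent) as reported in the summary for two example models.
reported = {
    "gemini-3.1-pro": {"build": 72, "revise": 81},
    "gpt-5.4": {"build": 59, "revise": 79},
}


def revise_build_gap(scores: dict) -> int:
    """Delta R-B in percentage points: Revise pass rate minus Build pass rate."""
    return scores["revise"] - scores["build"]


gaps = {name: revise_build_gap(s) for name, s in reported.items()}
# gaps == {"gemini-3.1-pro": 9, "gpt-5.4": 20}
```

Both gaps are positive, consistent with the finding that revision is easier than building from scratch for these models.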
Public release: code and dataset available (GitHub, Hugging Face).

Data & Methods

Source data: 100 problems from IndustryOR expanded into multi-artifact workspaces. Each instance includes:
- Natural-language documents with business requirements and revision notes.
- Parameter files (CSV/JSON) intentionally varied (missing values, inconsistent schemas).
- Optional code artifacts (legacy heuristics, partial models) for Revise tasks.
- Oracle algebraic model and executable reference (hidden) for evaluation.
Execution environment:
- Isolated Docker sandbox per agent with read/write limited to workspace subtree; no network access.
- Submissions execute in a fresh Python interpreter; solver attached at runtime; Gurobi 12.0.1 default.
- All models must expose modeling via pulp.LpProblem to decouple modeling quality from solver idiosyncrasies.
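The harness itself is not shown in this summary. As a minimal sketch of the "fresh interpreter under a time cap" idea, the hypothetical helper below runs a submission script in a new Python process with the 120-second cap mentioned earlier. It does not reproduce the Docker isolation, the network restrictions, or the solver attachment, and the return format is an assumption.

```python
import subprocess
import sys
from pathlib import Path


def run_submission(script: Path, workdir: Path, timeout_s: float = 120.0) -> dict:
    """Run `script` in a fresh Python interpreter and classify the outcome."""
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],  # new process, so no shared interpreter state
            cwd=str(workdir),               # run from inside the workspace directory
            capture_output=True,
            text=True,
            timeout=timeout_s,              # child is killed if the cap is exceeded
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "timeout", "stdout": "", "stderr": ""}
    return {
        "ok": proc.returncode == 0,
        "reason": "ok" if proc.returncode == 0 else "runtime_error",
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Distinguishing `timeout` from `runtime_error` in the result mirrors the kind of failure attribution the diagnostic logs are said to record, although the benchmark's actual categories may differ.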
Evaluation specifics:
- Build/Revise: Pass if execution succeeds, solver returns Optimal, and objective within relative tolerance ε = 0.01 vs oracle.
- Explain: five-dimension rubric (Exact Coverage 35, Reasoning 35, Grounding 20, Answer Quality 10, minus a Hallucination penalty of up to 12 points). A judge model scores grounded items after text normalization; exact_match items are evaluated programmatically.
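The Explain rubric weights above sum to 100 before the penalty. One plausible way to aggregate them is sketched below; the dimension keys, the validation rules, and the floor at zero are assumptions, since the summary does not state how the final score is clamped.

```python
# Maximum points per rubric dimension, as listed in the summary.
MAX_POINTS = {
    "exact_coverage": 35,
    "reasoning": 35,
    "grounding": 20,
    "answer_quality": 10,
}
MAX_HALLUCINATION_PENALTY = 12


def explain_score(points: dict, hallucination_penalty: float = 0.0) -> float:
    """Sum the four dimension scores and subtract the hallucination penalty."""
    total = 0.0
    for dim, cap in MAX_POINTS.items():
        value = points.get(dim, 0.0)
        if not 0 <= value <= cap:
            raise ValueError(f"{dim} must be between 0 and {cap}, got {value}")
        total += value
    if not 0 <= hallucination_penalty <= MAX_HALLUCINATION_PENALTY:
        raise ValueError("hallucination penalty out of range")
    # Flooring at zero is an assumption, not something the summary specifies.
    return max(0.0, total - hallucination_penalty)


# Illustrative (made-up) dimension scores: 30 + 30 + 15 + 8 = 83, minus 3.5.
sample = explain_score(
    {"exact_coverage": 30, "reasoning": 30, "grounding": 15, "answer_quality": 8},
    hallucination_penalty=3.5,
)
```

With full marks and no penalty this yields 100, which matches the "/100" scale used for the reported Explain scores.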
Experimental setup:
- 20 models were evaluated (a mix of closed- and open-source, with reasoning-enabled variants where available). API calls used a low temperature, and token caps were set for code generation.
- Diagnostic logs record failure types for attribution.

Implications for AI Economics

Realistic evaluation matters for economic decision-support agents:
- Economic modeling workflows (supply-chain optimization, capacity planning, network design, auction/mechanism calibration) mirror OR lifecycles—agents will need workspace grounding, schema alignment, iterative revision safety, and solver-grounded explanations to be useful in practice.
- Self-contained NL→model benchmarks can overestimate deployed capability; procurement, ROI estimates, and operational risk assessments should use lifecycle-style benchmarks (like OR-Space) to gauge readiness.
Value and risk assessment:
- Agents perform relatively well at revising existing models, suggesting near-term productivity gains from agent-assisted maintenance and incremental updates rather than fully automated end-to-end modeling.
- Persistent failure modes (data grounding, unnoticed constraint omissions, unfaithful explanations) create audit and liability risk. Firms should require model-execution checks, objective-based validation, and evidence-linked explanations prior to action.
Market and labor effects:
- Demand is likely to grow for LLMs and toolchains specialized for OR and economic modeling (fine-tuned models, plugin toolkits, integrated verifiers). Specialized models may command premium pricing.
- Automation will shift OR engineers’ roles toward supervision, specification/design, and validation. Productivity gains will be concentrated where legacy models exist and routine revisions dominate.
Research & product priorities to improve economic decision-making agents:
- Multi-artifact grounding and robust schema-extraction methods (for noisy enterprise data).
- Revision-safe model editing (mechanisms to preserve invariants, formal verification of unchanged logic).
- Tight solver-agent feedback loops (interpretation of infeasibility, sensitivity/duality analysis, Big-M diagnostics).
- Faithful, evidence-grounded explanations that link model code, data cells, and solver signals to business implications—critical for stakeholder trust and regulatory compliance.
Policy, governance, and auditability:
- Benchmarks like OR-Space provide concrete protocols for auditing agent behavior on objective-grounded tasks; regulators and auditors can use such benchmarks to require demonstrable, solver-verified correctness and provenance for AI-assisted economic decisions.
- Explanation rubrics tied to workspace evidence support traceability requirements for high-stakes economic applications.

Source & resources:
- Paper: OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents (Zhou et al., 2026).
- Code & data: GitHub and Hugging Face links provided by the authors.

Assessment

Paper Type: descriptive
Evidence Strength: n/a. This paper presents a benchmark and evaluation protocol rather than making causal claims about economic outcomes; it does not provide empirical identification of causal effects.
Methods Rigor: medium. The benchmark design is systematic, defining task modes (Build, Revise, Explain), executable multi-artifact workspaces, task-specific evaluators, and a quality-control pipeline, but the paper (as described) does not report extensive real-world validation, large-scale field deployments, or sensitivity analyses of evaluator reliability and domain coverage, which limits methodological rigor assessment.
Sample: A curated set of executable OR workspaces where each instance contains business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators across interdependent files; tasks span three modes (Build, Revise, Explain) and are drawn to represent industrial optimization workflows (specific dataset size, domains, and provenance not specified in the summary).
Themes: human_ai_collab, productivity
Generalizability:
- May rely on synthetic or researcher-curated problem instances rather than broad, real-world firm data.
- Domain coverage may be limited to particular OR problem classes and industries represented in the benchmark.
- Performance may depend on particular solvers, file formats, or artifact conventions used in the workspaces.
- Benchmarks of agent capability do not directly translate to measured productivity or economic impact in operational settings.
- Language, regulatory, and organizational heterogeneity across firms may limit external validity.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling.	Adoption Rate	positive	high	LLM agent adoption in OR workflows	0.18
Existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program.	Other	negative	high	scope/coverage of OR benchmarks	0.18
Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles.	Other	negative	high	realism of benchmark scenarios relative to industrial workflows	0.18
We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation.	Other	positive	high	capability of benchmarks to evaluate OR agents across lifecycle tasks	0.3
Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files.	Other	positive	high	complexity and composition of benchmark instances	0.3
OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts.	Other	positive	high	ability to construct solver-ready models	0.3
OR-Space defines a Revise task mode, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic.	Other	positive	high	ability to revise models while preserving prior logic	0.3
OR-Space defines an Explain task mode, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.	Other	positive	high	ability to generate grounded explanations using workspace evidence	0.3
By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation.	Organizational Efficiency	positive	high	reliability of LLM agents in performing optimization work (beyond text generation)	0.18
We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.	Other	positive	high	capability to study reliability, failure modes, and readiness of LLM agents	0.3

OR-Space exposes whether LLM agents can do real optimization work by replacing one-shot puzzles with persistent, multi-file workspaces and lifecycle tasks; agents must now build solver-ready models, revise them under changing requirements, and produce grounded explanations using dispersed evidence.