A new MessyKitchens dataset and a Multi-Object Decoder improve single-image multi-object 3D reconstruction: the dataset offers ground truth with better registration accuracy and less object interpenetration than prior datasets, and the decoder improves reconstruction quality on three benchmarks. By lowering perception cost and improving physical plausibility, the open release could accelerate robotics, AR/VR and related automation deployments.

MessyKitchens: Contact-rich object-level 3D scene reconstruction

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev · March 17, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

MessyKitchens, a high-fidelity dataset of cluttered kitchen scenes, together with a Multi-Object Decoder built on SAM 3D, significantly improves single-image multi-object 3D reconstruction accuracy and reduces inter-object penetration across three benchmarks.

Monocular 3D scene reconstruction has recently seen significant progress. Powered by modern neural architectures and large-scale data, recent methods achieve high performance in depth estimation from a single image. Meanwhile, reconstructing and decomposing common scenes into individual 3D objects remains a hard challenge due to the large variety of objects, frequent occlusions and complex object relations. Notably, beyond shape and pose estimation of individual objects, applications in robotics and animation require physically-plausible scene reconstruction where objects obey physical principles of non-penetration and realistic contacts. In this work we advance object-level scene reconstruction along two directions. First, we introduce MessyKitchens, a new dataset with real-world scenes featuring cluttered environments and providing high-fidelity object-level ground truth in terms of 3D object shapes, poses and accurate object contacts. Second, we build on the recent SAM 3D approach for single-object reconstruction and extend it with a Multi-Object Decoder (MOD) for joint object-level scene reconstruction. To validate our contributions, we demonstrate that MessyKitchens significantly improves on previous datasets in registration accuracy and inter-object penetration. We also compare our multi-object reconstruction approach on three datasets and demonstrate consistent and significant improvements of MOD over the state of the art. Our new benchmark, code and pre-trained models will become publicly available on our project website: https://messykitchens.github.io/.

Summary

Main Finding

The paper introduces MessyKitchens, a high-fidelity real-world dataset of cluttered kitchen scenes with object-level 3D ground truth (shapes, poses, and accurate object contacts), and proposes a Multi-Object Decoder (MOD) built on SAM 3D to perform joint object-level monocular 3D scene reconstruction. MessyKitchens improves on prior datasets in registration accuracy and inter-object penetration, and MOD delivers consistent improvements over state-of-the-art methods across three datasets.

Key Points

Contribution 1: MessyKitchens dataset — real-world cluttered kitchen scenes with object-level 3D shapes, poses, and explicit contact information between objects (improves realism and physical correctness relative to prior datasets).
Contribution 2: Multi-Object Decoder (MOD) — extends a recent single-object monocular 3D method (SAM 3D) to jointly reconstruct multiple objects in a scene, targeting physically plausible non-penetrating object configurations and realistic contacts.
Empirical result: MessyKitchens outperforms prior datasets on registration accuracy and inter-object penetration metrics; MOD consistently improves multi-object reconstruction quality across three datasets compared to state-of-the-art baselines.
Release: benchmark, code, and pre-trained models will be publicly available.
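
This summary does not say how registration accuracy is computed. As a rough illustration only, one common way to quantify pose alignment is the root-mean-square distance between object points placed by an estimated pose and by the ground-truth pose. The sketch below assumes that definition; the function names, the sample points, and the 3 cm offset are all hypothetical and are not taken from the paper.

```python
import math

def apply_pose(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation as nested lists, 3-vector translation) to a point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

def rms_registration_error(points, est_pose, gt_pose):
    """RMS distance between points placed by the estimated and the ground-truth pose.

    Assumed metric for illustration; the paper may define registration accuracy differently.
    """
    total = 0.0
    for p in points:
        a = apply_pose(est_pose[0], est_pose[1], p)
        b = apply_pose(gt_pose[0], gt_pose[1], p)
        total += sum((x - y) ** 2 for x, y in zip(a, b))
    return math.sqrt(total / len(points))

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical object points (metres) and poses.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
gt_pose = (IDENTITY, (0.0, 0.0, 0.0))
est_pose = (IDENTITY, (0.0, 0.0, 0.03))  # pure 3 cm offset along z

error = rms_registration_error(points, est_pose, gt_pose)
print(f"RMS registration error: {error:.3f} m")
```

With a pure translation offset, every point moves by the same amount, so the error equals the offset length; rotation errors would instead penalize points far from the rotation axis more heavily.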

Data & Methods

Dataset (MessyKitchens):
- Real-world indoor kitchen scenes with clutter.
- High-fidelity object-level ground truth: 3D object shapes, object poses, and accurate contact/interaction annotations.
- Designed to stress occlusion, variety of objects, and complex object relations.
Method:
- Builds on SAM 3D (a recent single-object monocular reconstruction approach).
- Introduces a Multi-Object Decoder (MOD) to jointly decode multiple object shapes and poses from a single image, aiming to reduce penetrations and produce physically consistent scenes.
- Evaluation compares MOD and MessyKitchens against prior datasets and methods on registration accuracy and inter-object penetration, across three datasets/benchmarks.
Outcomes:
- Quantitative improvements in registration accuracy and reduced inter-object penetration.
- Demonstrated generalization gains of the multi-object approach on multiple datasets.
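
The summary does not specify the penetration metric used in the paper. To make the idea of inter-object penetration concrete, here is a minimal sketch that assumes each object is approximated by an axis-aligned bounding box and sums the pairwise overlap volume. Real evaluations would work on meshes or signed distance fields; the boxes and names below are hypothetical.

```python
from itertools import combinations

def overlap_volume(box_a, box_b):
    """Intersection volume of two axis-aligned boxes, each given as (min_xyz, max_xyz)."""
    volume = 1.0
    for axis in range(3):
        lo = max(box_a[0][axis], box_b[0][axis])
        hi = min(box_a[1][axis], box_b[1][axis])
        if hi <= lo:
            # Separated or merely touching on this axis: contact without penetration.
            return 0.0
        volume *= hi - lo
    return volume

def total_penetration(boxes):
    """Sum of pairwise overlap volumes over all object pairs in a scene."""
    return sum(overlap_volume(a, b) for a, b in combinations(boxes, 2))

# Hypothetical scene: the cup box intrudes into the plate box, the bowl is far away.
plate = ((0.0, 0.0, 0.0), (2.0, 2.0, 1.0))
cup = ((1.0, 1.0, 0.5), (3.0, 3.0, 2.0))
bowl = ((5.0, 5.0, 0.0), (6.0, 6.0, 1.0))

scene_penetration = total_penetration([plate, cup, bowl])
print(f"Total penetration volume: {scene_penetration}")
```

A physically plausible reconstruction would drive this value toward zero while still allowing objects to touch, which is why boxes that only share a face contribute nothing.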

Implications for AI Economics

Productivity and automation:
- Better monocular multi-object 3D reconstruction lowers perception costs for robots and embodied agents (fewer sensors, less calibration), potentially accelerating deployment in logistics, household service robots, inspection, and manipulation tasks.
- Improved physically plausible reconstructions reduce failure modes (e.g., collisions) in simulation-to-real pipelines, lowering development time and operational risk.
Market and competition dynamics:
- High-quality public datasets and pre-trained models (open release) can lower entry barriers for startups and academic groups, intensifying competition and speeding innovation in robotics, AR/VR, and 3D content markets.
- Firms that integrate these advances can capture downstream value in automation services and content creation; incumbents with proprietary simulation/data assets may see competitive pressure.
Data and compute value:
- Realistic, object-level ground truth raises the value of dataset curation and annotation skills; demand for labeled 3D scene data (and tools to generate it) will increase.
- Improved monocular methods may reduce sensor-hardware costs but shift value to model training and compute resources (pre-training, finetuning) and to datasets capturing realistic interactions.
Labor and tasks:
- Applications that rely on robust scene understanding (household assistance, warehouse picking) could substitute for some manual labor tasks, shifting employment toward robot maintenance, supervision, and dataset/model engineering.
- New roles likely grow around dataset collection, annotation, and system integration.
Safety, regulation, and externalities:
- Physically-plausible reconstructions reduce unsafe behaviors in deployed agents, which can lower liability and regulatory friction, but wider home-scene sensing raises privacy concerns and potential regulatory scrutiny.
- Open benchmarks accelerate research but may also enable dual-use applications; firms and policymakers will need to weigh benefits against misuse risks.
Economic research opportunities:
- Study of how open high-quality 3D datasets change firm entry, R&D investment allocation, and the diffusion rate of embodied AI capabilities.
- Measurement of downstream productivity gains from improved perception across sectors (warehousing, domestic robotics, AR/VR content production).

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides clear empirical evidence that the MessyKitchens dataset and the Multi-Object Decoder (MOD) improve reconstruction metrics (registration accuracy and inter-object penetration) across three benchmarks using quantitative comparisons to prior datasets and single-object baselines; however, evidence is limited to technical performance on vision benchmarks and does not establish downstream economic or field deployment impacts.
Methods Rigor: high. Uses real-world, high-fidelity object-level ground truth (shapes, poses, explicit contact annotations), evaluates on three datasets/benchmarks with quantitative metrics, and extends a state-of-the-art backbone (SAM 3D) with a multi-object decoder to target known failure modes (penetration); results are consistently reported and public release of code/models/benchmark increases reproducibility.
Sample: Real-world cluttered indoor kitchen scenes with high-fidelity object-level 3D ground truth: per-object 3D shapes, object poses, and explicit contact/interaction annotations designed to stress occlusion and complex object relations; exact dataset size/number of scenes and object category coverage not provided in the summary.
Themes: productivity, innovation
Generalizability:
- Dataset limited to indoor kitchen scenes; may not generalize to outdoor, industrial, or other domestic environments.
- Single-image monocular reconstruction setting may not transfer to multi-view, depth-sensor, or active-perception systems without adaptation.
- Object category and appearance coverage may be narrow (everyday kitchen objects) and may not include tools, articulated objects, or deformables.
- Performance demonstrated on vision benchmarks; real-world robotics deployment (manipulation, collision avoidance) requires integration, control, and domain-adaptation tests.
- Improvements rely on backbone (SAM 3D) and benchmark conditions; gains may differ with other architectures or in significantly different lighting/occlusion regimes.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).	Other	null_result	high	dataset contents: object 3D shapes, object poses, object contact/interaction annotations	0.18
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic/physically-rich than prior datasets).	Other	null_result	high	dataset characteristics: levels of occlusion, object variety, and annotated object relations (qualitative/design claim)	0.18
The paper introduces a Multi-Object Decoder (MOD) that extends SAM 3D to jointly reconstruct multiple objects from a single image, targeting physically plausible, non-penetrating object configurations and realistic contacts.	Other	positive	high	methodological capability: joint multi-object monocular 3D reconstruction, objective of reducing inter-object penetration	0.18
MOD (built on SAM 3D) produces fewer inter-object penetrations and more physically plausible object configurations than single-object monocular methods.	Other	positive	medium	inter-object penetration (penetration depth/volume or similar metric indicating object intersections)	0.11
The MessyKitchens dataset and MOD together yield materially better registration accuracy than prior datasets and single-object methods.	Other	positive	medium	registration accuracy (pose alignment / object registration error metrics)	0.11
MOD consistently improves multi-object reconstruction quality across three datasets/benchmarks compared to state-of-the-art baselines.	Other	positive	medium	multi-object reconstruction quality (aggregate metrics used in paper across three benchmarks)	n=3 0.11
The dataset and MOD produce far less inter-object penetration than prior datasets and single-object methods, with consistent improvements demonstrated across three benchmarks.	Other	positive	medium	inter-object penetration metrics (e.g., penetration depth/volume, collision counts)	n=3 0.11
The paper reports quantitative improvements (registration accuracy and reduced inter-object penetration) and demonstrates generalization gains of the multi-object approach on multiple datasets.	Other	positive	medium	registration accuracy; inter-object penetration; cross-dataset generalization performance	n=3 0.11
The authors will publicly release the benchmark, code, and pre-trained models.	Adoption Rate	null_result	medium	availability of benchmark, code, and pre-trained models (public release)	0.11
Better monocular multi-object 3D reconstruction can lower perception costs for robots and embodied agents (fewer sensors, less calibration) and accelerate deployment in logistics, household service robots, inspection, and manipulation tasks.	Firm Productivity	positive	speculative	perception cost and deployment barriers for robotic/embodied systems (economic/operational outcomes, not empirically measured in paper)	0.02
Open release of a high-quality 3D dataset and pre-trained models will lower entry barriers and intensify competition in robotics, AR/VR, and 3D content markets.	Market Structure	positive	speculative	market entry barriers and competitive dynamics (economic outcomes, speculative)	0.02
Physically-plausible reconstructions reduce unsafe behaviors in deployed agents (e.g., collisions) and lower simulation-to-real failure modes.	AI Safety and Ethics	positive	speculative	failure modes in simulation-to-real transfer and safety (collisions/failures); not empirically measured in the paper summary	0.02