A modular agent architecture (CRAEA) improves simulated home-robot planning, memory-based question answering, and routing, making long-horizon natural-language household tasks more reliable in simulation; the gains are shown in simulated benchmarks and human ratings, but real-world robustness and deployment costs remain untested.

Context-Rich Adaptive Embodied Agents: Enhancing LLM-Powered Task Planning and Memory in Home Robotics

Yutian Gai, Haoyu Cen · March 05, 2026 · Preprints.org

OpenAlex · descriptive · medium evidence · 7/10 relevance

CRAEA, a modular agent combining semantic planning, multi-modal contextual memory, and adaptive routing, improves simulated home-robot performance on long-horizon instructions, multi-hop memory QA, and routing/coordination, with human raters preferring its outputs and ablations confirming each module's contribution.

The rapid evolution of Embodied AI and Large Language Models presents significant opportunities for home robotics, yet challenges persist in enabling robots to execute long-term, high-level natural language instructions. Current LLM-driven embodied agents often suffer from sub-optimal task planning, limited memory systems struggling with multi-hop queries, and inflexible agent routing mechanisms. To address these limitations, we propose the Context-Rich Adaptive Embodied Agent (CRAEA) framework, designed to significantly enhance task planning and memory-augmented question answering in household robots. CRAEA integrates core components: Semantic-Enhanced Task Planning (SETP), which enriches LLM-driven planning with object relationship graphs, hierarchical strategies, and implicit physical constraints; Multi-Modal Contextual Memory (MMCM), which stores comprehensive contextual memory units in a relational graph for sophisticated multi-hop reasoning and employs an advanced retrieval mechanism with temporal decay; and Adaptive Agent Routing and Coordination (AARC), featuring intent recognition with confidence evaluation, proactive clarification, and a planning feedback loop. Evaluated in an artificial home environment across complex tidying scenarios, CRAEA consistently demonstrates superior performance. Empirical results show that CRAEA achieves notable improvements in Task Planning Accuracy, Knowledge Base Response Total Validity, and Agent Routing Success Rate compared to baseline methods. A human evaluation further confirms enhanced coherence, naturalness, and user satisfaction, while an ablation study validates the critical contribution of each proposed module. CRAEA represents a significant step towards more intelligent, robust, and user-adaptive home robots.

Summary

Main Finding

CRAEA (Context-Rich Adaptive Embodied Agent) substantially improves home-robot performance on long-horizon, high-level natural language instructions by combining semantic task planning, multi-modal contextual memory, and adaptive routing/coordination. In simulated household tidying tasks, CRAEA outperforms baseline LLM-driven embodied agents on task planning accuracy, multi-hop memory question answering, and routing/coordination success; human evaluations report higher coherence, naturalness, and user satisfaction. An ablation study attributes performance gains to each proposed module.

Key Points

Problem addressed
- Existing LLM-driven embodied agents show: sub-optimal high-level planning, weak memory for multi-hop queries, and inflexible agent routing/coordination.
CRAEA architecture (three core modules)
- Semantic-Enhanced Task Planning (SETP)
  - Enriches LLM-generated plans with object-relationship graphs, hierarchical strategies (task decomposition), and implicit physical/affordance constraints.
- Multi-Modal Contextual Memory (MMCM)
  - Stores comprehensive contextual memory units (multi-modal: visual, linguistic, temporal) in a relational graph for multi-hop reasoning.
  - Advanced retrieval mechanism with temporal decay weighting to prioritize recent and relevant memories.
- Adaptive Agent Routing and Coordination (AARC)
  - Intent recognition with confidence evaluation, proactive clarification dialogues, and a planning feedback loop to refine plans during execution.
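The AARC behaviour described above (intent recognition with a confidence check, and clarification when confidence is low) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Decision` type, the `route` function, the agent names, and the `CLARIFY_THRESHOLD` value of 0.6 are all hypothetical, since the summary does not report how CRAEA scores intents or which threshold it uses.

```python
from dataclasses import dataclass

# Hypothetical threshold: the summary does not report the value CRAEA uses.
CLARIFY_THRESHOLD = 0.6


@dataclass
class Decision:
    action: str  # "route" or "clarify"
    target: str  # agent name when routing, question text when clarifying


def route(intent_scores: dict[str, float],
          threshold: float = CLARIFY_THRESHOLD) -> Decision:
    """Route to the top-scoring agent, or ask the user when confidence is low."""
    if not intent_scores:
        return Decision("clarify", "Could you rephrase the request?")
    # Rank candidate intents from most to least confident.
    ranked = sorted(intent_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_intent, best_score = ranked[0]
    if best_score < threshold:
        # Low confidence: offer the two most likely intents to the user.
        options = " or ".join(name for name, _ in ranked[:2])
        return Decision("clarify", f"Did you mean {options}?")
    return Decision("route", best_intent)
```

With scores such as `{"task_planning": 0.45, "memory_qa": 0.40}` the sketch asks for clarification instead of guessing, which is the behaviour the summary attributes to proactive clarification.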
Empirical outcomes
- Consistent improvements versus baselines on:
  - Task Planning Accuracy
  - Knowledge Base Response Total Validity (multi-hop QA from memory)
  - Agent Routing Success Rate
- Human evaluation indicates improved perceived coherence, naturalness, and satisfaction.
- Ablation experiments show each module contributes meaningfully; removing any degrades performance.
Limitations noted
- Evaluation performed in an artificial/simulated home environment; real-world transfer, robustness to noisy perception, and hardware constraints remain open.
- Resource, compute, privacy, and deployment costs not fully quantified.

Data & Methods

Environment and tasks
- Artificial home simulation with complex tidying scenarios requiring multi-step object manipulation and long-horizon instruction following.
Inputs and representations
- Multi-modal inputs: natural language instructions, visual observations, and temporal context.
- Relational graphs for object relationships and memory storage.
System design details
- SETP: LLM-generated plans are augmented with a semantic object graph and hierarchical decomposition constraints to ensure physical plausibility.
- MMCM: Memory units encode modality, timestamp, and relational links; retrieval uses similarity plus temporal decay to support multi-hop reasoning.
- AARC: Intent classifier outputs confidence scores; low-confidence triggers clarification; a feedback loop allows plan updates during execution.
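As an illustration of the MMCM retrieval idea above (similarity combined with temporal decay), the sketch below ranks memory units by cosine similarity multiplied by an exponential half-life decay. The multiplicative combination, the half-life form, the `MemoryUnit` fields, and the default `half_life_s` are assumptions made for this example; the summary only states that retrieval uses similarity plus temporal decay.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryUnit:
    text: str
    embedding: list[float]
    timestamp: float  # seconds since some reference point


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: list[float], memories: list[MemoryUnit], now: float,
             half_life_s: float = 3600.0, k: int = 2) -> list[MemoryUnit]:
    """Return the k memories with the highest decayed similarity score."""
    def score(m: MemoryUnit) -> float:
        age = max(0.0, now - m.timestamp)
        decay = 0.5 ** (age / half_life_s)  # weight halves every half_life_s
        return cosine(query, m.embedding) * decay
    return sorted(memories, key=score, reverse=True)[:k]
```

Under these assumptions, two equally similar memories are ordered by recency, while an unrelated memory scores zero however recent it is.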
Evaluation metrics
- Objective: Task Planning Accuracy, Knowledge Base Response Total Validity (for QA), Agent Routing Success Rate, task completion rates.
- Subjective: Human evaluations rating coherence, naturalness, and satisfaction.
- Ablation: Systematic removal of SETP, MMCM, and AARC to quantify each module’s contribution.
Experimental comparisons
- Baselines: existing LLM-driven embodied agent approaches lacking one or more CRAEA components (e.g., memoryless LLM controllers, static routing).
- Statistical analyses: improvements are reported across metrics, but specific effect sizes are not provided in the summary.
Reproducibility notes
- Full reproducibility requires access to simulation suite, model checkpoints, and hyperparameters (not included here).
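To make the relational-graph memory and multi-hop reasoning described in this section more concrete, here is a toy sketch of answering a question by chaining several stored relations. The `MemoryGraph` class, the relation names, and the restriction to one object per (subject, relation) pair are simplifications assumed for illustration; the summary does not describe CRAEA's actual graph schema or query procedure.

```python
from collections import defaultdict


class MemoryGraph:
    """Toy relational memory: subject -> relation -> object."""

    def __init__(self):
        self._edges = defaultdict(dict)

    def add(self, subject: str, relation: str, obj: str) -> None:
        # Simplification: a later fact overwrites an earlier one.
        self._edges[subject][relation] = obj

    def follow(self, start: str, relations: list[str]):
        """Follow a chain of relations from start.

        Returns the list of visited nodes, or None if any hop is missing.
        """
        path = [start]
        node = start
        for relation in relations:
            nxt = self._edges.get(node, {}).get(relation)
            if nxt is None:
                return None
            path.append(nxt)
            node = nxt
        return path


graph = MemoryGraph()
graph.add("keys", "inside", "bowl")
graph.add("bowl", "on", "shelf")
graph.add("shelf", "located_in", "hallway")

# "Which room are the keys in?" needs three hops: keys -> bowl -> shelf -> hallway.
answer = graph.follow("keys", ["inside", "on", "located_in"])
```

A single-hop lookup cannot answer the room question because no stored fact links the keys to a room directly; the answer only emerges by composing the three facts.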

Implications for AI Economics

Productivity and automation value
- Improved high-level instruction following and robust memory make home robots more practically useful for complex household tasks, raising their potential to substitute for routine human labor and increase household productivity.
Market demand and product differentiation
- Systems like CRAEA create differentiable product features (better planning, memory, conversational clarification) that firms can use to capture higher willingness-to-pay from consumers and expand service markets (e.g., eldercare, assisted living).
Investment and industry incentives
- Demonstrated gains from modular improvements (planning, memory, routing) suggest targeted R&D investments in memory-augmented models, multi-modal perception, and interactive control loops may yield high marginal returns.
Labor and distributional effects
- Greater autonomy in home robots could reduce demand for some in-home service roles, while increasing demand for higher-skill roles (robot maintenance, customization, AI oversight). Policies should consider retraining and safety nets.
Data, privacy, and value capture
- Rich contextual memories and continuous home interaction create valuable data streams. Firms owning memory and model infrastructure could capture substantial value, raising concerns about data governance, consent, and monetization.
Cost structure and deployment considerations
- Real-world deployment will incur hardware, compute, and maintenance costs; economic viability depends on balancing upfront investment against labor substitution and consumer willingness-to-pay. Energy and compute-intensive memory/LLM components may affect unit economics.
Regulatory and welfare considerations
- Proactive clarification and adaptive behavior reduce risks from mis-execution, improving consumer safety and trust, but regulators may require standards for memory retention, data minimization, and explainability.
Research and policy recommendations
- Conduct cost–benefit analyses comparing CRAEA-style systems against human-provided services across different tasks and populations.
- Study labor market impacts at regional scales and time horizons, including complementarities with existing care economies.
- Establish guidelines for data ownership, privacy-preserving memory designs, and transparency around decision loops that modify behavior over time.
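The cost and benefit trade-off mentioned above (upfront investment against labor substitution and operating costs) can be framed as a simple payback calculation. The function below is a hypothetical sketch: its name, its parameters, and the undiscounted payback formula are assumptions for illustration, and any numbers passed to it are placeholders, not estimates from the paper.

```python
import math


def payback_months(upfront_cost: float,
                   monthly_operating_cost: float,
                   monthly_value_of_labor_replaced: float) -> float:
    """Months until cumulative net savings cover the upfront cost.

    Returns math.inf when operating costs meet or exceed the labor value,
    because the investment is then never recovered.
    """
    net_monthly_saving = monthly_value_of_labor_replaced - monthly_operating_cost
    if net_monthly_saving <= 0:
        return math.inf
    return upfront_cost / net_monthly_saving
```

For example, placeholder inputs of 6000 upfront, 50 per month in operating costs, and 300 per month in replaced labor value give a payback of 24 months. A fuller analysis would add discounting, maintenance schedules, and compute and energy costs, which the summary notes are not quantified.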


Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides controlled simulation experiments, quantitative metrics (planning accuracy, routing success, memory QA), human evaluations, and ablation tests that support its technical claims. However, the evidence is limited to simulated environments and lacks real-world deployment tests, formal statistical detail, and quantified effect sizes; the economic implications are speculative.
Methods Rigor: medium. The design includes ablation studies, multi-modal inputs, human evaluation, and comparisons to reasonable baselines, indicating careful engineering and evaluation. However, the report lacks full reproducibility artifacts (code, checkpoints, hyperparameters), specific effect sizes and significance reporting in the summary, and real-world or noisy-perception robustness checks.
Sample: Experiments were run in an artificial home simulation with complex, long-horizon tidying tasks using multi-modal inputs (language instructions, visual observations, temporal context). CRAEA was compared against existing LLM-driven embodied agent baselines, human raters provided subjective scores, and ablation experiments removed the SETP, MMCM, and AARC modules to measure their contributions.
Themes: productivity, human_ai_collab
Generalizability
- Evaluations confined to simulated home environments, not physical robots or noisy sensors.
- Tasks limited to household tidying scenarios; results may not transfer to other domains or task types.
- Human evaluation details (sample size, demographics) and statistical power not provided in summary.
- Computational, hardware, and latency constraints of real-world deployment not tested.
- Robustness to perception errors, occlusions, and dynamic real-world interactions unverified.
- Scale-up and long-term operation (memory growth, catastrophic drift) not evaluated.
- Privacy, data governance, and consumer acceptance in real households are speculative.

Claims (14)

Claim	Category	Direction	Confidence	Outcome	Details
CRAEA substantially improves home-robot performance on long-horizon, high-level natural language instructions by combining semantic task planning, multi-modal contextual memory, and adaptive routing/coordination.	Output Quality	positive	medium	Overall home-robot performance on long-horizon, high-level NL instructions (aggregate of task planning accuracy, memory QA validity, routing success)	0.11
CRAEA outperforms baseline LLM-driven embodied agents on Task Planning Accuracy in simulated household tidying tasks.	Output Quality	positive	medium	Task Planning Accuracy	0.11
CRAEA yields higher Knowledge Base Response Total Validity (improved multi-hop question answering from memory) than baselines.	Output Quality	positive	medium	Knowledge Base Response Total Validity (multi-hop QA accuracy/validity)	0.11
CRAEA improves Agent Routing and Coordination success relative to baseline agents.	Output Quality	positive	medium	Agent Routing Success Rate	0.11
Human evaluators rate CRAEA higher on perceived coherence, naturalness, and user satisfaction compared to baselines.	Consumer Welfare	positive	medium	Human subjective ratings: coherence, naturalness, user satisfaction	0.11
An ablation study shows that removing any of the three core modules (SETP, MMCM, AARC) degrades CRAEA's performance; each module contributes meaningfully to overall gains.	Output Quality	positive	medium	Change in performance metrics (Task Planning Accuracy, KB Response Validity, Routing Success) under module ablation	0.11
The Semantic-Enhanced Task Planning (SETP) module enriches LLM-generated plans with object-relationship graphs, hierarchical task decomposition, and implicit physical/affordance constraints to improve plan plausibility.	Output Quality	positive	medium	Plan plausibility/validity and Task Planning Accuracy	0.11
The Multi-Modal Contextual Memory (MMCM) stores multi-modal (visual, linguistic, temporal) contextual memory units in a relational graph and uses an advanced retrieval mechanism with temporal decay weighting to support multi-hop reasoning.	Output Quality	positive	medium	Multi-hop question-answering validity (Knowledge Base Response Total Validity); memory retrieval relevance	0.11
The Adaptive Agent Routing and Coordination (AARC) module performs intent recognition with confidence scoring, triggers proactive clarification dialogues on low confidence, and provides a planning feedback loop to refine plans during execution.	Output Quality	positive	medium	Agent Routing Success Rate; frequency and effectiveness of clarifications; plan refinement success	0.11
Evaluation was performed in an artificial/simulated home environment; therefore real-world transfer, robustness to noisy perception, and hardware constraints remain open questions.	Other	negative	high	Generalizability/real-world transfer (qualitative limitation)	0.18
Resource, compute, privacy, and deployment costs associated with CRAEA were not fully quantified in the paper.	Other	negative	high	Quantification of resource/compute/privacy/deployment costs (absence of measurement)	0.18
Statistical analyses reported improvements across metrics, but specific effect sizes and detailed statistical results were not provided in the summary.	Research Productivity	null_result	high	Presence/absence of detailed statistical effect-size reporting	0.18
CRAEA-style systems could increase household productivity and substitute for some routine in-home human labor, altering demand for certain service roles and increasing demand for higher-skill roles (e.g., maintenance, AI oversight).	Job Displacement	mixed	speculative	Labor demand shifts (theoretical implication, not empirically measured in the study)	0.02
Rich contextual memories and continuous home interaction create valuable data streams that could enable firms to capture substantial value, raising concerns about data governance, consent, and monetization.	Firm Revenue	negative	speculative	Data generation and value-capture potential (qualitative implication)	0.02