HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

1Shanghai AI Laboratory; 2Shanghai Jiao Tong University; 3Beihang University; 4Huazhong University of Science and Technology

From scene context to grounded safety decisions with HomeGuard
HomeGuard teaser

Existing model-based safeguards often fail not because VLMs lack reasoning ability, but because their perception is unfocused in cluttered scenes and they miss the objects that truly determine risk. HomeGuard addresses this with Context-Guided Chain-of-Thought (CG-CoT), which uses visual anchors to guide attention toward interaction targets and risk-critical context before semantic judgment. To make this capability practical for embodied deployment, we build the HomeSafe dataset and train HomeGuard with a two-stage pipeline.

Overview

HomeGuard focuses on contextual risk in household tasks: the instruction itself may be harmless, but the surrounding scene makes execution unsafe.

  • Implicit contextual risk: risk comes from object states and spatial relations, such as heating food when metal utensils are already inside a microwave.
  • Context-Guided Chain-of-Thought: the model separates active perception from semantic risk judgment, helping it focus on the right interaction targets and constraint regions.
  • Visual anchors for action: grounded boxes and safety tips are not just explanations; they can be reused by downstream planners and low-level trajectory modules.
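To make the last point concrete, here is a minimal, hypothetical sketch of how a downstream planner could consume HomeGuard-style grounded cues. The container fields, box format, and the rejection check are illustrative assumptions, not the system's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class SafeguardOutput:
    """Hypothetical container for HomeGuard-style grounded safety cues."""
    risk: bool
    target_box: tuple                 # (x1, y1, x2, y2) green interaction-target box
    constraint_boxes: list = field(default_factory=list)  # red risk-region boxes
    safety_tip: str = ""

def point_in_box(pt, box):
    """Check whether a planned waypoint (x, y) lies inside a box."""
    x, y = pt
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def violates_constraints(waypoint, out: SafeguardOutput) -> bool:
    """A downstream planner could reject waypoints that enter any red region."""
    return any(point_in_box(waypoint, b) for b in out.constraint_boxes)

out = SafeguardOutput(
    risk=True,
    target_box=(120, 80, 220, 180),
    constraint_boxes=[(200, 60, 320, 160)],
    safety_tip="Remove the metal fork before starting the microwave.",
)
print(violates_constraints((250, 100), out))  # waypoint lies inside the red region
```

Because the anchors are plain boxes rather than free-form text, the same output can condition both high-level replanning and low-level trajectory filtering.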

HomeSafe Dataset: 10K Contextual-Risk Scenarios

HomeSafe organizes household contextual hazards into fine-grained risk categories, each paired with a matched safe counterpart for grounded embodied safety learning. Each HomeSafe example consists of (1) a user instruction, (2) a scene image, and (3) fine-grained grounding annotations: target areas are indicated by green bounding boxes, while constraint areas are marked in red.
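For illustration, a single HomeSafe record might look like the sketch below. All field names and values are stand-ins chosen to mirror the description above (instruction, scene image, green target box, red constraint boxes, matched safe counterpart); they are not the dataset's actual schema.

```python
# One hypothetical HomeSafe record; keys and paths are illustrative.
record = {
    "instruction": "Heat up the leftovers in the microwave.",
    "image": "scenes/kitchen_0423.jpg",
    "risk_category": "implicit_contextual_risk",
    "grounding": {
        # green box: the interaction target of the instruction
        "target": {"box": [140, 90, 260, 210], "color": "green"},
        # red boxes: constraint regions that make execution unsafe
        "constraints": [
            {"box": [150, 100, 200, 150], "color": "red",
             "note": "metal utensil inside the microwave"},
        ],
    },
    # matched counterpart where the hazard has been edited out
    "safe_counterpart": "scenes/kitchen_0423_safe.jpg",
}
```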

HomeGuard: Guard Model Built for Embodied Safety

HomeGuard-grounded planning. The plan is generated by Qwen3-VL-8B-Thinking (planner), the execution trajectory is produced by RoboBrain2.5-8B (controller), and the safety_tip is provided by HomeGuard-8B (safeguard).

HomeGuard planning case

Trajectory-level intervention. HomeGuard produces reusable grounded cues that guide the controller toward a safer execution path under the same instruction.

HomeGuard safe trajectory case
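The planner / safeguard / controller hand-off described above can be sketched as follows. The three functions are stand-ins for the actual model calls (Qwen3-VL-8B-Thinking, HomeGuard-8B, RoboBrain2.5-8B); their signatures and the prompt-injection scheme are assumptions for illustration only.

```python
# Minimal sketch of the planner / safeguard / controller hand-off.
# Each function is a stand-in, not a real model API.

def plan(instruction, image):
    """Stand-in for the planner: produce a step list for the task."""
    return ["open microwave", "place bowl inside", "start microwave"]

def safeguard(instruction, image):
    """Stand-in for the safeguard: return a risk flag and a safety tip."""
    return {"risk": True,
            "safety_tip": "Remove the metal fork from the microwave first."}

def control(step, safety_tip=None):
    """Stand-in for the controller; the tip conditions execution."""
    prefix = f"[SAFETY: {safety_tip}] " if safety_tip else ""
    return prefix + f"execute: {step}"

instruction, image = "Heat up the leftovers.", "kitchen.jpg"
cues = safeguard(instruction, image)
tip = cues["safety_tip"] if cues["risk"] else None
trajectory = [control(step, tip) for step in plan(instruction, image)]
print(trajectory[0])
```

The key point is that the safeguard's output is reused verbatim by the controller under the same instruction, rather than merely vetoing the task.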

Abstract

Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks, where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate: rule-based methods lack scalability in object-dense scenes, while model-based approaches that rely on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception, which sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy that uses Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model, HomeGuard, significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation.

HomeSafe Data Generation Pipeline

HomeSafe uses edit-based synthesis to preserve realistic household context while constructing counterfactual safe and unsafe scenario pairs.

HomeSafe data recipe
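Edit-based counterfactual pairing can be sketched abstractly as below: starting from one base scene, a single localized edit flips the safety label while the instruction stays fixed, so the pair differs only in the hazard. The function, field names, and the example edit are illustrative assumptions, not the actual generation code.

```python
# Hedged sketch of edit-based counterfactual pairing: one localized edit
# turns a safe scene into its unsafe counterpart under the same instruction.

def make_pair(base_scene, instruction, hazard_edit):
    """Return matched (safe, unsafe) examples differing by one scene edit."""
    safe = {"instruction": instruction,
            "scene": dict(base_scene), "label": "safe"}
    unsafe_scene = dict(base_scene)
    unsafe_scene.update(hazard_edit)          # e.g., inpaint a metal fork
    unsafe = {"instruction": instruction,
              "scene": unsafe_scene, "label": "unsafe"}
    return safe, unsafe

base = {"location": "kitchen", "microwave": "empty"}
safe_ex, unsafe_ex = make_pair(base, "Heat up the leftovers.",
                               {"microwave": "contains metal fork"})
print(unsafe_ex["scene"]["microwave"])
```

Keeping everything but the hazard fixed gives the model a clean contrastive signal: the same instruction must be judged safe in one scene and unsafe in its near-identical twin.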

Two-Stage Training

HomeGuard training framework

Two-stage optimization. HomeGuard first learns grounded hazard understanding through supervised fine-tuning, then sharpens intermediate perception and semantic judgment with reinforcement fine-tuning guided by process rewards.
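One plausible shape for a process reward in the RFT stage is sketched below: it scores the intermediate grounding (IoU of the predicted box against the annotation) alongside the final safe/unsafe verdict. The equal weighting and the specific blend are assumptions for illustration, not the paper's reward design.

```python
# Illustrative process-reward shape: grounding quality + verdict correctness.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def process_reward(pred_box, gt_box, pred_label, gt_label,
                   w_ground=0.5, w_judge=0.5):
    """Blend intermediate grounding reward with final-judgment reward."""
    return (w_ground * iou(pred_box, gt_box)
            + w_judge * float(pred_label == gt_label))

r = process_reward((0, 0, 10, 10), (0, 0, 10, 10), "unsafe", "unsafe")
print(r)  # perfect grounding and a correct verdict -> 1.0
```

Rewarding the intermediate box, not just the verdict, is what pushes the model to actually look at the risk-critical region rather than guessing the label.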

Benchmark Performance

HomeGuard improves both in-domain risk grounding and cross-benchmark generalization while reducing oversafety in downstream embodied decision-making.

BibTeX

@article{lu2026homeguard,
  title={HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task},
  author={Lu, Xiaoya and Zhou, Yijin and Chen, Zeren and Wang, Ruocheng and Sima, Bingrui and Zhou, Enshen and Sheng, Lu and Liu, Dongrui and Shao, Jing},
  journal={arXiv preprint arXiv:2603.14367},
  year={2026}
}