HomeGuard focuses on contextual risk in household tasks: the instruction itself may be harmless, but the surrounding scene makes execution unsafe.
HomeSafe organizes household contextual hazards into fine-grained risk categories, paired with matched safe counterparts for grounded embodied safety learning. Below are examples from HomeSafe, each consisting of (1) a user instruction, (2) a scene image, and (3) fine-grained grounding annotations. Target areas are indicated by green bounding boxes, while constraint areas are marked in red.
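To make the annotation layout concrete, here is a minimal sketch of what a single HomeSafe-style record might look like. The field names, box format (pixel [x1, y1, x2, y2]), and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical record layout: instruction + scene image + grounding boxes.
record = {
    "instruction": "Place the kettle on the counter.",
    "image": "scenes/kitchen_0042.jpg",
    "label": "unsafe",  # the matched safe counterpart shares the instruction
    "grounding": {
        "target": [[412, 198, 536, 320]],      # green boxes: interaction targets
        "constraint": [[380, 150, 620, 260]],  # red boxes: hazard/constraint areas
    },
    "risk_category": "thermal_hazard",
}

def boxes_overlap(a, b):
    """Check whether two [x1, y1, x2, y2] boxes intersect."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

# A target box overlapping a constraint box flags a potential conflict.
conflict = boxes_overlap(record["grounding"]["target"][0],
                         record["grounding"]["constraint"][0])
```

A simple overlap check like this is one way such annotations could be consumed by a downstream planner as spatial constraints.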
HomeGuard-grounded planning. The plan is generated by Qwen3-VL-8B-Thinking (planner), the execution trajectory is produced by RoboBrain2.5-8B (controller), and the safety_tip is provided by HomeGuard-8B (safeguard).
Trajectory-level intervention. HomeGuard produces reusable grounded cues that guide the controller toward a safer execution path under the same instruction.
Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks in which benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate: rule-based methods lack scalability in object-dense scenes, whereas model-based approaches relying on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception, which sequentially anchors attention to interaction targets and their relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy utilizing Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model HomeGuard significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation.
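The CG-CoT decomposition described above can be sketched as a short perception-then-judgment loop. The `vlm` object and its `query()` method below are hypothetical stand-ins for a grounded VLM interface, not HomeGuard's actual API; the prompts are illustrative only.

```python
# Minimal sketch of the CG-CoT flow: active perception first, then
# semantic judgment grounded in the collected visual evidence.
def cg_cot_assess(vlm, image, instruction):
    # Step 1: active perception — anchor attention on the interaction target.
    target_box = vlm.query(image, f"Locate the object targeted by: {instruction}")
    # Step 2: expand attention to the relevant spatial neighborhood.
    context = vlm.query(image, f"List hazards near region {target_box}")
    # Step 3: semantic judgment based on the gathered visual evidence.
    verdict = vlm.query(
        image,
        f"Given target {target_box} and nearby hazards {context}, "
        f"is executing '{instruction}' safe? Answer safe/unsafe with a reason.",
    )
    return {"target": target_box, "constraints": context, "verdict": verdict}
```

The point of the decomposition is that the final judgment is conditioned on explicit grounded evidence rather than a single unfocused pass over the scene.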
HomeSafe uses edit-based synthesis to preserve realistic household context while constructing counterfactual safe and unsafe scenario pairs.
Two-stage optimization. HomeGuard first learns grounded hazard understanding through supervised fine-tuning, then sharpens intermediate perception and semantic judgment with reinforcement fine-tuning guided by process rewards.
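A process reward of the kind the second stage describes could combine an intermediate grounding term (box IoU against the annotation) with an outcome term on the final verdict. The IoU formulation and the 0.5 weighting below are illustrative assumptions, not HomeGuard's actual reward design.

```python
# Sketch of a process reward: score intermediate grounding quality (IoU)
# alongside the final safe/unsafe judgment.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def process_reward(pred_box, gt_box, pred_label, gt_label, w_ground=0.5):
    grounding = iou(pred_box, gt_box)                 # process term
    outcome = 1.0 if pred_label == gt_label else 0.0  # outcome term
    return w_ground * grounding + (1 - w_ground) * outcome
```

Rewarding the intermediate boxes, not just the final label, is what pressures the policy toward precise grounding rather than verdicts reached for the wrong reasons.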
HomeGuard improves both in-domain risk grounding and cross-benchmark generalization while reducing oversafety in downstream embodied decision-making.
@article{lu2026homeguard,
title={HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task},
author={Lu, Xiaoya and Zhou, Yijin and Chen, Zeren and Wang, Ruocheng and Sima, Bingrui and Zhou, Enshen and Sheng, Lu and Liu, Dongrui and Shao, Jing},
journal={arXiv preprint arXiv:2603.14367},
year={2026}
}