HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task

1Shanghai AI Laboratory; 2Shanghai Jiao Tong University; 3Beihang University; 4Huazhong University of Science and Technology

From scene context to grounded safety decisions with HomeGuard
HomeGuard teaser

Existing model-based safeguards often fail not because VLMs lack reasoning ability, but because their perception is unfocused in cluttered scenes and they miss the objects that truly determine risk. HomeGuard addresses this with Context-Guided Chain-of-Thought (CG-CoT), which uses visual anchors to guide attention toward interaction targets and risk-critical context before semantic judgment. To make this capability practical for embodied deployment, we build the HomeSafe dataset and train HomeGuard with a two-stage pipeline.

Overview

HomeGuard focuses on contextual risk in household tasks: the instruction itself may be harmless, but the surrounding scene makes execution unsafe.

  • Implicit contextual risk: risk comes from object states and spatial relations, such as heating food when metal utensils are already inside a microwave.
  • Context-Guided Chain-of-Thought: the model separates active perception from semantic risk judgment, helping it focus on the right interaction targets and constraint regions.
  • Visual anchors for action: grounded boxes and safety tips are not just explanations; they can be reused by downstream planners and low-level trajectory modules.
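To make the last point concrete, here is a minimal, hypothetical sketch of how a downstream planner could consume HomeGuard-style grounded cues. The container fields, box format, and the rejection check are illustrative assumptions, not the system's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class SafeguardOutput:
    """Hypothetical container for HomeGuard-style grounded safety cues."""
    risk: bool
    target_box: tuple                 # (x1, y1, x2, y2) green interaction-target box
    constraint_boxes: list = field(default_factory=list)  # red risk-region boxes
    safety_tip: str = ""

def point_in_box(pt, box):
    """Check whether a planned waypoint (x, y) lies inside a box."""
    x, y = pt
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2

def violates_constraints(waypoint, out: SafeguardOutput) -> bool:
    """A downstream planner could reject waypoints that enter any red region."""
    return any(point_in_box(waypoint, b) for b in out.constraint_boxes)

out = SafeguardOutput(
    risk=True,
    target_box=(120, 80, 220, 180),
    constraint_boxes=[(200, 60, 320, 160)],
    safety_tip="Remove the metal fork before starting the microwave.",
)
print(violates_constraints((250, 100), out))  # waypoint lies inside the red region
```

Because the anchors are plain boxes rather than free-form text, the same output can condition both high-level replanning and low-level trajectory filtering.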

HomeSafe Dataset: 10K Contextual-Risk Scenarios

HomeSafe organizes household contextual hazards into fine-grained risk categories, each paired with a matched safe counterpart for grounded embodied safety learning. Each HomeSafe example consists of (1) a user instruction, (2) a scene image, and (3) fine-grained grounding annotations: target areas are indicated by green bounding boxes, while constraint areas are marked in red.
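For illustration, a single HomeSafe record might look like the sketch below. All field names and values are stand-ins chosen to mirror the description above (instruction, scene image, green target box, red constraint boxes, matched safe counterpart); they are not the dataset's actual schema.

```python
# One hypothetical HomeSafe record; keys and paths are illustrative.
record = {
    "instruction": "Heat up the leftovers in the microwave.",
    "image": "scenes/kitchen_0423.jpg",
    "risk_category": "implicit_contextual_risk",
    "grounding": {
        # green box: the interaction target of the instruction
        "target": {"box": [140, 90, 260, 210], "color": "green"},
        # red boxes: constraint regions that make execution unsafe
        "constraints": [
            {"box": [150, 100, 200, 150], "color": "red",
             "note": "metal utensil inside the microwave"},
        ],
    },
    # matched counterpart where the hazard has been edited out
    "safe_counterpart": "scenes/kitchen_0423_safe.jpg",
}
```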

HomeGuard: Guard Model Built for Embodied Safety

HomeGuard-grounded planning. The plan is generated by Qwen3-VL-8B-Thinking (planner), the execution trajectory is produced by RoboBrain2.5-8B (controller), and the safety_tip is provided by HomeGuard-8B (safeguard).

HomeGuard planning case

Trajectory-level intervention. HomeGuard produces reusable grounded cues that guide the controller toward a safer execution path under the same instruction.

HomeGuard safe trajectory case
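The planner / safeguard / controller hand-off described above can be sketched as follows. The three functions are stand-ins for the actual model calls (Qwen3-VL-8B-Thinking, HomeGuard-8B, RoboBrain2.5-8B); their signatures and the prompt-injection scheme are assumptions for illustration only.

```python
# Minimal sketch of the planner / safeguard / controller hand-off.
# Each function is a stand-in, not a real model API.

def plan(instruction, image):
    """Stand-in for the planner: produce a step list for the task."""
    return ["open microwave", "place bowl inside", "start microwave"]

def safeguard(instruction, image):
    """Stand-in for the safeguard: return a risk flag and a safety tip."""
    return {"risk": True,
            "safety_tip": "Remove the metal fork from the microwave first."}

def control(step, safety_tip=None):
    """Stand-in for the controller; the tip conditions execution."""
    prefix = f"[SAFETY: {safety_tip}] " if safety_tip else ""
    return prefix + f"execute: {step}"

instruction, image = "Heat up the leftovers.", "kitchen.jpg"
cues = safeguard(instruction, image)
tip = cues["safety_tip"] if cues["risk"] else None
trajectory = [control(step, tip) for step in plan(instruction, image)]
print(trajectory[0])
```

The key point is that the safeguard's output is reused verbatim by the controller under the same instruction, rather than merely vetoing the task.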

Abstract

Vision-Language Models (VLMs) empower embodied agents to execute complex instructions, yet they remain vulnerable to contextual safety risks, where benign commands become hazardous due to subtle environmental states. Existing safeguards often prove inadequate: rule-based methods lack scalability in object-dense scenes, while model-based approaches that rely on prompt engineering suffer from unfocused perception, resulting in missed risks or hallucinations. To address this, we propose an architecture-agnostic safeguard featuring Context-Guided Chain-of-Thought (CG-CoT). This mechanism decomposes risk assessment into active perception, which sequentially anchors attention to interaction targets and relevant spatial neighborhoods, followed by semantic judgment based on this visual evidence. We support this approach with a curated grounding dataset and a two-stage training strategy that uses Reinforcement Fine-Tuning (RFT) with process rewards to enforce precise intermediate grounding. Experiments demonstrate that our model, HomeGuard, significantly enhances safety, improving risk match rates by over 30% compared to base models while reducing oversafety. Beyond hazard detection, the generated visual anchors serve as actionable spatial constraints for downstream planners, facilitating explicit collision avoidance and safe trajectory generation.

HomeSafe Data Generation Pipeline

HomeSafe uses edit-based synthesis to preserve realistic household context while constructing counterfactual safe and unsafe scenario pairs.

HomeSafe data recipe
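Edit-based counterfactual pairing can be sketched abstractly as below: starting from one base scene, a single localized edit flips the safety label while the instruction stays fixed, so the pair differs only in the hazard. The function, field names, and the example edit are illustrative assumptions, not the actual generation code.

```python
# Hedged sketch of edit-based counterfactual pairing: one localized edit
# turns a safe scene into its unsafe counterpart under the same instruction.

def make_pair(base_scene, instruction, hazard_edit):
    """Return matched (safe, unsafe) examples differing by one scene edit."""
    safe = {"instruction": instruction,
            "scene": dict(base_scene), "label": "safe"}
    unsafe_scene = dict(base_scene)
    unsafe_scene.update(hazard_edit)          # e.g., inpaint a metal fork
    unsafe = {"instruction": instruction,
              "scene": unsafe_scene, "label": "unsafe"}
    return safe, unsafe

base = {"location": "kitchen", "microwave": "empty"}
safe_ex, unsafe_ex = make_pair(base, "Heat up the leftovers.",
                               {"microwave": "contains metal fork"})
print(unsafe_ex["scene"]["microwave"])
```

Keeping everything but the hazard fixed gives the model a clean contrastive signal: the same instruction must be judged safe in one scene and unsafe in its near-identical twin.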

Two-Stage Training

HomeGuard training framework

Two-stage optimization. HomeGuard first learns grounded hazard understanding through supervised fine-tuning, then sharpens intermediate perception and semantic judgment with reinforcement fine-tuning guided by process rewards.
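One plausible shape for a process reward in the RFT stage is sketched below: it scores the intermediate grounding (IoU of the predicted box against the annotation) alongside the final safe/unsafe verdict. The equal weighting and the specific blend are assumptions for illustration, not the paper's reward design.

```python
# Illustrative process-reward shape: grounding quality + verdict correctness.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda z: (z[2] - z[0]) * (z[3] - z[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def process_reward(pred_box, gt_box, pred_label, gt_label,
                   w_ground=0.5, w_judge=0.5):
    """Blend intermediate grounding reward with final-judgment reward."""
    return (w_ground * iou(pred_box, gt_box)
            + w_judge * float(pred_label == gt_label))

r = process_reward((0, 0, 10, 10), (0, 0, 10, 10), "unsafe", "unsafe")
print(r)  # perfect grounding and a correct verdict -> 1.0
```

Rewarding the intermediate box, not just the verdict, is what pushes the model to actually look at the risk-critical region rather than guessing the label.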

Benchmark Performance

HomeGuard improves both in-domain risk grounding and cross-benchmark generalization while reducing oversafety in downstream embodied decision-making.

BibTeX

@article{lu2026homeguard,
  title={HomeGuard: VLM-based Embodied Safeguard for Identifying Contextual Risk in Household Task},
  author={Lu, Xiaoya and Zhou, Yijin and Chen, Zeren and Wang, Ruocheng and Sima, Bingrui and Zhou, Enshen and Sheng, Lu and Liu, Dongrui and Shao, Jing},
  journal={arXiv preprint arXiv:2603.14367},
  year={2026}
}