Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems.
We introduce IS-Bench to comprehensively evaluate an agent's interactive safety, in particular its ability to handle complex hazards such as dynamic risks, through process-oriented evaluation. IS-Bench comprises 161 interactive evaluation scenarios with 388 unique safety risks spanning 10 domestic safety categories, as shown above. From the perspective of evaluation timing, these risks are categorized as either pre-caution or post-caution, accounting for 24.2% and 75.8% of the total, respectively. To support the planning and execution of safety-aware tasks, we design 18 distinct skill primitives and implement them in the OmniGibson simulator; a minimal sketch of one possible representation follows.
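To make these notions concrete, the sketch below shows one way a scenario and its risks could be represented. All names here (SafetyRisk, Trigger, Scenario) are illustrative assumptions, not the benchmark's actual schema.

# Illustrative data model for a scenario and its safety risks.
# These class and field names are hypothetical.
from dataclasses import dataclass
from enum import Enum

class Trigger(Enum):
    PRE_CAUTION = "pre"    # mitigation must precede the risk-prone step
    POST_CAUTION = "post"  # mitigation must follow the risk-prone step

@dataclass
class SafetyRisk:
    category: str          # one of the 10 domestic safety categories
    description: str       # natural-language account of the hazard
    trigger: Trigger       # when the mitigation must occur
    goal_predicate: str    # PDDL-style condition checked in the simulator

@dataclass
class Scenario:
    task_instruction: str
    risks: list[SafetyRisk]   # drawn from the 388 unique safety risks
    skills: list[str]         # drawn from the 18 skill primitives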
We begin by prompting GPT-4o to extract safety principles that an agent must adhere to in the household scenes of the Behavior-1K dataset. Guided by these principles, we integrate corresponding safety risks, especially dynamic risks that emerge from an agent's actions, into the Behavior-1K household tasks by detecting existing hazards and strategically introducing new, risk-inducing objects. We then generate the safety goal conditions for each task; each goal pairs a natural-language description with a corresponding predicate expressed in the Planning Domain Definition Language (PDDL). Finally, we instantiate each task within the OmniGibson simulator so that every risk-aware task is reproducible.
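For illustration, a single annotated safety goal might look like the record below. The predicate vocabulary and object identifier are hypothetical; IS-Bench's actual annotations may use different naming.

# Hypothetical safety goal annotation (names are assumptions).
safety_goal = {
    "description": "Turn off the stove before leaving the kitchen.",
    "pddl_condition": "(not (toggled_on stove.burner_1))",
    "trigger": "post",  # checked after the risk-prone step completes
}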
We measure an agent's ability to complete a task while respecting all safety constraints within the interactive OmniGibson simulator. For each plan executed by an agent, our framework checks whether every annotated safety goal condition is satisfied according to its trigger. As a complementary analysis, we also evaluate the agent's explicit safety awareness: the agent is given the task instruction and initial visual context, then prompted to describe the potential safety issues it should consider before planning.
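A minimal sketch of such trigger-aware checking is given below, assuming per-step simulator state snapshots and a predicate evaluator; the function and field names are illustrative, not the benchmark's actual API.

# Sketch of process-oriented checking over an executed plan.
# snapshots[i] is the simulator state after executing step i;
# snapshots[0] is the initial state before any action.
def check_safety_goals(snapshots, goals, eval_predicate):
    results = {}
    for goal in goals:
        k = goal["risk_step"]  # index of the associated risk-prone action
        # "pre" goals must hold on the state *before* the risky step,
        # "post" goals on the state *after* it.
        state = snapshots[k] if goal["trigger"] == "pre" else snapshots[k + 1]
        results[goal["description"]] = eval_predicate(
            goal["pddl_condition"], state
        )
    return results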
We assess the interactive safety of various VLM-driven agents, including open-source models such as Qwen2.5-VL and InternVL2, alongside proprietary models such as GPT-4o and the Gemini-2.5 series.
We prompt VLM-driven agents to perform task planning under the following settings: L1 (an implicit safety reminder) and L2 (a safety CoT reminder), sketched below.
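The sketch below illustrates how the two settings could differ as prompt suffixes; the exact wording used in IS-Bench is not reproduced here, so treat these strings as assumptions.

# Illustrative prompt suffixes for the two settings (wording is assumed).
L1_SUFFIX = "Plan the steps to complete the task. Be mindful of safety."
L2_SUFFIX = (
    "Before planning, list the potential safety risks in the scene, "
    "reason step by step about how to mitigate each one, and then "
    "produce a plan whose steps respect those mitigations."
)

def build_prompt(task_instruction: str, setting: str) -> str:
    suffix = L1_SUFFIX if setting == "L1" else L2_SUFFIX
    return f"Task: {task_instruction}\n{suffix}"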
We summarize several key observations:
1. Current Embodied Agents Lack Interactive Safety Capability.
2. Safety-Aware CoT Improves Interactive Safety but Compromises Task Completion.
3. Suboptimal L1 and L2 Performance Stems from a Poor Ability to Proactively Perceive and Identify Risks in Dynamic Environments.
To investigate how multi-modal context, especially visual input, influences interactive safety, we conduct an ablation study over different auxiliary inputs: bounding boxes for manipulable objects (BBox), self-generated scene captions (Caption), and ground-truth descriptions of the initial object layout (IS).
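One way these ablated inputs could be attached to the agent's context is sketched below; the function signature and field names are hypothetical.

# Sketch of assembling the agent's multi-modal context (names assumed).
def build_context(image, bboxes=None, caption_fn=None, initial_scene_text=None):
    context = {"image": image, "text": []}
    if bboxes is not None:
        # Bounding boxes for manipulable objects (BBox)
        context["text"].append(f"Manipulable objects: {bboxes}")
    if caption_fn is not None:
        # Self-generated scene caption (Caption)
        context["text"].append(caption_fn(image))
    if initial_scene_text is not None:
        # Ground-truth initial-scene description (IS)
        context["text"].append(initial_scene_text)
    return context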
@article{lu2025bench,
  title={IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks},
  author={Lu, Xiaoya and Chen, Zeren and Hu, Xuhao and Zhou, Yijin and Zhang, Weichen and Liu, Dongrui and Sheng, Lu and Shao, Jing},
  journal={arXiv preprint arXiv:2506.16402},
  year={2025}
}