The IVE (Imagine-Verify-Execute) agent autonomously explores Tangram pieces in the real world (top row), common objects (middle row), and objects in simulation (bottom row). Across these tasks, IVE converts visual input into semantic scene graphs, imagines novel configurations, verifies their physical feasibility, and executes actions to gather diverse, semantically grounded data for downstream learning.
Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8× increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
Overview of IVE (Imagine, Verify, Execute). The Scene Describer constructs a scene graph from observations, the Explorer imagines novel configurations guided by memory retrieval, and the Verifier predicts the physical plausibility of proposed transitions. Verified plans are executed using action tools. Exploration is structured around semantic reasoning, verification, and physically grounded interaction.
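To make the loop concrete, here is a minimal Python sketch of one Imagine-Verify-Execute step. All class and method names (`SceneDescriber`-style `describer`, `explorer`, `verifier`, `memory.retrieve`, etc.) are hypothetical stand-ins for the modules described above, not the authors' actual API.

```python
# A minimal sketch of one IVE step, assuming hypothetical interfaces for the
# Scene Describer, Explorer, Verifier, memory, and action tools.

def ive_step(obs_rgbd, memory, describer, explorer, verifier, tools):
    """One exploration step: describe -> imagine -> verify -> execute."""
    # 1. Abstract the RGB-D observation into a semantic scene graph.
    scene_graph = describer.describe(obs_rgbd)

    # 2. Retrieve similar past experiences to ground imagination in memory.
    context = memory.retrieve(scene_graph, k=5)

    # 3. Imagine a novel target configuration conditioned on retrieved context.
    imagined_graph, plan = explorer.imagine(scene_graph, context)

    # 4. Verify the physical plausibility of the proposed transition.
    if not verifier.is_plausible(scene_graph, imagined_graph, plan):
        return None  # discard implausible proposals; re-imagine next step

    # 5. Execute the verified plan with low-level action tools (e.g., pick/place).
    for skill, kwargs in plan:
        tools[skill](**kwargs)

    # 6. Store the verified transition for future retrieval.
    memory.add(scene_graph, imagined_graph, plan)
    return imagined_graph
```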
IVE makes it easy to collect diverse and informative exploration data at scale. Here we show a human expert collecting data using OpenTeach, a popular tool for gathering expert demonstrations (left); a human-in-the-loop using the same action tools exposed to IVE (middle); and IVE collecting data autonomously through embodied exploration (right).
Exploration capability evaluation across simulated and real-world environments. Top: Growth curves of the number of unique scene graphs visited. Bottom: Diversity of visited states, measured by entropy.
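The two quantities plotted above can be computed directly from the exploration log. Below is a minimal sketch, assuming each visited state is represented as a set of (subject, relation, object) triples and that diversity is the Shannon entropy of the empirical visitation distribution; the paper's exact canonicalization may differ.

```python
# Hedged sketch of the two exploration metrics plotted above: unique scene
# graphs visited and entropy of the visitation distribution.
from collections import Counter
import math

def canonical_key(scene_graph):
    """Collapse a scene graph (a set of relation triples) to a hashable key."""
    return tuple(sorted(scene_graph))

def unique_scene_graphs(visited):
    """Number of distinct scene-graph states in the exploration log."""
    return len({canonical_key(g) for g in visited})

def visitation_entropy(visited):
    """Shannon entropy (nats) of the empirical distribution over visited states."""
    counts = Counter(canonical_key(g) for g in visited)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```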
Ablation study of IVE. (Top) Illustration of each variant, highlighting removed modules in gray. (Bottom) Exploration performance is measured by the number of unique scene graphs, entropy, empowerment, and information gain (see Appendix A for metric details). Removing retrieval memory causes the largest performance drop, emphasizing the importance of experience-grounded generation. Verifier removal slightly degrades performance, particularly in terms of information gain, but empowerment remains relatively stable. The random tool selector baseline performs the worst across all metrics. Our full model (IVE) approaches human-level exploration efficiency.
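For readers unfamiliar with the latter two metrics, the standard forms from the exploration literature are sketched below. The exact formulations used by IVE are given in Appendix A, so treat these as a hedged reference rather than the paper's definitions.

```latex
% Standard forms from the exploration literature (a sketch only; see
% Appendix A for the formulations actually used in the paper).
% Empowerment: maximal mutual information between an action and the
% resulting next state, given the current state s_t.
\mathcal{E}(s_t) = \max_{p(a_t)} \, I(a_t;\, s_{t+1} \mid s_t)

% Information gain: expected reduction in uncertainty about dynamics-model
% parameters \theta after a new transition, with h_t the history up to t.
\mathrm{IG}_t = \mathbb{E}_{s_{t+1}}\!\left[ D_{\mathrm{KL}}\!\left(
  p(\theta \mid h_t, a_t, s_{t+1}) \,\|\, p(\theta \mid h_t) \right) \right]
```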
Exploring with Embodied Agents. IVE, powered by different Vision-Language Models (VLMs), compared against a human expert. The plots show performance across four key metrics as a function of the number of interactions: (Left) the growth in the number of unique scene graphs discovered, (Middle Left) the entropy of visited states (a measure of diversity), (Middle Right) empowerment (the agent's ability to influence future states), and (Right) information gain (the amount of new information acquired). Notably, IVE, regardless of the VLM used, surpasses the human expert in generating unique scene graphs, achieving higher state diversity, and gaining more information.
Exploration results on VIMA Bench. "# of achieved tasks" and "Entropy" are non-conditional exploration metrics; "Success rate" is the goal-conditional metric.

| Benchmark | Exploration Method | # of achieved tasks | Entropy | Success rate |
| --- | --- | --- | --- | --- |
| VIMA Bench (5 objects) | SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 2.0 | 1.907 | 8.33% |
| VIMA Bench (5 objects) | SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021) | 2.1 | 1.754 | 0.00% |
| VIMA Bench (5 objects) | IVE (Ours) | 4.1 | 2.283 | 58.33% |
| VIMA Bench (5 objects) | Human | 3.6 | 2.021 | 50.00% |
| VIMA Bench (4 objects) | SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 2.0 | 1.907 | 0.00% |
| VIMA Bench (4 objects) | SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021) | 1.2 | 1.959 | 0.00% |
| VIMA Bench (4 objects) | IVE (Ours) | 3.1 | 1.528 | 41.67% |
| VIMA Bench (4 objects) | Human | 4.2 | 1.897 | 33.33% |
SSIM (↑) and LPIPS (↓) across two simulated environments and the real world.

| Exploration Method | Sim Env 1 SSIM (↑) | Sim Env 1 LPIPS (↓) | Sim Env 2 SSIM (↑) | Sim Env 2 LPIPS (↓) | Real World SSIM (↑) | Real World LPIPS (↓) |
| --- | --- | --- | --- | --- | --- | --- |
| SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 0.812 ± 0.039 | 0.198 ± 0.060 | 0.855 ± 0.036 | 0.168 ± 0.061 | - | - |
| SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021) | 0.814 ± 0.040 | 0.199 ± 0.057 | 0.850 ± 0.034 | 0.168 ± 0.059 | - | - |
| IVE (Ours) | 0.837 ± 0.032 | 0.129 ± 0.044 | 0.853 ± 0.032 | 0.160 ± 0.058 | 0.634 ± 0.075 | 0.181 ± 0.056 |
| Human | 0.833 ± 0.032 | 0.126 ± 0.042 | 0.862 ± 0.027 | 0.139 ± 0.047 | 0.653 ± 0.072 | 0.194 ± 0.056 |
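As a reference for the numbers above, here is a minimal sketch of how such image-similarity scores are typically computed with scikit-image and the `lpips` package. The evaluation protocol assumed here (comparing an achieved frame against a goal image) is our assumption, not necessarily the paper's exact pipeline.

```python
# Hedged sketch: SSIM and LPIPS between an achieved frame and a goal image.
# Library calls are real (scikit-image >= 0.19, lpips); the protocol is assumed.
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # standard AlexNet-based LPIPS

def image_similarity(achieved, goal):
    """achieved, goal: HxWx3 uint8 RGB arrays. Returns (ssim, lpips) scores."""
    ssim = structural_similarity(achieved, goal, channel_axis=-1)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_t = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_t(achieved), to_t(goal)).item()
    return ssim, lp
```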
@article{lee2025imagine,
title={Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models},
author={Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
year={2025},
eprint={2505.07815},
archivePrefix={arXiv}
}