Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models

University of Maryland, College Park
* Equal contribution. Equal advising.

The IVE (Imagine-Verify-Execute) agent autonomously explores Tangram pieces in the real world (top row), common objects (middle row), and objects in simulation (bottom row). Across these tasks, IVE converts visual input into semantic scene graphs, imagines novel configurations, verifies their physical feasibility, and executes actions to gather diverse, semantically grounded data for downstream learning.

Abstract

Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8× increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.
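For concreteness, here is a minimal sketch of the kind of semantic scene-graph abstraction described above. The object names, relation vocabulary, and data structure are illustrative assumptions, not the schema IVE actually uses.

from dataclasses import dataclass, field

# Sketch of a semantic scene graph for a tabletop scene.
# Object and relation names below are hypothetical examples.
@dataclass(frozen=True)
class Relation:
    subject: str    # e.g., "red_block"
    predicate: str  # e.g., "on_top_of", "left_of", "inside"
    obj: str        # e.g., "blue_bowl"

@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)
    relations: set = field(default_factory=set)

    def canonical(self) -> frozenset:
        """Hashable form, useful for counting unique visited states."""
        return frozenset(self.objects) | frozenset(self.relations)

# Example: "red block on the blue bowl, green block left of the bowl"
g = SceneGraph(
    objects={"red_block", "green_block", "blue_bowl"},
    relations={
        Relation("red_block", "on_top_of", "blue_bowl"),
        Relation("green_block", "left_of", "blue_bowl"),
    },
)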

Method Overview

Overview of IVE (Imagine, Verify, Execute). The Scene Describer constructs a scene graph from observations, the Explorer imagines novel configurations guided by memory retrieval, and the Verifier predicts the physical plausibility of proposed transitions. Verified plans are executed using action tools. Exploration is structured around semantic reasoning, verification, and physically grounded interaction.
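A minimal sketch of this loop is given below, assuming hypothetical describe/imagine/verify/execute interfaces and a simple memory object. It illustrates the control flow only and is not the authors' implementation.

# Sketch of the Imagine-Verify-Execute loop; all interfaces are assumed.
def ive_episode(env, describer, explorer, verifier, memory, num_steps=50):
    obs = env.reset()                       # RGB-D observation
    for _ in range(num_steps):
        graph = describer(obs)              # abstract the observation into a scene graph
        retrieved = memory.retrieve(graph)  # related past experience
        # Imagine: propose a novel target configuration and a skill sequence.
        target_graph, plan = explorer.imagine(graph, retrieved)
        # Verify: keep only transitions predicted to be physically plausible.
        if not verifier.is_plausible(graph, target_graph, plan):
            continue
        # Execute: run the verified skill sequence via action tools.
        for skill in plan:
            obs = env.execute(skill)
        memory.add(graph, plan, describer(obs))  # store the observed outcome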

IVE vs. Human Exploration

IVE makes it easy to collect diverse and informative exploration data at scale. Here we show a human expert collecting data using OpenTeach, a popular tool for collecting expert demonstrations (left), a human-in-the-loop collecting data using the same action tools exposed to IVE (middle), and IVE collecting data autonomously through embodied exploration (right).


Results and Ablation

Experiment results

Exploration capability evaluation across simulated and real-world environments. Top: Growth curves of the number of unique scene graphs visited. Bottom: Diversity of visited states, measured by entropy.
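One straightforward way to compute these two quantities from a log of visited scene graphs, treating each canonicalized graph as a discrete state, is sketched below; the paper's exact counting and binning may differ.

import math
from collections import Counter

# visited_graphs: iterable of hashable, canonicalized scene graphs (one per step)
def unique_states(visited_graphs):
    return len(set(visited_graphs))

def visitation_entropy(visited_graphs):
    counts = Counter(visited_graphs)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())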


Ablations

Ablation study of IVE. (Top) Illustration of each variant, highlighting removed modules in gray. (Bottom) Exploration performance is measured by the number of unique scene graphs, entropy, empowerment, and information gain (see Appendix A for metric details). Removing retrieval memory causes the largest performance drop, emphasizing the importance of experience-grounded generation. Verifier removal slightly degrades performance, particularly in terms of information gain, but empowerment remains relatively stable. The random tool selector baseline performs the worst across all metrics. Our full model (IVE) approaches human-level exploration efficiency.
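For reference, commonly used forms of the two less familiar metrics are sketched below; the exact estimators used in the paper are given in Appendix A and may differ.

\text{Empowerment:}\quad \mathcal{E}(s) = \max_{p(a)} \, I\!\left(A;\, S' \mid S = s\right)

\text{Information gain:}\quad \mathrm{IG}(s, a, s') = D_{\mathrm{KL}}\!\left( p\!\left(\theta \mid \mathcal{D} \cup \{(s, a, s')\}\right) \,\Vert\, p\!\left(\theta \mid \mathcal{D}\right) \right)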


Ablation baselines

Exploring with Embodied Agents: IVE powered by different Vision-Language Models (VLMs), compared against a human expert. The plots show performance across four key metrics as a function of interaction steps: (Left) the growth in the number of unique scene graphs discovered, (Middle Left) the entropy of visited states (a measure of diversity), (Middle Right) empowerment (the agent's ability to influence future states), and (Right) information gain (the amount of new information acquired). Notably, IVE, regardless of the VLM used, surpasses the human expert in generating unique scene graphs, achieving higher state diversity, and gaining more information.


Table 1: Success rates (%) of goal-conditional behavior cloning across tasks in simulation. Our method achieves human-level performance and significantly outperforms exploration RL baselines (RND (Burda et al., 2019) and RE3 (Seo et al., 2021)), demonstrating the effectiveness of our exploration strategy in generating diverse and semantically meaningful data.
| Environment            | Exploration Method                                     | Non-conditional: # of achieved tasks | Non-conditional: Entropy | Goal-conditional: Success rate |
|------------------------|--------------------------------------------------------|--------------------------------------|--------------------------|--------------------------------|
| VIMA Bench (5 objects) | SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 2.0                                  | 1.907                    | 8.33%                          |
|                        | SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021)   | 2.1                                  | 1.754                    | 0.00%                          |
|                        | IVE (Ours)                                             | 4.1                                  | 2.283                    | 58.33%                         |
|                        | Human                                                  | 3.6                                  | 2.021                    | 50.00%                         |
| VIMA Bench (4 objects) | SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 2.0                                  | 1.907                    | 0.00%                          |
|                        | SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021)   | 1.2                                  | 1.959                    | 0.00%                          |
|                        | IVE (Ours)                                             | 3.1                                  | 1.528                    | 41.67%                         |
|                        | Human                                                  | 4.2                                  | 1.897                    | 33.33%                         |
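A minimal sketch of goal-conditional behavior cloning on data like this follows; the policy architecture, input format, and loss are illustrative assumptions and not the authors' training setup.

import torch
import torch.nn as nn

# Hypothetical goal-conditioned policy: concatenates observation and goal features.
class GoalConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

def bc_step(policy, optimizer, obs, goal, demo_action):
    """One behavior-cloning update: regress the demonstrated action."""
    pred = policy(obs, goal)
    loss = nn.functional.mse_loss(pred, demo_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()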

Table 2: Quantitative evaluation of World Model (WM) predictions using datasets collected by different exploration methods, trained with DINO-WM (Zhou et al., 2024a).
| Exploration Method                                     | Sim Env 1 SSIM (↑) | Sim Env 1 LPIPS (↓) | Sim Env 2 SSIM (↑) | Sim Env 2 LPIPS (↓) | Real World SSIM (↑) | Real World LPIPS (↓) |
|--------------------------------------------------------|--------------------|---------------------|--------------------|---------------------|---------------------|----------------------|
| SAC (Haarnoja et al., 2018) + RND (Burda et al., 2019) | 0.812 ± 0.039      | 0.198 ± 0.060       | 0.855 ± 0.036      | 0.168 ± 0.061       | -                   | -                    |
| SAC (Haarnoja et al., 2018) + RE3 (Seo et al., 2021)   | 0.814 ± 0.040      | 0.199 ± 0.057       | 0.850 ± 0.034      | 0.168 ± 0.059       | -                   | -                    |
| IVE (Ours)                                             | 0.837 ± 0.032      | 0.129 ± 0.044       | 0.853 ± 0.032      | 0.160 ± 0.058       | 0.634 ± 0.075       | 0.181 ± 0.056        |
| Human                                                  | 0.833 ± 0.032      | 0.126 ± 0.042       | 0.862 ± 0.027      | 0.139 ± 0.047       | 0.653 ± 0.072       | 0.194 ± 0.056        |
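A sketch of how SSIM and LPIPS numbers like those above could be computed for predicted versus ground-truth frames, assuming the scikit-image and lpips packages and HxWx3 uint8 frames (a tooling assumption, not necessarily what the authors used):

import numpy as np
import torch
import lpips                                    # pip install lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")              # perceptual distance (lower is better)

def evaluate_frame(pred, target):
    """Return (SSIM, LPIPS) for one predicted frame vs. its ground truth."""
    ssim = structural_similarity(pred, target, channel_axis=-1)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).float() / 127.5 - 1.0
    dist = lpips_fn(to_tensor(pred)[None], to_tensor(target)[None]).item()
    return ssim, dist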

Prompts

Scene Describer
Explorer
Verifier

BibTeX

@article{lee2025imagine,
    title={Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models},
    author={Lee, Seungjae and Ekpo, Daniel and Liu, Haowen and Huang, Furong and Shrivastava, Abhinav and Huang, Jia-Bin},
    year={2025},
    eprint={2505.07815},
    archivePrefix={arXiv}
}