3D-Belief: A Generative 3D World Model for Embodied Reasoning and Planning

Abstract

Recent advances in visual generative models have shown the promise of learning generative world models. However, prior work has largely focused on rendering novel views of observed scenes or predicting future frames. While these models achieve impressive visual quality, they are not optimized to support downstream embodied reasoning and planning tasks. In this work, we theorize what properties generative world models should have to enhance embodied agents. As a first step, we focus on modeling an agent's beliefs over the 3D world and identify several key capabilities. These include spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed future prediction. We then propose 3D-Belief, a generative 3D world model that instantiates these capabilities. Unlike prior work, 3D-Belief predicts unseen regions in an explicit, actionable 3D representation from partial observations and updates this belief online as new observations arrive. It enables embodied agents to reason about the 3D world under partial observability and make sequential decisions based on up-to-date beliefs. We evaluate 3D-Belief on a novel 3D imagination benchmark, 3D-CORE, and challenging object navigation tasks. Experimental results show that robots driven by 3D-Belief outperform those using state-of-the-art models in both simulations and the real world.

Core Capabilities of 3D Belief Models

There remains a substantial gap between what previous visual generative models are trained to do and what a world model must provide for embodied decision making under partial observability. We theorize that robust embodied agents require a generative belief model with the following key capabilities.

Multi-hypothesis Belief Sampling

Sequential Belief Updating

Spatially Consistent Scene Memory

Semantically Informed Future Prediction
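To make two of these capabilities concrete, the following is a minimal, self-contained sketch of multi-hypothesis belief sampling and sequential belief updating over a toy 1D grid. It is purely illustrative and not the 3D-Belief architecture: the agent holds a probability over which cell contains a target, folds in noisy observations with a Bayes update, and draws candidate hypotheses from the current belief.

```python
import random

def update_belief(belief, cell, observed_target, p_hit=0.9, p_false=0.1):
    """One Bayesian update of P(target in cell i) after looking at `cell`.

    `observed_target` is whether the noisy sensor reported the target there;
    p_hit / p_false are the assumed detection and false-alarm rates.
    """
    posterior = []
    for i, prior in enumerate(belief):
        if observed_target:
            like = p_hit if i == cell else p_false
        else:
            like = (1 - p_hit) if i == cell else (1 - p_false)
        posterior.append(prior * like)
    total = sum(posterior)
    return [p / total for p in posterior]

def sample_hypotheses(belief, k, rng):
    """Draw k candidate target locations from the current belief."""
    return rng.choices(range(len(belief)), weights=belief, k=k)

rng = random.Random(0)
belief = [1 / 4] * 4  # uniform prior over 4 cells
# Sequential updating: two negative observations shift mass elsewhere.
belief = update_belief(belief, cell=2, observed_target=False)
belief = update_belief(belief, cell=0, observed_target=False)
# Multi-hypothesis sampling: plausible target locations under the belief.
hyps = sample_hypotheses(belief, k=3, rng=rng)
print(belief, hyps)
```

A full 3D belief model replaces the grid with an explicit 3D scene representation and the table of probabilities with a learned generative model, but the predict-then-renormalize structure of the update is the same idea.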

3D-Belief

We propose a generative 3D world model, 3D-Belief, that learns to predict and sequentially update explicit 3D representations of a scene online.
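At a high level, the online predict-and-update cycle can be sketched as the loop below. All class and method names here are hypothetical stand-ins, not the paper's API: a toy "belief model" keeps a persistent memory of observed cells and imagines unobserved ones by sampling, and the agent replans after every update.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ToyBeliefModel:
    """Illustrative stand-in for a generative 3D belief model.

    Tracks which cells of a tiny grid world have been observed and
    'imagines' the rest by sampling them as free or occupied.
    """
    known: dict = field(default_factory=dict)  # cell -> occupied?

    def update(self, observation):
        # Fold new observations into the persistent scene memory.
        self.known.update(observation)

    def sample_hypothesis(self, cells, rng):
        # Complete unseen cells with a sampled guess (one hypothesis).
        return {c: self.known.get(c, rng.random() < 0.5) for c in cells}

rng = random.Random(1)
model = ToyBeliefModel()
world = {0: False, 1: False, 2: True, 3: False}  # ground-truth occupancy
cells = list(world)

# Observe -> update belief -> imagine -> move toward a believed-free cell.
pos = 0
for _ in range(3):
    model.update({pos: world[pos]})
    hyp = model.sample_hypothesis(cells, rng)
    frontier = [c for c in cells if c not in model.known and not hyp[c]]
    if frontier:
        pos = min(frontier, key=lambda c: abs(c - pos))
print(sorted(model.known))
```

The real model operates on explicit 3D scene representations rather than labeled grid cells, but the decision loop is analogous: each new observation refines the belief, and planning always runs against the most recent belief rather than a stale initial prediction.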

Model Overview

Planning with 3D-Belief

Simulated Results

We evaluate 3D-Belief on open-vocabulary object navigation tasks in the AI2-THOR simulator, where an agent must find a specified target object in a previously unseen environment.

Qualitative Results

Real-World Results

We evaluate 3D-Belief on a real mobile manipulation platform (Hello Robot Stretch) in a mock apartment environment. Note that the environments, objects, and target descriptions are all unseen during training, making this a real-world open-vocabulary object navigation setting.

Example Setup

Real-World Setup

Qualitative Results

Open-Vocabulary Object Navigation: Search for a mug.

3D-Belief (Success)

Gemini-3.0 Agent (Failed - Collision)

3D-CORE Benchmark

We introduce 3D-CORE (3D COntextual REasoning), a benchmark for evaluating whether 3D world models learn the kinds of belief reasoning directly required for embodied decision-making. 3D-CORE probes three capabilities that matter for downstream embodied planning: (1) spatial expansion beyond what is currently observed, (2) semantic reasoning grounded in 3D structure, and (3) long-horizon consistency under large viewpoint changes.
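The benchmark's exact scoring is not detailed here, but completion tasks like these are commonly scored by volumetric overlap between the predicted and ground-truth geometry. As an assumed, illustrative metric (not necessarily the one 3D-CORE uses), voxel intersection-over-union can be computed as:

```python
def voxel_iou(pred, gt):
    """Intersection-over-union between two sets of occupied voxel coords."""
    pred, gt = set(pred), set(gt)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 1.0

# Two of three predicted voxels match; four distinct voxels overall.
pred = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
gt = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
score = voxel_iou(pred, gt)
print(score)  # 2 shared / 4 total = 0.5
```

A score of 1.0 means the predicted completion exactly matches the ground truth; hallucinated or missing structure lowers it symmetrically.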

Overview of the 3D-CORE benchmark

Object Completion

Room Completion

Object Permanence

Qualitative Results of the Object Permanence Task in 3D-CORE

Visual Quality

We compare the visual quality of 3D-Belief's predictions with state-of-the-art generative world models. 3D-Belief generates more accurate and spatially consistent predictions that better match the ground truth.

Conclusion

In this work, we studied how generative world models can better support embodied reasoning and planning, and identified key capabilities for practical 3D belief modeling. We then proposed 3D-Belief, which predicts unseen regions in an explicit 3D representation from partial observations and updates this belief online. Experimental results on contextual reasoning and object navigation show that 3D-Belief improves both success rate and efficiency over existing generative world models.

BibTeX

@misc{yin20263dbelief,
  title={3D-Belief: A Generative 3D World Model for Embodied Reasoning and Planning},
  author={Yin, Yifan and Wen, Zehao and Chen, Jieneng and Zheng, Zehan and Dai, Nanru and Shi, Haojun and Ye, Suyu and Huang, Aydan and Zhang, Zheyuan and Yuille, Alan and Xie, Jianwen and Tewari, Ayush and Shu, Tianmin},
  year={2026},
  url={https://3d-belief.github.io/}
}