RoST3R:
Robot-Aware Dynamic 3D Reconstruction for Robotic Manipulation

Author Names Omitted for Anonymous Review.



RoST3R reconstructs robot-aware, scale-aligned 3D scene representations directly from RGB images to boost policy learning. Policies trained on these 3D representations generalize better than policies trained on 2D images.

[Paper]      [arXiv]      [Interactive Results 🔥]      [Code]

Interactive 4D Visualization

Mouse Controls

Left Click: drag to rotate the view
Right Click: drag to pan the view
Scroll Wheel: zoom in / out

Keyboard Controls

W / S: move forward / backward
A / D: move left / right
Q / E: move up / down

Abstract

3D scene representations offer stronger generalization for policy learning than 2D representations, yet collecting such 3D data has required special sensors. Prior methods can reconstruct 3D from video, but they have been unsuitable for robot learning due to reconstruction error and the lack of metric calibration. In this work, we demonstrate that 3D scene representations can be reliably reconstructed from standard 2D RGB images, making 3D-aware learning both accessible and practical for robots. We propose a novel framework, RoST3R (Robot MonST3R), that incrementally reconstructs dynamic 3D scenes at metric scale from RGB images, enabling 3D-aware policy learning in complex environments from only 2D inputs. At its core, our approach estimates the robot's pose during scene reconstruction, registers its kinematic structure within the environment, and builds a unified 3D scene representation. This unified representation offers two key benefits: it enables policy learning at metric scale in a consistent world frame, decoupling object and camera dynamics, and it provides a coherent model of the robot and environment that supports fine-grained spatial reasoning. Notably, while the input remains 2D, our approach produces a 3D-aware representation that significantly improves generalization. Experiments show that policies trained with this 3D representation outperform those trained on 2D inputs, particularly in tasks involving environmental variations, novel viewpoints, and camera motion. In simulation, our method outperforms 2D counterparts by 24.5% under environmental variations and dynamic camera motion; in real-world scenarios, it achieves a 29.5% performance improvement.

Robot-Aware Scale-Aligned Dynamic 3D Reconstruction

RoST3R extends MonST3R to incremental dynamic 3D reconstruction in world coordinates by adapting the pair-sampling strategy for global streaming pointmap optimization (Section III-A). Then, by aligning the robot's 3D model with the 2D observation in each frame (Section III-B), RoST3R reliably registers the robot into the environment in a unified 3D space and calibrates the environment reconstruction to metric scale (Section III-C).
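For concreteness, the sketch below illustrates two of the pieces above in NumPy: a sliding-window pair-sampling scheme suitable for streaming optimization, and a least-squares scale fit as one way to bring an up-to-scale reconstruction to metric scale using the robot's known geometry. The function names, the window size, and the assumption of known correspondences between reconstructed robot points and the metric robot model are illustrative choices on our part, not the released implementation.

```python
import numpy as np

def sliding_window_pairs(num_frames, window=4):
    # Streaming pair sampling: rather than the dense all-pairs graph of an
    # offline optimizer, each incoming frame is paired only with the `window`
    # most recent frames, so the global pointmap optimization can be updated
    # incrementally as frames arrive.
    return [(t - dt, t) for t in range(num_frames)
            for dt in range(1, window + 1) if t - dt >= 0]

def fit_metric_scale(recon_pts, model_pts):
    # Least-squares scale s minimizing ||s * recon - model||^2, assuming the
    # up-to-scale reconstructed robot points are already in correspondence
    # with points on the known metric-scale robot model.
    recon = np.asarray(recon_pts, dtype=float).ravel()
    model = np.asarray(model_pts, dtype=float).ravel()
    return float(recon @ model) / float(recon @ recon)

# Toy check: a reconstruction at half the true size needs scale ~2.
model = np.random.rand(100, 3)
print(sliding_window_pairs(6, window=2))   # [(0, 1), (0, 2), (1, 2), ...]
print(fit_metric_scale(0.5 * model, model))  # -> 2.0
```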

Results - Robot Pose Estimation

As visualized below, our framework accurately estimates the robot pose in real-world scenarios (Panda 3CAM) and under partial occlusion (RoboVerse): in each image, the robot mesh is projected onto the frame using the pose estimated by our method.

Visualization of pose estimation
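The overlay itself reduces to a standard pinhole projection of the mesh vertices under the estimated pose. A minimal sketch, assuming a camera-from-robot rotation and translation (R, t) and intrinsics K with illustrative values (the paper's exact conventions may differ):

```python
import numpy as np

def project_mesh(vertices, R, t, K):
    # Project robot-mesh vertices (N, 3, metric robot frame) into the image
    # using the estimated camera-from-robot pose (R, t) and intrinsics K.
    # Returns (N, 2) pixel coordinates; minimal pinhole model, no distortion.
    cam = vertices @ R.T + t          # robot frame -> camera frame
    uvw = cam @ K.T                   # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]   # perspective divide

# Toy example: a point one meter in front of a 640x480 camera lands at the
# principal point.
K = np.array([[500., 0., 320.],
              [0., 500., 240.],
              [0., 0., 1.]])
R, t = np.eye(3), np.array([0., 0., 1.])
print(project_mesh(np.zeros((1, 3)), R, t, K))  # -> [[320. 240.]]
```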

Simulation Results - RoboVerse

Quantitatively, our RoST3R 3D representation demonstrates stronger generalization than its 2D-based counterparts.

Generalization levels used for evaluation on the RoboVerse benchmark

Real-World Results

Qualitative comparison of real-world task executions using Diffusion Policy (Left) and RoST3R-DP3 (Right), shown at 3× speed.

DP
RoST3R-DP3

Quantitatively, our method outperforms the 2D-based Diffusion Policy by 29.5%, highlighting the importance of 3D reasoning capabilities.