Better Spatial Perception for Robots: MolmoAct creates spatial maps for robots to plot their actions before executing text directions
Robot control systems that map text instructions directly to motions struggle to translate words into movement through space. Researchers developed a system that enables robots to plan spatial paths before executing text instructions.
What’s new: Jason Lee and colleagues at the Allen Institute for AI and the University of Washington introduced MolmoAct, a robotics action system that improved a robot arm’s ability to manipulate objects and perform multi-step tasks by first estimating spatial depth and planning motion paths. The weights and code are available for both commercial and noncommercial uses under the Apache 2.0 license, while the authors’ fine-tuning dataset is available under CC BY 4.0.
Key insight: Natural-language instructions don’t translate precisely into spatial directions. Just as humans can navigate more effectively with a map, robots perform more accurately given a sense of 3D space (a depth map) and the desired trajectory (a motion path drawn over a camera’s view). Along with a command like “take the cup off the table and put it in the trash,” the additional information enables a robot to avoid collisions with objects and move more precisely.
How it works: MolmoAct uses a pretrained SigLIP2 vision transformer to encode camera images into tokens. Given the image tokens and a text instruction, a pretrained Qwen2.5-7B large language model learned to generate tokens that represent (i) a depth map, (ii) a motion path, and (iii) changes in joint positions.
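For readers who want a concrete picture, here is a minimal, purely illustrative PyTorch sketch of that token flow. It is not the authors’ implementation: a toy patch embedder stands in for the SigLIP2 encoder, a tiny two-layer transformer stands in for Qwen2.5-7B, and the class and parameter names (such as ToySpatialPolicy) are placeholders of our own.

```python
import torch
import torch.nn as nn

class ToySpatialPolicy(nn.Module):
    """Toy stand-in for the image-plus-text-to-tokens flow described above."""
    def __init__(self, vocab_size=32000, d_model=256):
        super().__init__()
        # Stand-in for the SigLIP2 vision transformer: patchify a frame into tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=32, stride=32)
        # Stand-in for the Qwen2.5-7B decoder that consumes image and text tokens.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # One head over a vocabulary that includes depth, motion-path, and action tokens.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, instruction_ids):
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, 49, d)
        txt_tokens = self.text_embed(instruction_ids)                    # (B, T, d)
        h = self.backbone(torch.cat([img_tokens, txt_tokens], dim=1))
        return self.lm_head(h)

policy = ToySpatialPolicy()
frame = torch.randn(1, 3, 224, 224)            # one camera image
command = torch.randint(0, 32000, (1, 12))     # tokenized text instruction
logits = policy(frame, command)
# At inference, the model decodes autoregressively: depth-map tokens first,
# then motion-path tokens, then changes in joint positions.
print(logits.shape)  # (1, 49 + 12, 32000)
```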
- The authors started with 24.3 million robot demonstrations of tasks such as “pick up the water bottle from the drawer and put it on the desk.” Each example included a text instruction, camera views, and changes in joint positions. The authors augmented the examples with depth maps and motion paths: they generated the depth maps using Depth Anything 2, a pretrained depth estimator, and they produced visual paths by tracking the robot arm’s gripper in the camera images using Molmo, a pretrained vision-language model. (A toy sketch of assembling one such augmented example appears after this list.)
- They trained Qwen2.5-7B on the augmented dataset. Given a text instruction and camera image, the model learned to generate tokens that represented, in this order, (i) a depth map, (ii) a visual path, and (iii) changes in joint positions.
- To improve the system’s vision-language understanding, they further pretrained both models on 2 million examples of images and text scraped from the web.
- The authors fine-tuned the models, using next-token prediction, on more than 2 million examples that they collected themselves of robots performing various tasks from start to finish. The examples included various combinations of text instructions, camera views, changes in joint positions, depth maps, and motion paths. (A minimal sketch of the next-token objective also follows this list.)
- At inference, users can see the planned motion path before the robot moves and revise it by redrawing it on a tablet. This capability makes the robot’s actions interpretable and lets users correct potential errors before they happen.
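The sketch below shows, in hypothetical form, how one augmented training example might be assembled from a demonstration. The functions estimate_depth and trace_gripper are placeholders for Depth Anything 2 and Molmo-based gripper tracking, and all shapes and values are invented for illustration.

```python
import numpy as np

def estimate_depth(frame: np.ndarray) -> np.ndarray:
    """Placeholder for a monocular depth estimator (Depth Anything 2 in the paper)."""
    return frame.mean(axis=-1)  # fake per-pixel depth

def trace_gripper(frames: list[np.ndarray]) -> list[tuple[int, int]]:
    """Placeholder for locating the gripper in each frame (done with Molmo in the paper)."""
    return [(10 + 2 * t, 30 + t) for t in range(len(frames))]  # fake 2D image path

def build_example(instruction: str, frames, joint_deltas):
    first = frames[0]
    return {
        "instruction": instruction,
        "image": first,
        "depth_map": estimate_depth(first),    # target (i)
        "motion_path": trace_gripper(frames),  # target (ii)
        "joint_deltas": joint_deltas,          # target (iii)
    }

frames = [np.random.rand(224, 224, 3) for _ in range(5)]
deltas = np.random.randn(5, 7) * 0.01  # joint-angle changes per step (joint count depends on the robot)
example = build_example("put the cup in the trash", frames, deltas)
print(example["depth_map"].shape, len(example["motion_path"]))
```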
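And here is a minimal sketch of the next-token objective over the concatenated target sequence: depth tokens, then motion-path tokens, then action tokens. The token counts and vocabulary size are arbitrary, and how MolmoAct discretizes depth maps and paths into tokens is not covered here.

```python
import torch
import torch.nn.functional as F

# Pretend these were produced by discretizing the three targets into the
# model's vocabulary (purely illustrative token IDs).
depth_tokens  = torch.randint(0, 32000, (1, 16))
path_tokens   = torch.randint(0, 32000, (1, 8))
action_tokens = torch.randint(0, 32000, (1, 7))
targets = torch.cat([depth_tokens, path_tokens, action_tokens], dim=1)

logits = torch.randn(1, targets.shape[1], 32000, requires_grad=True)  # stand-in model output
# Standard causal shift: each position predicts the next target token.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, 32000),
    targets[:, 1:].reshape(-1),
)
loss.backward()
print(float(loss))
```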
Results: The authors tested MolmoAct’s performance using one or two Franka robotic arms in simulation as well as on 15 real-world tasks, including opening a container, putting trash in a bin, and folding a towel. On average, the system outperformed all competitors.
- MolmoAct achieved 86.6 percent average success on diverse simulated challenges in LIBERO. The closest competitor, π0-FAST, achieved 85.5 percent average success.
- In real-world tasks, MolmoAct achieved 0.679 average task progress (a 0-to-1 score that represents how much of each task the robot completed, higher is better), while π0-FAST achieved 0.446 average task progress.
Why it matters: Earlier robotic control systems that use LLMs to interpret text instructions map visual input and text instructions directly to low-level actions without explicitly representing 3D space or visual motion paths. MolmoAct's approach makes such systems more precise, adaptable, and explainable.
We’re thinking: This robot system is definitely not lost in space!