Training Cars to Reason: Nvidia’s Alpamayo-R1 is a robotics-style reasoning model for autonomous vehicles
Chain-of-thought reasoning can help autonomous vehicles decide what to do next.
What’s new: Nvidia released Alpamayo-R1, a vision-language-action model for autonomous vehicles that uses reasoning to reduce potential collisions.
- Input/output: 2 seconds of video from each of four cameras, text commands, and position and rotation history in; reasoning text and 6.4 seconds of the vehicle’s future trajectory (position and rotation) out, with 99 milliseconds of latency running on an Nvidia RTX Pro 6000 (Blackwell). (A hypothetical sketch of this interface follows this list.)
- Architecture: Transformer encoder (8.2 billion parameters), transformer decoder (2.3 billion parameters)
- Performance: In simulation, fewer “close encounters” (distance unspecified) with other vehicles than a non-reasoning version
- Availability: Weights available to download for noncommercial uses
- Undisclosed: Performance comparisons to competing models, training datasets, and the reward model used in training
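Nvidia hasn’t published the exact tensor formats, so the following is a minimal sketch of what the model’s input/output interface might look like, assuming a 10 frames-per-second feed and a 6-dimensional pose (3D position plus rotation). All names and shapes are illustrative, not Nvidia’s actual API.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical interface based on the specs above. The 10 fps frame rate
# and 6-dim pose (position + rotation) are assumptions, not published specs.

@dataclass
class DrivingInput:
    video: np.ndarray         # (4 cameras, 20 frames, H, W, 3): 2 s at an assumed 10 fps
    command: str              # optional text command, e.g., "merge left"
    pose_history: np.ndarray  # (20, 6): past positions and rotations over 2 s

@dataclass
class DrivingOutput:
    reasoning: str            # chain-of-thought text explaining the decision
    trajectory: np.ndarray    # (64, 6): future positions and rotations covering 6.4 s
```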
How it works: Alpamayo-R1 comprises Cosmos-Reason1 (a vision-language model that’s pretrained to describe actions) and a diffusion transformer that produces vehicle trajectory data. Given video frames and trajectory data that represent the last 2 seconds, plus any verbal commands, Cosmos-Reason1 produces reasoning text. Given Cosmos-Reason1’s embeddings of the video frames, previous trajectory data, and reasoning text, the diffusion transformer produces future trajectory data. (A minimal sketch of this two-stage pipeline appears after the list below.) The authors trained the system in three phases:
- The authors trained Alpamayo-R1 to generate actions across multiple fields, including healthcare, logistics, retail, and manufacturing, as well as autonomous driving.
- They trained Alpamayo-R1 to reason and produce actions using 80,000 hours of videos and vehicle motion data labeled with human- or machine-produced reasoning. The reasoning text included up to two decisions at any given video frame (such as stop, set speed, or merge) as well as any number of rationales for those decisions (such as a pedestrian in a crosswalk, lanes merging ahead, or road construction).
- They further trained the system via reinforcement learning to improve its reasoning skills and align its reasoning with its actions. Specifically, they rewarded the system based on (i) how well its reasoning matched ground-truth reasoning according to an unspecified reward model, (ii) how well its reasoning aligned with its subsequent actions according to simple rules, (iii) how well its output actions matched ground-truth actions, (iv) whether predicted actions led to collisions, and (v) how smoothly the vehicle executed its actions. (A toy version of this composite reward is sketched after this list.)
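To make the two-stage pipeline concrete, here’s a minimal, runnable sketch. StubVLM and StubDiffusion are toy stand-ins for Cosmos-Reason1 and the diffusion transformer (Nvidia’s actual classes and shapes aren’t public), and the 64-waypoint trajectory is an assumption.

```python
import torch

class StubVLM:
    """Stand-in for Cosmos-Reason1: emits reasoning text plus an embedding."""
    def __call__(self, video, pose_history, command):
        return "Pedestrian in crosswalk; decision: stop.", torch.zeros(1, 512)

class StubDiffusion:
    """Stand-in for the trajectory diffusion transformer."""
    def denoise_step(self, traj, step, cond):
        return 0.9 * traj  # toy update; a real model predicts and removes noise

def plan(vlm, diffusion, video, pose_history, command, steps=10):
    # Stage 1: the VLM reasons over the last 2 seconds of context.
    reasoning, embedding = vlm(video, pose_history, command)
    # Stage 2: iteratively denoise a future trajectory, conditioned on the embedding.
    traj = torch.randn(64, 6)  # assumed 64 waypoints x (position, rotation)
    for step in reversed(range(steps)):
        traj = diffusion.denoise_step(traj, step, cond=embedding)
    return reasoning, traj

reasoning, traj = plan(StubVLM(), StubDiffusion(), video=None, pose_history=None, command="")
```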
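Similarly, the composite reinforcement-learning reward might combine the five terms roughly as below. The paper’s reward model and weights are undisclosed, so every term here is a toy stand-in and the equal weighting is purely illustrative.

```python
import numpy as np

def total_reward(r_reasoning, r_consistency, actions, gt_actions, collided):
    """Toy composite reward mirroring terms (i)-(v) above; equal weights assumed."""
    # (i) r_reasoning: score from a learned reward model vs. ground-truth reasoning
    # (ii) r_consistency: rule-based agreement between reasoning and actions, in [0, 1]
    r_action = -float(np.mean((actions - gt_actions) ** 2))  # (iii) trajectory error
    r_safe = 0.0 if collided else 1.0                        # (iv) collision avoidance
    jerk = np.diff(actions[:, :3], n=3, axis=0)              # (v) smoothness via jerk penalty
    r_smooth = -float(np.mean(jerk ** 2)) if len(jerk) else 0.0
    return r_reasoning + r_consistency + r_action + r_safe + r_smooth

actions, gt = np.zeros((64, 6)), np.zeros((64, 6))
print(total_reward(0.8, 1.0, actions, gt, collided=False))  # ≈ 2.8 with these toy inputs
```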
Results: The authors compared their system to a version that was trained on the same data except the reasoning datasets. In 75 simulated scenarios, the reasoning model experienced “close encounters” (distance unspecified) with other vehicles 11 percent of the time, down from the non-reasoning model’s 17 percent.
Why it matters: Chain-of-thought reasoning is useful for robots. Unlike earlier vision-language-action models that use reasoning, Alpamayo-R1 was trained not only to improve performance but also to match its actions to its reasoning. This made the model’s reasoning both more effective and more interpretable. In case of a mishap, an engineer can review the system’s reasoning to understand why it made a particular decision and then adapt training or inference to avoid similar outcomes in the future.
We’re thinking: In the past year, reasoning models have outperformed their non-reasoning counterparts in math, science, coding, image understanding, and robotics. Chain-of-thought turns out to be an extremely useful technique.