How DeepSeek Did It: Researchers describe training methods and hardware choices for DeepSeek’s V3 and R1 models
DeepSeek made headlines late last year, when it built a state-of-the-art, open-weights large language model at a cost far lower than usual. The upstart developer shared new details about its method.
What’s new: Chenggang Zhao and colleagues at DeepSeek described software and hardware choices that reduced memory and processing requirements while building their groundbreaking mixture-of-experts models DeepSeek-R1 and DeepSeek-V3.
Mixture of experts (MoE) basics: The MoE architecture uses different subsets of a model’s parameters to process different inputs. Each MoE layer contains a group of neural networks, or experts, preceded by a routing module that learns to choose which one(s) to use based on the training example. In this way, different experts learn to specialize in different types of input.
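To make the routing idea concrete, here is a minimal PyTorch sketch of an MoE layer with top-k routing. The layer sizes, expert count, and routing details are illustrative, not DeepSeek's configuration.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores each expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                        # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_ids.numel() > 0:
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

tokens = torch.randn(10, 64)
print(MoELayer()(tokens).shape)  # torch.Size([10, 64])
```

Only the selected experts run on a given token, which is why an MoE model can have far more parameters than it activates per input.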
How it works: The authors trained DeepSeek-R1 and DeepSeek-V3 on a cluster of 2,048 Nvidia H800 GPUs, arranged in nodes of 8 GPUs each. An MoE model requires less computation and memory than a comparable dense model, since a given input activates only a portion of the model’s parameters (in DeepSeek-V3’s case, roughly 37 billion of its 671 billion parameters per token). This enabled the authors to train DeepSeek-V3 using about 250 GFLOPs per token, while the dense Qwen 2.5 72B required 394 GFLOPs per token and Llama 3.1 405B required 2,448 GFLOPs per token.
- The authors built a mixed-precision training algorithm to reduce the memory requirements of training MoE models. They used FP8 (8-bit) numbers for computations such as linear transformations, and 16- or 32-bit precision for others, such as computing embeddings. (They say DeepSeek-V3 was the first open LLM to be trained using FP8.) A rough sketch of this idea appears after this list.
- The authors noticed that communication between GPUs within a node was four times faster than communication between nodes. To keep communication fast when routing tokens to experts, they limited the routing algorithm so that each token’s experts span at most 4 nodes (sketched after this list).
- To utilize GPUs more fully, they split each GPU’s input data so the chip performs computation and communication at the same time: it computes attention or MoE layers on one part of the data while simultaneously sending the other part to other GPUs, or gathering it from them, as needed (see the overlap sketch after this list).
- To save memory at inference, the models use multi-head latent attention (MLA), which shrinks the key-value cache relative to other variants of attention. The authors compared their implementation to grouped-query attention (GQA), the variant used in Qwen 2.5 72B and Llama 3.1 405B. Their method (70 kilobytes per token) used far less memory than Qwen 2.5 72B (328 kilobytes per token) or Llama 3.1 405B (516 kilobytes per token); the arithmetic sketch after this list shows roughly where these figures come from.
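To illustrate the mixed-precision idea in the first bullet: below is a rough sketch, not DeepSeek’s actual recipe (which uses fine-grained scaling factors and custom kernels), that rounds the operands of a matrix multiplication to FP8 while keeping the embedding lookup in 16-bit precision. It assumes PyTorch 2.1 or later for the float8 dtype.

```python
import torch

def to_fp8(x, dtype=torch.float8_e4m3fn):
    # Scale into the FP8 representable range, round to FP8, then return the
    # dequantized tensor (simulating FP8 compute with a per-tensor scale).
    scale = x.abs().max().clamp(min=1e-12) / torch.finfo(dtype).max
    x_fp8 = (x / scale).to(dtype)
    return x_fp8.to(torch.bfloat16) * scale

d_model, vocab = 64, 1000
embedding = torch.nn.Embedding(vocab, d_model, dtype=torch.bfloat16)  # kept in 16-bit
w = torch.randn(d_model, d_model, dtype=torch.bfloat16)               # linear-layer weight

tokens = torch.randint(0, vocab, (8,))
h = embedding(tokens)                        # 16-bit embedding lookup
y = to_fp8(h) @ to_fp8(w)                    # matmul on FP8-rounded operands
print(y.dtype, y.shape)
```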
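The second bullet describes limiting how many nodes a token’s experts may occupy. The sketch below shows the general idea; the expert counts, node counts, and node-scoring rule are made up for illustration: score the nodes a token could use, keep only the best few, and pick the top-k experts within them.

```python
import torch

def node_limited_topk(scores, experts_per_node=8, top_k=4, max_nodes=2):
    # scores: (tokens, num_experts) router affinities
    tokens, num_experts = scores.shape
    num_nodes = num_experts // experts_per_node
    per_node = scores.view(tokens, num_nodes, experts_per_node)
    # Rank nodes by the best expert score they contain; keep the top max_nodes.
    node_rank = per_node.max(dim=-1).values.topk(max_nodes, dim=-1).indices
    mask = torch.full_like(scores, float("-inf")).view(tokens, num_nodes, experts_per_node)
    mask.scatter_(1, node_rank[:, :, None].expand(-1, -1, experts_per_node), 0.0)
    # Select the top_k experts only among experts on the allowed nodes.
    allowed = per_node + mask
    return allowed.view(tokens, num_experts).topk(top_k, dim=-1).indices

scores = torch.randn(5, 32)        # 5 tokens, 32 experts spread over 4 nodes
print(node_limited_topk(scores))   # each token's expert ids fall on at most 2 nodes
```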
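The third bullet’s compute/communication overlap can be pictured with CUDA streams. In the sketch below, an asynchronous host-to-device copy stands in for the communication step, which is not what DeepSeek does (its communication is GPU-to-GPU expert dispatch), but the overlap pattern is the same: compute on one chunk while the other chunk is in transit. It assumes a CUDA-capable GPU.

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")
compute_stream = torch.cuda.Stream()
comm_stream = torch.cuda.Stream()

chunk_a = torch.randn(4096, 4096, device=device)
chunk_b_host = torch.randn(4096, 4096, pin_memory=True)
w = torch.randn(4096, 4096, device=device)

with torch.cuda.stream(compute_stream):
    out_a = chunk_a @ w                                     # compute on chunk A ...
with torch.cuda.stream(comm_stream):
    chunk_b = chunk_b_host.to(device, non_blocking=True)    # ... while chunk B is in transit

torch.cuda.synchronize()    # wait for both streams to finish
out_b = chunk_b @ w         # then compute on chunk B
print(out_a.shape, out_b.shape)
```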
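Finally, the per-token memory figures in the last bullet can be roughly reproduced from the models’ published configurations. The layer counts, head counts, and dimensions below are assumptions drawn from those configurations, with 2-byte (BF16) cache entries and kilobytes meaning 1,000 bytes.

```python
BYTES = 2  # BF16 cache entries

def gqa_cache(kv_heads, head_dim, layers):
    # Grouped-query attention caches a key and a value per KV head per layer.
    return 2 * kv_heads * head_dim * layers * BYTES

def mla_cache(latent_dim, rope_dim, layers):
    # Multi-head latent attention caches one compressed latent vector plus a
    # small decoupled positional key per layer.
    return (latent_dim + rope_dim) * layers * BYTES

print(gqa_cache(8, 128, 80) / 1e3)    # Qwen 2.5 72B   -> ~328 KB per token
print(gqa_cache(8, 128, 126) / 1e3)   # Llama 3.1 405B -> ~516 KB per token
print(mla_cache(512, 64, 61) / 1e3)   # DeepSeek-V3    -> ~70 KB per token
```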
Behind the news: DeepSeek-V3 made waves when it was released in December. It outperformed Llama 3.1 405B, the leading open-weights LLM at the time, yet its reported training cost was an astonishing $5.6 million, compared to the usual tens to hundreds of millions of dollars. Some observers were skeptical, pointing out that the $5.6 million figure doesn’t include salaries, data acquisition and annotation, failed training runs, and other research and development costs. In addition, the cost of training DeepSeek-R1 remains unknown.
Why it matters: Traditionally, only companies with large budgets and vast resources could afford to train state-of-the-art models. DeepSeek changed that but didn’t explain how when it released its models. By sharing the details, the company has empowered a wider range of teams to improve the state of the art.
We’re thinking: Shortly after DeepSeek-R1 was released, some engineers claimed — without presenting evidence — that DeepSeek had copied their work. DeepSeek’s disclosure of its training methods should lay to rest any remaining questions about this. Its work was truly innovative, and we applaud its release of key technical details.