Low Precision, High Performance: Researchers at Microsoft and Tsinghua propose a 1.58-bit AI model that rivals full-precision competitors


Reducing the number of bits used to represent each parameter in a neural network from, say, 16 bits to 8 bits shrinks the network’s size and boosts its speed. Researchers took this approach to an extreme: They built a competitive large language model whose weights are limited to three values.

What’s new: Shuming Ma, Hongyu Wang, and colleagues at Microsoft, the University of Chinese Academy of Sciences, and Tsinghua University updated their earlier BitNet b1.58, in which most weight values are limited to -1, 0, or +1, so that it competes with top full-precision models of up to 2 billion parameters. The weights are free to download for noncommercial and commercial use under an MIT license.

Key insight: Linear layers have a big impact on a transformer’s overall speed. They make up large parts of attention layers and fully connected layers, so they account for most computations. The authors’ 2023 work on BitNet showed that using 1-bit weights — whose values are limited to -1 and +1 — makes multiplications very fast (because multiplying by -1 simply flips the sign and multiplying by +1 changes nothing), but performance suffers. They improved on the idea the following year with BitNet b1.58, which allowed weights to be -1, 0, or +1. (Implemented perfectly, this approach allocates approximately 1.58 bits per parameter, since the number of bits needed to represent 3 values is log₂(3) ≈ 1.58.) In this case, multiplying by -1 or +1 still just flips or keeps the sign, and multiplying by 0 zeroes out the value. This ternary setup retains the original BitNet’s low memory requirements, fast training, and fast inference. With careful attention to hyperparameters, it also improves performance.
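The arithmetic savings are easy to see in a toy example. Below is a minimal sketch: weights are rounded to -1, 0, or +1 and a matrix-vector product then reduces to additions, subtractions, and skips. The absmean-style scaling is an assumption for illustration, not a detail taken from the article.

```python
import numpy as np

def ternary_quantize(w: np.ndarray):
    """Quantize a weight matrix to {-1, 0, +1} with a per-matrix scale.

    The absmean-style scale is an illustrative assumption; the article only
    says that weight values are limited to -1, 0, or +1.
    """
    scale = np.abs(w).mean() + 1e-8
    w_ternary = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return w_ternary, scale

def ternary_matvec(w_ternary: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """Matrix-vector product with ternary weights: only adds, subtracts, skips."""
    out = np.zeros(w_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(w_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # zeros contribute nothing
    return out * scale

w = np.random.randn(4, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
w_q, s = ternary_quantize(w)
print(ternary_matvec(w_q, s, x))  # multiplication-free accumulation
print(w_q @ x * s)                # same result via an ordinary matmul
```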

How it works: The authors pretrained the 2-billion-parameter BitNet b1.58, which has an architecture similar to LLaMA, on a dataset of 4 trillion tokens that included web data plus synthetic math problems. To strengthen its reasoning abilities, they fine-tuned it on chat data, instruction-following data, and synthetic instruction-following data. Finally, they fine-tuned the model via DPO (direct preference optimization) to better match human preferences.

  • During training, the authors used a quantized version of the model for forward passes and the non-quantized version for backward passes. Before each forward pass, they quantized the weights in linear layers to -1, 0, or +1. They ran the model, quantizing layer outputs to 8 bits. During backpropagation, they updated the weights of the non-quantized version, copied them, and quantized them before the next forward pass (see the training sketch after this list).
  • For ease of implementation, they ran attention, layer normalization, and other operations in 8-bit precision and stored the gradients and loss in 16 bits.
  • They used a two-phase schedule for the learning rate: an initial high learning rate helped BitNet b1.58 make updates large enough to affect the 1.58-bit weights after quantization — since small changes often had no effect — followed by a sharp drop in the learning rate mid-training to refine all weights on higher-quality data.
  • Similarly, they structured weight decay, which encourages weights to have lower values, in two phases. During the early phase, when the data quality was lower and the learning rate higher, they used strong decay to prevent overfitting. During the second phase, with higher-quality data and a lower learning rate, they disabled weight decay, letting all weights adapt to the data without interference (see the schedule sketch after this list).
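The forward/backward split in the first bullet amounts to quantization-aware training with a straight-through estimator. The PyTorch sketch below is a simplified reading of that procedure, not the authors’ exact implementation: it quantizes only the weights of a linear layer and omits the 8-bit activation quantization for brevity.

```python
import torch
import torch.nn as nn

class QuantLinear(nn.Linear):
    """Linear layer that runs forward passes with ternary weights while keeping
    full-precision latent weights for the backward pass (straight-through
    estimator). A sketch under assumptions; the quantizer choice is illustrative."""

    def forward(self, x):
        scale = self.weight.abs().mean().clamp(min=1e-8)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through: use quantized weights in the forward pass, but let
        # gradients flow to the full-precision weights as if no rounding happened.
        w = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = QuantLinear(8, 4)
y = layer(torch.randn(2, 8))
y.sum().backward()              # gradients land on the full-precision weights
print(layer.weight.grad.shape)  # torch.Size([4, 8])
```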
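The two-phase learning-rate and weight-decay policy in the last two bullets can be summarized as a simple step-based schedule. The numeric values below are placeholders for illustration, not numbers reported by the authors.

```python
def two_phase_hparams(step: int, total_steps: int, peak_lr: float = 1e-3) -> dict:
    """Two-phase schedule: a high learning rate with strong weight decay early,
    then a sharp learning-rate drop and no weight decay on higher-quality data.
    All numeric values are illustrative placeholders."""
    if step < total_steps // 2:  # phase 1: lower-quality data
        return {"lr": peak_lr, "weight_decay": 0.1}
    return {"lr": peak_lr * 0.1, "weight_decay": 0.0}  # phase 2
```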

Results: Across 16 popular benchmarks for language understanding, mathematical reasoning, and coding, BitNet b1.58 was faster and used less memory than competitors, including Alibaba’s Qwen2.5-1.5B, Google’s Gemma-3 1B, Hugging Face’s SmolLM2 1.7B, Meta’s Llama 3.2 1B, and ModelBest’s MiniCPM 2B. It achieved better performance than all except Qwen2.5-1.5B.

  • Running on a laptop, BitNet generated an average of 34.5 tokens per second, whereas Qwen2.5-1.5B generated an average of 15.4 tokens per second.
  • BitNet’s memory requirement was 0.4 GB, while Qwen2.5-1.5B required 2.6 GB.
  • BitNet achieved an average accuracy of 54.19 percent, while Qwen2.5-1.5B achieved 55.23 percent. SmolLM2 1.7B was next-best (48.7 percent).
  • BitNet also outperformed a 4-bit quantized version of Qwen2.5-1.5B (52.15 percent average accuracy).

Why it matters: Quantizing an LLM to a few bits is not as simple as applying the current best practices for full-precision models. It demands rethinking LLM training, down to hyperparameter details like learning rate and weight decay. Even these seemingly small changes can have a large impact on final performance. By delving into these nuances, the authors provide a guide for how to ensure good performance from low-precision models.

We’re thinking: This work makes more than a bit of progress!