Recipe for Smaller, Capable Models: Mistral uses cascade distillation on Mistral Small 3.1 to build the Ministral 3 family


Mistral compressed Mistral Small 3.1 into much smaller versions, yielding a family of relatively small, open-weights, vision-language models that perform better by some measures than competing models of similar size. The method combines pruning and distillation.

What’s new: Mistral AI released weights for the Ministral 3 family in parameter counts of 14 billion, 8 billion, and 3 billion. Each size comes in base, instruction-tuned, and reasoning variants. The team detailed its recipe for distilling the models in a paper.

  • Input/output: Text and images in (up to 256,000 tokens; up to 128,000 tokens for reasoning variants), text out
  • Architecture: Decoder-only transformer
  • Performance: Ministral 3 14B Base (14 billion parameters) closely matches Mistral Small 3.1 Base (24 billion parameters), with the smaller models close behind
  • Features: Tool use, languages including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic
  • Availability: Weights free to download under Apache 2.0 license, API access $0.20/$0.20 per million input/output tokens (Ministral 3 14B), $0.15/$0.15 per million input/output tokens (Ministral 3 8B), $0.10/$0.10 per million input/output tokens (Ministral 3 3B)
  • Undisclosed: Training data

How it works: The team built the models using an approach it calls cascade distillation. Starting from a larger parent, they alternately pruned it (removing less-important parameters) and distilled it (training the smaller model to mimic the larger model's outputs), producing progressively smaller children.

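As a rough illustration of the distillation half of that loop, here is a minimal PyTorch sketch of a standard knowledge-distillation objective, in which the pruned student learns to match the teacher's next-token distributions. The temperature, loss weighting, and Hugging Face-style `.logits` outputs are conventional assumptions, not Mistral's published recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions.

    Both tensors have shape (batch, seq_len, vocab_size). The temperature and
    the T^2 scaling are conventional distillation choices, assumed here.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# One training step: a frozen teacher (e.g., Mistral Small 3.1) provides targets
# for the pruned student. The `.logits` attribute assumes Hugging Face-style
# model outputs.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# distillation_loss(student_logits, teacher_logits).backward()
```
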
  • The team pruned Mistral Small 3.1 (24 billion parameters) to create Ministral 3 14B, which became the starting point for Ministral 3 8B, and so on.
  • They pruned by removing the layers that changed their input least (sketched in code after this list), then reduced the size of the models' internal representations and the width of their fully connected layers.
  • They then trained each pruned model to mimic Mistral Small 3.1. During pretraining, mimicking Mistral Small 3.1 produced better results than mimicking the larger, more capable Mistral Medium 3 (parameter count undisclosed). During fine-tuning, however, the pruned models did benefit from learning to mimic Mistral Medium 3.
  • To fine-tune the models to follow instructions, the team first trained them on examples of desired behavior, then refined them using ODPO, a technique that uses an LLM to compare better and worse responses and steer the model toward the preferred ones.
  • To produce reasoning variants, the team trained the models on examples of step-by-step reasoning in mathematics, coding, multilingual tasks, tool use, and visual reasoning, then applied GRPO (sketched below) to improve performance further.
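
A minimal sketch of the layer-pruning heuristic mentioned above: score each transformer block on a small calibration batch by how much it changes its input, then drop the blocks that change it least. The cosine-similarity criterion, calibration data, and number of blocks to drop are illustrative assumptions; Mistral's exact procedure (and the subsequent width reduction) isn't reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def layer_importance_scores(layers, hidden):
    """Score each block by how much it transforms its input.

    `layers` is a list of modules mapping hidden states to hidden states
    (a simplification of real decoder layers, which also take attention masks
    and rotary position embeddings). A block whose output is nearly identical
    to its input gets a low score and is a candidate for removal.
    """
    scores = []
    for layer in layers:
        out = layer(hidden)
        sim = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1).mean()
        scores.append(1.0 - sim.item())  # low = the block barely changes its input
        hidden = out
    return scores

def drop_least_important(layers, hidden, num_to_drop):
    """Return a new stack with the `num_to_drop` lowest-scoring blocks removed."""
    scores = layer_importance_scores(layers, hidden)
    drop = set(sorted(range(len(layers)), key=scores.__getitem__)[:num_to_drop])
    return nn.ModuleList(layer for i, layer in enumerate(layers) if i not in drop)
```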

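For GRPO, the core idea is group-relative advantage estimation: sample several responses per prompt, score them (for instance, with a verifier that checks whether a math answer is correct), and normalize rewards within each group so that no learned value model is needed. Here is a minimal sketch; the reward functions, group sizes, and clipping details Mistral used are not public.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within each prompt's group.

    rewards: tensor of shape (num_prompts, group_size), one row per prompt,
    one column per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example with made-up verifier scores: 2 prompts, 4 sampled responses each.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = group_relative_advantages(rewards)
# Responses that beat their group's mean get positive advantages and are
# reinforced with a clipped policy-gradient update (as in PPO, but without a
# critic); the rest are pushed down.
```
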
Performance: Ministral 3 14B (version unspecified) ranks ahead of Mistral Small 3.1 and Mistral Small 3.2 on the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks. Mistral compared Ministral 3 with Mistral Small 3.1 and open-weights competitors of comparable size. Ministral 3 14B base outperformed Mistral Small 3.1 by 1 to 12 percentage points on tests of math and multimodal understanding, and tied on Python coding. It also outperformed its parent on GPQA Diamond. Compared to open-weights competitors:

  • Ministral 3 14B: On TriviaQA, Ministral 3 14B base (74.9 percent accuracy) outperformed Qwen 3 14B (70.3 percent accuracy) but trailed Gemma 3 12B (78.8 percent accuracy). On MATH, Ministral 3 14B base (67.6 percent accuracy) exceeded Qwen 3 14B (62 percent accuracy). The two were comparable in other areas. On AIME 2025 (competitive high-school math problems), Ministral 3 14B reasoning achieved 85 percent accuracy, while Qwen 3 14B Thinking achieved 73.7 percent accuracy. 
  • Ministral 3 8B base outperformed the larger Gemma 3 12B on most benchmarks except TriviaQA. 
  • Ministral 3 3B base was competitive with Gemma 3 4B and Qwen 3 4B, but much stronger on MATH.

Why it matters: Cascade distillation offers a way to produce a high-performance model family from a single parent at a fraction of the usual cost. Training the Ministral 3 models required 1 trillion to 3 trillion training tokens compared to 15 trillion to 36 trillion tokens for Qwen 3 and Llama 3 models of similar sizes. Their training runs were also shorter, and their training algorithm is relatively simple. This sort of approach could enable developers to build multiple model sizes without proportionately higher training costs.

We’re thinking: Ministral 3 models can run on generic laptops and smartphones. On-device AI at the edge keeps getting more capable and competitive!