Qwen3-Next Accelerates: Alibaba’s new model uses hybrid attention layers and a sparse MoE architecture for speed and performance
Alibaba updated its popular Qwen3 open-weights models with a number of fresh, speed-boosting tweaks.

What’s new: Alibaba released weights for Qwen3-Next-80B-A3B in Instruct and Thinking variations. They incorporate some of the latest research on alternate forms of attention and mixture-of-experts approaches to use less processing power at inference.
- Input/output: Text in (pretrained on up to 262,144 tokens, extensible to 1 million via the YaRN method), text out (up to 16,384 tokens recommended for Qwen3-Next-80B-A3B)
- Architecture: Mixture-of-experts transformer with mixed attention and Gated DeltaNet layers, 80 billion parameters total, 3 billion parameters active per token
- Performance: Roughly 3 to 10 times faster than Qwen3-32B at inference (depending on input size) while achieving better performance in most tasks
- Download: Weights for Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct available for commercial and noncommercial uses under Apache 2.0 license from HuggingFace and ModelScope (see the loading sketch after this list)
- API: Qwen3-Next-80B-A3B-Thinking $0.50/$6 per million input/output tokens, Qwen3-Next-80B-A3B-Instruct $0.50/$2 per million input/output tokens via Alibaba
- Undisclosed: Specific training methods, training data
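For readers who want to try the weights, here is a minimal loading sketch using Hugging Face transformers. It assumes the published repo id Qwen/Qwen3-Next-80B-A3B-Instruct, a transformers release that supports the Qwen3-Next architecture, and hardware with enough memory for the 80 billion parameters; adjust to your setup.

```python
# Minimal sketch: load Qwen3-Next-80B-A3B-Instruct and generate a reply.
# Assumes a transformers version that includes Qwen3-Next support and
# enough GPU memory for the 80B-parameter weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # repo id on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # shard across available GPUs
)

messages = [{"role": "user", "content": "Summarize the Qwen3-Next architecture in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```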
How it works: The team modified the Qwen3-30B-A3B architecture and training method to increase training efficiency and stability as follows:
- The team increased the number of experts from 128 to 512, so at inference the model uses only 3.7 percent of its total parameters per token, though the number of active parameters is unchanged. (A simplified routing sketch follows this list.)
- They replaced 75 percent of the vanilla attention layers with Gated DeltaNet layers, a form of linear attention that runs slightly slower than Mamba2 but yields better performance. (The underlying delta-rule recurrence is sketched after this list.)
- They replaced the remaining vanilla attention layers with gated attention layers, which add a learned gate after computing attention, effectively letting the model decide which parts of the layer’s output to pass along to subsequent layers. (A gating sketch appears after this list.)
- The team pretrained this modified architecture on 15 trillion tokens of Qwen3’s training dataset to predict multiple tokens at once. (They don’t specify how many but recommend predicting two at a time at inference; a toy multi-token head is sketched below.) They fine-tuned the models using the reinforcement learning method GSPO.
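To make the expert-routing change concrete, here is a simplified sparse mixture-of-experts layer: a router scores every expert per token, but only the top few run, so most parameters stay idle at any given step. The dimensions, expert design, and top-k value below are illustrative choices, not Qwen3-Next’s actual implementation.

```python
# Toy sparse-MoE routing: 512 small experts, only top_k run per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=128, n_experts=512, top_k=10):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize their weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                         # plain loops for clarity, not speed
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

tokens = torch.randn(4, 64)
print(SparseMoE()(tokens).shape)  # torch.Size([4, 64])
```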
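The recurrence below sketches the gated delta rule that Gated DeltaNet-style layers build on: a fast-weight state is decayed, partially erased at the current key, then rewritten with the current value and read with the query. Real implementations run as chunked, hardware-friendly kernels; this slow reference loop is for intuition only.

```python
# Reference (slow) gated delta-rule recurrence, illustrative only.
import torch
import torch.nn.functional as F

def gated_delta_rule(q, k, v, alpha, beta):
    """q, k, v: (seq, d); alpha (decay) and beta (write strength): (seq,) in (0, 1)."""
    seq, d = q.shape
    S = torch.zeros(d, d)                      # fast-weight state (values x keys)
    I = torch.eye(d)
    outputs = []
    for t in range(seq):
        k_t = F.normalize(k[t], dim=-1)        # unit-norm key
        # Decay the state and erase what was stored at k_t, then write v_t there.
        S = S @ (alpha[t] * (I - beta[t] * torch.outer(k_t, k_t))) \
            + beta[t] * torch.outer(v[t], k_t)
        outputs.append(S @ q[t])               # read with the query
    return torch.stack(outputs)

seq, d = 8, 16
q, k, v = (torch.randn(seq, d) for _ in range(3))
alpha, beta = torch.sigmoid(torch.randn(seq)), torch.sigmoid(torch.randn(seq))
print(gated_delta_rule(q, k, v, alpha, beta).shape)  # torch.Size([8, 16])
```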
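One plausible reading of the gated attention layers, assuming a sigmoid gate computed from the layer input that scales the attention output elementwise before the output projection (the class and parameter names here are invented for illustration):

```python
# Sketch of output gating applied to a standard attention layer.
import torch
import torch.nn as nn

class GatedAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(d_model, d_model)  # learned gate from the layer input
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                         # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        g = torch.sigmoid(self.gate(x))           # per-dimension gate in (0, 1)
        return self.out(g * attn_out)             # the gate decides what passes onward

x = torch.randn(2, 10, 64)
print(GatedAttention()(x).shape)  # torch.Size([2, 10, 64])
```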
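And a toy version of multi-token prediction: besides the usual next-token head, a second head is trained to predict the token after next, which can later be used to draft two tokens per step. Alibaba has not documented its multi-token-prediction module in this detail, so treat the shapes and loss below as a conceptual sketch.

```python
# Toy multi-token prediction: two output heads predict tokens t+1 and t+2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHead(nn.Module):
    def __init__(self, d_model=64, vocab=1000):
        super().__init__()
        self.next_head = nn.Linear(d_model, vocab)    # predicts token t+1
        self.next2_head = nn.Linear(d_model, vocab)   # predicts token t+2

    def forward(self, hidden):                        # hidden: (batch, seq, d_model)
        return self.next_head(hidden), self.next2_head(hidden)

def mtp_loss(logits1, logits2, tokens):
    """Average cross-entropy against the t+1 and t+2 targets."""
    loss1 = F.cross_entropy(logits1[:, :-2].flatten(0, 1), tokens[:, 1:-1].flatten())
    loss2 = F.cross_entropy(logits2[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
    return 0.5 * (loss1 + loss2)

hidden = torch.randn(2, 12, 64)
tokens = torch.randint(0, 1000, (2, 12))
print(mtp_loss(*MultiTokenHead()(hidden), tokens))
```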
Results: Qwen3-Next models were faster than Qwen3-30B-A3B and Qwen3-32B in Alibaba’s tests. They performed in the middle of the pack in independent tests.
- Qwen3-Next showed notable speed at inference, especially with large inputs. Given 4,000 tokens of input, Qwen3-Next generated tokens as fast as Qwen3-30B-A3B and three times faster than Qwen3-32B. Given 128,000 tokens of input, it was three times faster than Qwen3-30B-A3B and 10 times faster than Qwen3-32B. Qwen3-Next also trained much faster: 90.7 percent faster than Qwen3-32B and 87.7 percent faster than Qwen3-30B-A3B.
- According to the Artificial Analysis Intelligence score (an average of 10 popular benchmarks that test general knowledge, math, and coding), Qwen3-Next-80B-A3B-Thinking turned in middling performance compared to leading reasoning LLMs. It outperformed Gemini 2.5 Flash Thinking and Z.ai GLM 4.5 but underperformed Anthropic Claude 4 Sonnet, Gemini 2.5 Pro, and OpenAI GPT-5.
- Similarly, Qwen3-Next-80B-A3B-Instruct scored in the middle of the pack compared to leading non-reasoning LLMs. It outperformed OpenAI GPT-4.1, tied with DeepSeek-V3.1, and underperformed the much larger Moonshot Kimi K2.
Behind the news: Since transformers gained traction, researchers have been working to design faster variants of attention and new layers (like Mamba). However, the resulting models tend to be limited in size and performance relative to the state of the art when the innovations were proposed, sometimes because adapting them to existing GPU hardware is difficult. Qwen3-Next takes advantage of recent research without these limitations. It outperforms current large and popular models, potentially pointing a way toward future LLM architectures.
Why it matters: Qwen3-Next offers a recipe for faster inference without compromising performance. Mixture-of-experts architectures enable models to learn more while using fewer parameters at inference, increasing throughput. Swapping vanilla attention for more-efficient layers boosts throughput further, especially as context lengths increase. Predicting multiple tokens at once provides an additional edge.
We’re thinking: Rapidly rising demand for cheaper and faster token generation is pushing more teams to tune mixture-of-experts architectures so they use fewer active parameters. Such techniques will continue to grow in importance as demand for inference increases.