Mixture of Video Experts: Alibaba’s Wan 2.2 video models adopt a new architecture to sort noisy from less-noisy inputs

The mixture-of-experts approach that has boosted the performance of large language models may do the same for video generation.

Three AI-generated video clips: a man vaulting over a moving car, a gymnast flipping on a plane wing, and a rabbit ice skating in pink boots.

What’s new: Alibaba released Wan 2.2, an open-weights family of video generation models that includes versions built on a novel mixture-of-experts (MoE) flow-matching architecture. Wan2.2-T2V-A14B generates video from text input, Wan2.2-I2V-A14B generates video from images, and Wan2.2-TI2V-5B generates video from either text or images. At 5 billion parameters, Wan2.2-TI2V-5B runs on consumer GPUs.

  • Input/output: Wan2.2-T2V-A14B: text up to 512 tokens in, video up to 5 seconds out (30 frames per second, up to 1280x720 pixels per frame). Wan2.2-I2V-A14B: images up to 1280x720 pixels in, video up to 5 seconds out (30 frames per second, up to 1280x720 pixels per frame). Wan2.2-TI2V-5B: text up to 512 tokens and/or images up to 1280x704 pixels in, video up to 5 seconds out (24 frames per second, 1280x704 pixels per frame).
  • Architecture: UMT5 transformer to encode text; 3D convolutional variational autoencoder (VAE) to encode and decode images; flow-matching model to generate output, implemented either as an MoE transformer with 27 billion parameters total and 14 billion active per token (Wan2.2-T2V-A14B and Wan2.2-I2V-A14B) or as a standard transformer (Wan2.2-TI2V-5B)
  • Availability: Web interface (free); weights available via HuggingFace and ModelScope for commercial and non-commercial uses under an Apache 2.0 license; API (MoE models only) at $0.02 per second of 480p output and $0.10 per second of 1080p output
  • Undisclosed: VAE parameter count, training data, differences in training methods between Wan 2.2 and the earlier Wan 2.1

How it works: The team pretrained the VAE to encode and decode images. They pretrained the flow-matching model, given a video embedding from the VAE with noise added and a text embedding from UMT5, to remove the noise over several steps.
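To make the denoising objective concrete, here is a minimal sketch of one flow-matching training step, assuming a standard rectified-flow formulation; the function name and the model's call signature below are illustrative assumptions, not Alibaba's code.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(model, video_latent, text_emb):
    """One hypothetical training step: teach the model to predict the
    velocity that carries a noisy latent back toward the clean latent."""
    noise = torch.randn_like(video_latent)
    # Sample a random time in [0, 1]; t=1 is pure noise, t=0 is the clean latent.
    t = torch.rand(video_latent.shape[0], device=video_latent.device)
    t_ = t.view(-1, *([1] * (video_latent.dim() - 1)))
    # Linearly interpolate between the clean latent and noise (rectified flow).
    noisy_latent = (1 - t_) * video_latent + t_ * noise
    # The target velocity points from the clean latent toward the noise.
    target_velocity = noise - video_latent
    predicted_velocity = model(noisy_latent, t, text_emb)
    return F.mse_loss(predicted_velocity, target_velocity)
```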

  • The MoE model has two experts: one for very noisy inputs and one for less noisy inputs. The high-noise expert lays out the objects and their positions across a video, while the low-noise expert handles fine details.
  • To determine which expert to use, the model computes the signal-to-noise ratio of the noisy embedding. Specifically, it starts with the high-noise expert, determines the time step at which the proportion of noise has fallen by half, and switches to the low-noise expert after that time step.
  • At inference, the VAE embeds an input image and UMT5 embeds input text, where applicable. The model concatenates the image embedding, if present, with an embedding of noise. Given the noisy embedding and the text embedding, the flow-matching model removes noise over several steps, as sketched below. Finally, the VAE decodes the denoised embedding to produce the video output.
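In code, the noise-based expert switch comes down to comparing the current denoising time step against a fixed boundary. The sketch below assumes a rectified-flow sampler with a simple Euler integrator; `select_expert`, `t_boundary`, and the experts' call signature are illustrative stand-ins rather than Wan 2.2's actual interface.

```python
import torch

def select_expert(t, t_boundary, high_noise_expert, low_noise_expert):
    # Early, high-noise steps go to the expert that lays out objects and motion;
    # later, low-noise steps go to the detail-refining expert.
    return high_noise_expert if t >= t_boundary else low_noise_expert

@torch.no_grad()
def sample(latent, text_emb, high_noise_expert, low_noise_expert,
           t_boundary=0.5, num_steps=50):
    # latent starts as noise (optionally concatenated with an image embedding).
    # Integrate the learned velocity field from t=1 (pure noise) to t=0 (clean).
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=latent.device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        expert = select_expert(t.item(), t_boundary,
                               high_noise_expert, low_noise_expert)
        velocity = expert(latent, t.expand(latent.shape[0]), text_emb)
        latent = latent + (t_next - t) * velocity  # Euler step toward the clean latent
    return latent  # the VAE decodes this into video frames
```

Because a single expert handles each step, only 14 billion of the MoE models' 27 billion parameters are active at a time, which keeps per-step compute close to that of a dense 14-billion-parameter model.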

Results: Results for Wan 2.2 are limited. The team shared only the performance of the MoE models on a proprietary benchmark, Wan-Bench-2.0, whose mechanics, categories, and units it has not yet described. The team compared Wan2.2-T2V-A14B to competitors including Bytedance Seedance 1.0, Kuaishou KLING 2.0, and OpenAI Sora.

  • For esthetic quality, Wan2.2-T2V-A14B (85.3) outperformed second-best Seedance 1.0 (84.3).
  • It also achieved the highest scores for dynamic output, rendered text, and prompt-based control of the camera.
  • For video fidelity, Wan2.2-T2V-A14B (73.7) came in second to Seedance (81.8).

Behind the news: Open models for video generation have been proliferating. The past year alone has brought Mochi, HunyuanVideo, LTX-Video, pyramid-flow-sd3, CogVideoX, and more.

Why it matters: MoE architectures have become popular for their superior performance in text generation. Typically, the expert(s) to process a given input are selected either by a router that learns which expert(s) work best for a given token or by a rule based on the type of input data. This work is closer to the latter: the model selects the appropriate expert based on how much noise remains in the input.
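For contrast, a learned router can be as simple as the following minimal sketch (the class name and shapes are illustrative); Wan 2.2's scheme replaces this learned gate with the fixed time-step comparison sketched above.

```python
import torch.nn as nn

class TokenRouter(nn.Module):
    """Learned top-1 routing, as in many MoE language models: a small linear
    gate scores the experts for each token, and the highest-scoring expert runs."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, tokens):  # tokens: (batch, seq_len, d_model)
        return self.gate(tokens).argmax(dim=-1)  # expert index per token
```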

We’re thinking: Video generation is exploding! Proprietary systems generally have made deeper inroads into professional studios, but open models like this one show great promise.