The Re-Opening of OpenAI: GPT-OSS, OpenAI’s first open-weights release since GPT-2, arrives in 120 billion and 20 billion parameter versions

The “open” is back in play at OpenAI.

Table: Accuracy and Elo scores of gpt-oss-120b and gpt-oss-20b across benchmarks and tool use.

What’s new: OpenAI released its first open-weights model since 2019’s GPT-2. The gpt-oss family comprises two mixture-of-experts (MoE) models, gpt-oss-120b and gpt-oss-20b, that are designed for agentic applications and free to use and modify.

  • Input/output: Text in (up to 128,000 tokens), text out (up to 33,000 tokens)
  • Architecture: gpt-oss-120b: MoE transformer, 117 billion parameters total, 5.1 billion parameters active per token; gpt-oss-20b: MoE transformer, 21 billion parameters total, 3.6 billion parameters active per token
  • Performance: Generally ahead of o3-mini, behind o3 and o4-mini
  • Availability: Web demo (free), weights available for commercial and noncommercial use under Apache 2.0 license
  • Features: Adjustable chain-of-thought reasoning effort (high, medium, or low), full access to the chain of thought, and tool use (see the sketch after this list)
  • Undisclosed: Details of training data and methods
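
For readers who want to try the weights, here’s a minimal sketch of running gpt-oss-20b locally with Hugging Face’s transformers library. The checkpoint ID openai/gpt-oss-20b reflects the published weights, but the exact system-prompt convention for setting reasoning effort (“Reasoning: high”) is an assumption based on OpenAI’s description, so check the model card before relying on it.

```python
# Minimal sketch: running gpt-oss-20b locally via Hugging Face transformers.
# Assumes the published checkpoint ID "openai/gpt-oss-20b" and that reasoning
# effort is selected with a "Reasoning: <level>" phrase in the system prompt.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep the precision the checkpoint ships with
    device_map="auto",    # spread layers across available GPU(s)
)

messages = [
    {"role": "system", "content": "Reasoning: high"},  # high | medium | low
    {"role": "user", "content": "How many prime numbers are less than 100?"},
]
print(generator(messages, max_new_tokens=512)[0]["generated_text"])
```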

How it works: The team pretrained the gpt-oss models on trillions of tokens of text, including general knowledge, coding, math, and science. Fine-tuning focused on reasoning and tool use.

  • The team quantized the weights in the MoE layers to 4.25 bits per parameter. Since 90 percent or more of the parameters fall within MoE layers, this step enables gpt-oss-120b to run on a GPU with 80 gigabytes of memory and gpt-oss-20b to run on a GPU with 16 gigabytes of memory (see the back-of-the-envelope estimate after this list).
  • They fine-tuned the models to generate a chain of thought via supervised fine-tuning and reinforcement learning, a method similar to that used to fine-tune OpenAI o3.
  • During fine-tuning, they trained the models to support three reasoning levels by inserting phrases like “Reasoning: low” into prompts.
  • Similarly, they fine-tuned them to search the web, execute Python code, and use arbitrary tools.
  • They also trained the models to refuse requests for hate speech, instructions for committing crimes, recipes for hazardous substances, and the like. In internal tests designed to measure risky behavior, gpt-oss-120b fell short of “high capability” in biology and cybersecurity even after it was fine-tuned for those domains.
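
To see how 4.25-bit MoE weights produce those memory footprints, here’s a rough back-of-the-envelope estimate for gpt-oss-120b. It assumes the roughly 10 percent of parameters outside the MoE layers stay in 16-bit precision and ignores activations and the KV cache, so treat it as a sketch rather than an exact accounting.

```python
# Back-of-the-envelope memory estimate for gpt-oss-120b.
# Assumptions: ~90% of the 117B parameters sit in MoE layers quantized to
# 4.25 bits each; the remaining ~10% stay in 16-bit precision; activations
# and KV cache are ignored.
total_params = 117e9
moe_fraction = 0.90

moe_gb = total_params * moe_fraction * 4.25 / 8 / 1e9          # ~55.9 GB
non_moe_gb = total_params * (1 - moe_fraction) * 16 / 8 / 1e9  # ~23.4 GB

print(f"MoE layers:     {moe_gb:.1f} GB")
print(f"Non-MoE layers: {non_moe_gb:.1f} GB")
print(f"Total:          {moe_gb + non_moe_gb:.1f} GB")  # ~79 GB, fits an 80 GB GPU
```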

Results: Set to high reasoning effort, the models generally landed between o3-mini on the low end and o3 and o4-mini on the high end in OpenAI’s tests. Unless otherwise noted, results for OpenAI models come from OpenAI’s reporting, and results for DeepSeek-R1 come from DeepSeek’s report on the latest update of that model.

  • Using tools to solve competition math problems from AIME 2024, gpt-oss-120b (96.6 percent accuracy) and gpt-oss-20b (96 percent) exceeded o3 (95.2 percent) but fell short of o4-mini (98.7 percent).
  • Answering science questions on GPQA Diamond without tools, gpt-oss-120b (80.1 percent accuracy) outperformed o3-mini (77 percent) but underperformed o3 (83.3 percent) and o4-mini (81.4 percent). The smaller gpt-oss-20b (71.5 percent) came in last among the OpenAI models tested. These scores put gpt-oss behind Grok 4 (87.7 percent), Gemini 2.5 Pro (84.4 percent), and the latest update of DeepSeek-R1 (81.3 percent), according to Artificial Analysis.
  • On the retail portion of Tau-Bench, a test of agentic tool use, gpt-oss-120b (67.8 percent accuracy) finished above o3 (65.6 percent) and below o4-mini (70.4 percent); all three outperformed DeepSeek-R1 (63.9 percent). The smaller gpt-oss-20b (54.8 percent) came in well below the rest.

Behind the news: Founded in 2015 as a nonprofit corporation, OpenAI initially was devoted to open source development on the theory that AI would produce greater benefits and advance more safely if members of the community at large could inspect, use, and improve upon each other’s work. However, in 2019, the high cost of building cutting-edge AI models led the organization to form a for-profit subsidiary, and it stopped releasing large language model weights (although it continued to publish weights for models such as CLIP, which produces similar embeddings for related images and text, and Whisper, a speech-to-text engine).

Why it matters: Businesses, developers, and users have a variety of reasons to choose models with open weights, including lower cost, greater control, and the ability to modify them as they wish. OpenAI’s turn away from open source cleared the way for other teams to capture the market for open offerings, and it now returns to a very different landscape. Meta jumped into the breach with its Llama models, as did the Allen Institute for AI, Google, and others. Lately, developers in China such as Alibaba (Qwen3), DeepSeek (DeepSeek-R1), Moonshot (Kimi K2), and Z.ai have taken the lead. For developers, the gpt-oss family offers free access to technology designed by an extraordinary team of innovators. For OpenAI, it’s an opportunity to win over the broad range of developers and users who prefer open models to closed ones.

We’re thinking: A vibrant open source community is vital to AI’s ongoing progress! Every open model holds valuable knowledge and functionality.