Researchers Confirm Reasoning Models That Generate More Tokens Have A Bigger Environmental Footprint

In the era of reasoning models, delivering better answers to questions has an environmental cost. A new study quantifies the impact.

What’s new: Researchers estimated the emissions of carbon dioxide and other heat-trapping gases associated with using 14 open-weights large language models. (The information needed to study closed models is not publicly available.) Reasoning, total tokens generated, and accuracy on question-answering benchmarks were associated with higher greenhouse-gas emissions, according to findings by Maximilian Dauner at Munich Center for Digital Sciences and AI and Gudrun Socher at HM Hochschule München University of Applied Sciences.

How it works: The authors tested models of various sizes, with and without reasoning capabilities, using questions that required short and long answers.

The authors tested Meta’s non-reasoning models Llama 3.1 (8 billion and 70 billion parameters) and Llama 3.3 (70 billion parameters); Alibaba’s non-reasoning models Qwen and Qwen 2.5 (7 billion and 72 billion parameters); Deep Cogito, which has reasoning and non-reasoning modes (8 billion and 70 billion parameters); and the reasoning model DeepSeek-R1 (7 billion, 8 billion, and 70 billion parameters).
Each model answered 100 MMLU questions in each of five subjects (philosophy, world history, international law, abstract algebra, and mathematics). The questions took two forms: multiple-choice with single-word answers and prompts that elicited open-ended responses. OpenAI’s o4-mini judged the open-ended responses.
The authors ran the models on an Nvidia A100 GPU with 80 gigabytes of memory and measured the amount of energy used by the chip. They multiplied the energy consumption in kilowatt-hours by a global average (480 grams of CO₂-equivalent per kilowatt-hour) to determine the resulting emissions.
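
The conversion in the last step is a single multiplication. Here is a minimal sketch, assuming energy is already measured in kilowatt-hours; the function and constant names are illustrative and do not come from the authors' code.

```python
# Emission factor quoted in the article: a global average of
# 480 grams of CO2-equivalent per kilowatt-hour.
GLOBAL_AVG_G_CO2E_PER_KWH = 480.0


def emissions_g_co2e(energy_kwh: float,
                     factor: float = GLOBAL_AVG_G_CO2E_PER_KWH) -> float:
    """Convert measured energy (kWh) into grams of CO2-equivalent."""
    if energy_kwh < 0:
        raise ValueError("energy_kwh must be non-negative")
    return energy_kwh * factor


def energy_kwh_for_emissions(grams_co2e: float,
                             factor: float = GLOBAL_AVG_G_CO2E_PER_KWH) -> float:
    """Invert the conversion: energy (kWh) implied by a given emissions figure."""
    if grams_co2e < 0:
        raise ValueError("grams_co2e must be non-negative")
    return grams_co2e / factor


if __name__ == "__main__":
    # 2,000 grams is the upper end of the range reported for top performers.
    print(f"{energy_kwh_for_emissions(2000.0):.2f} kWh")
```

Under this factor, 2,000 grams of CO₂-equivalent corresponds to a little over 4 kilowatt-hours of GPU energy. Passing a different `factor` shows how the estimate would change for a cleaner or dirtier grid.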

Results: The authors found a clear trade-off: Reasoning raised output accuracy, but it also raised the number of tokens generated and, with it, greenhouse-gas emissions.

The top-performing models achieved around 84 percent to 91 percent accuracy while producing around 1,300 grams to 2,000 grams of CO₂-equivalent emissions per 1,000 questions (500 multiple-choice questions and 500 open-ended questions). By contrast, the smallest model achieved less than 35 percent accuracy and produced less than 30 grams of emissions.
Deep Cogito’s emissions increased by a factor of roughly 4 to 6 when reasoning was enabled. For example, the 8 billion-parameter version produced around 372 grams of CO₂-equivalent with reasoning versus around 56 grams without reasoning.
Open-ended responses resulted in still greater emissions. On average, models produced more than 3 times the emissions when answering open-ended questions (345.55 grams) that they did when answering multiple-choice questions (109.52 grams).
Deep Cogito with 70 billion parameters bucked the trend. With reasoning enabled, it achieved the highest overall accuracy (84.9 percent) while emitting around 34 percent fewer grams than DeepSeek-R1 with 70 billion parameters (78.9 percent accuracy). This result suggests that energy efficiency can vary dramatically among reasoning models.
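
The multipliers in these results can be checked with simple division. The sketch below uses only the rounded figures quoted in this article, so the ratios are approximate; the variable names are illustrative.

```python
def ratio(larger: float, smaller: float) -> float:
    """Return how many times `larger` exceeds `smaller`."""
    if smaller <= 0:
        raise ValueError("smaller must be positive")
    return larger / smaller


# Deep Cogito, 8 billion parameters (grams CO2-equivalent, rounded in the article).
COGITO_8B_REASONING_G = 372.0
COGITO_8B_NO_REASONING_G = 56.0

# Average emissions by question format (grams CO2-equivalent).
OPEN_ENDED_AVG_G = 345.55
MULTIPLE_CHOICE_AVG_G = 109.52

reasoning_ratio = ratio(COGITO_8B_REASONING_G, COGITO_8B_NO_REASONING_G)
open_ended_ratio = ratio(OPEN_ENDED_AVG_G, MULTIPLE_CHOICE_AVG_G)

if __name__ == "__main__":
    print(f"reasoning vs. no reasoning: {reasoning_ratio:.1f}x")
    print(f"open-ended vs. multiple-choice: {open_ended_ratio:.2f}x")
```

With these rounded inputs, the 8 billion-parameter comparison works out to about 6.6 times, and the open-ended comparison to about 3.2 times.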

Yes, but: The authors’ estimates likely overstate real-world emissions. Older GPUs such as the A100 are less energy-efficient than newer ones, and much cloud computing takes place in data centers powered by renewable energy sources that emit less carbon than the global average energy mix. For example, Google and Amazon match their electricity consumption with renewable energy, and Meta has powered its data centers solely by renewable energy since 2020.

Why it matters: The International Energy Agency projects that AI will consume increasing amounts of energy, and thus produce more greenhouse-gas emissions, as companies focus on training and serving ever larger models. Current AI poses a double-barreled challenge: The more accurate a model’s output, (i) the more emissions it will produce and (ii) the more people will query it. Much of the thinking about how to manage this issue has pointed to leaner parameter counts: Smaller models consume less energy. But the authors’ findings instead point to strategic deployment: The right model for the right task. AI providers can reduce emissions by routing inputs to models that can process them both accurately and efficiently, and by limiting outputs to appropriate lengths. These strategies don’t require building new infrastructure or models.
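
One way to picture the routing idea is a selector that picks the lowest-emission model that still meets an accuracy target for the task at hand. The following is a hypothetical sketch, not a method from the study: the `ModelProfile` class, the `pick_model` function, and every number in `EXAMPLE_PROFILES` are invented placeholders for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    name: str
    accuracy: float        # fraction of questions answered correctly, 0..1
    g_co2e_per_1k: float   # grams CO2-equivalent per 1,000 questions


def pick_model(profiles: Sequence[ModelProfile],
               min_accuracy: float) -> Optional[ModelProfile]:
    """Return the lowest-emission model meeting the accuracy target, or None."""
    eligible = [p for p in profiles if p.accuracy >= min_accuracy]
    return min(eligible, key=lambda p: p.g_co2e_per_1k, default=None)


# Placeholder values only; these are NOT measurements from the study.
EXAMPLE_PROFILES = [
    ModelProfile("small-no-reasoning", accuracy=0.60, g_co2e_per_1k=50.0),
    ModelProfile("mid-no-reasoning", accuracy=0.75, g_co2e_per_1k=400.0),
    ModelProfile("large-reasoning", accuracy=0.85, g_co2e_per_1k=1300.0),
]

if __name__ == "__main__":
    choice = pick_model(EXAMPLE_PROFILES, min_accuracy=0.70)
    print(choice.name if choice else "no model meets the target")
```

In practice, the accuracy and emissions figures would need to be measured per task type, since a model that is efficient on multiple-choice questions may not be on open-ended ones.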

We’re thinking: We must continue to work toward improving AI’s energy efficiency and reducing its carbon emissions. That said, in many tasks, using AI produces fewer emissions than other approaches, such as using human labor.