More Affordable Reasoning: Canadian researchers find capping context cuts the cost of long chains of thought
One way to improve a reasoning model’s performance is to let it produce a longer chain of thought. However, attending to ever-longer contexts can become expensive, and making that attention more efficient requires changes to a model’s architecture. Researchers proposed a way to limit the cost of processing long chains of thought with just a bit of training.
What’s new: Delethink is a reinforcement learning (RL) method that trains large language models to periodically truncate their reasoning tokens so the context stays within a fixed maximum length. The authors include Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, and colleagues at Mila, Microsoft, McGill University, ServiceNow Research, Polytechnique Montréal, and Université de Montréal.
Key insight: Reasoning tokens typically accumulate in a large language model’s context window, and attention over that window consumes quadratically more computation as it grows. One way to counter this effect is to train the model to reason within a capped context window: as the model reasons, it can learn to periodically replace its chain of thought with only its latest “thoughts” and then continue.
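A rough back-of-the-envelope comparison (our illustration, not the paper’s analysis) shows why capping helps. With a full context, the t-th reasoning token attends to all t earlier tokens; with a cap of C tokens, it attends to at most C.

```latex
% Rough cost comparison (illustrative only; not from the paper).
% Full-context reasoning: token t attends to all t earlier tokens.
% Capped-context reasoning: token t attends to at most C tokens.
\[
\text{cost}_{\text{full}}(N) \;\approx\; \sum_{t=1}^{N} t \;=\; \frac{N(N+1)}{2},
\qquad
\text{cost}_{\text{capped}}(N) \;\lesssim\; N \cdot C .
\]
% Example: N = 96{,}000 reasoning tokens with a C = 8{,}000-token window gives
% roughly 4.6e9 attended token pairs for the full context versus at most 7.7e8
% for the capped one, and the gap widens linearly as N grows.
```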
How it works: The authors fine-tuned R1-Distill 1.5B, a large language model, on math problems in the DeepScaleR dataset. They used a modified version of the reinforcement learning algorithm GRPO that trained the model to reason in 4,000-token chunks (a code sketch of this loop follows the list):
- Given a math problem, the model generated a chain of thought until it either finished or filled an 8,000-token context window.
- If it didn’t finish its chain of thought, the authors replaced the context with the original query plus the last 4,000 reasoning tokens. Then the model continued generating until it either finished or the context window again held 8,000 tokens.
- They repeated this process until the model had either finished its chain of thought or produced 24,000 reasoning tokens.
- Then the model attempted to solve the problem, receiving a reward for a correct solution.
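Here’s a minimal sketch of that loop in Python. It’s our illustration, not the authors’ code: the generate function, its signature, and the token bookkeeping are hypothetical stand-ins, with the chunk sizes taken from the numbers above.

```python
# Hypothetical sketch of Delethink-style chunked reasoning (not the authors' code).
# `generate` is assumed to return newly produced tokens plus a flag indicating
# whether the model ended its chain of thought.

CONTEXT_CAP = 8_000        # maximum tokens in the context window per chunk
CARRYOVER = 4_000          # reasoning tokens kept when the context is reset
REASONING_BUDGET = 24_000  # total reasoning tokens allowed per problem


def chunked_reasoning(model, query_tokens, generate):
    """Generate a chain of thought in fixed-size chunks, resetting the context."""
    context = list(query_tokens)   # current context window
    reasoning = []                 # all reasoning tokens produced so far

    while len(reasoning) < REASONING_BUDGET:
        # Generate until the chain of thought ends or the window is full.
        new_tokens, finished = generate(
            model, context, max_new_tokens=CONTEXT_CAP - len(context)
        )
        context.extend(new_tokens)
        reasoning.extend(new_tokens)

        if finished:
            break

        # Reset: keep the original query plus the last CARRYOVER reasoning tokens.
        context = list(query_tokens) + reasoning[-CARRYOVER:]

    # The final answer and its reward are computed downstream from `reasoning`.
    return reasoning
```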
Results: The authors compared their fine-tuned R1-Distill 1.5B model to a baseline, the same model fine-tuned with standard GRPO on the same 24,000-token reasoning budget. They tested both models with reasoning budgets of 24,000, 96,000, and 128,000 tokens.
- With a budget of 24,000 tokens, their model matched or surpassed the baseline on all 3 math benchmarks tested. For example, on AIME 2025, Delethink (31 percent accuracy) outperformed the baseline (29 percent accuracy).
- Their model’s performance continued to improve as the authors increased the reasoning budget, while the baseline achieved much smaller gains. For instance, with a budget of 128,000 tokens, their model achieved 35 percent accuracy, while the baseline achieved 30 percent accuracy.
- The authors estimated that training their model with a 96,000-token reasoning budget would cost 7 H100-months, while the baseline would require 27 H100-months.
Why it matters: This work eases the quadratic growth in compute that can make extremely long reasoning infeasible. Other methods, such as linear attention, address the problem by changing the attention mechanism; Delethink instead restructures the reasoning process itself, capping the cost regardless of which attention mechanism a model uses. It opens a path to efficient reasoning over very long chains of thought without new model architectures.
We’re thinking: As the authors mention, most LLMs are pretrained on relatively short contexts. For example, Llama 3 models began pretraining on sequences of roughly 8,000 tokens, which likely made them adept at processing inputs of around that length. Delethink’s strong performance may owe something to this short-context bias in pretraining.