Better Agentic Prompts, Automatically: Researchers devised GEPA, an algorithm that improves agentic systems’ performance by refining their prompts


Honing an agent’s prompt can yield better results than fine-tuning the underlying large language model via reinforcement learning.

What’s new: Lakshya A. Agrawal and colleagues at UC Berkeley, Stanford, BespokeLabs.ai, Notre Dame, Databricks, and MIT developed GEPA, an algorithm that improves the performance of agentic systems by improving their prompts. The authors position it as an efficient alternative to fine-tuning an agent’s large language model via reinforcement learning.

Key insight: Agentic models trained via reinforcement learning typically must take a complicated series of actions to earn a simple reward, calling a large language model multiple times for different purposes, or modules, within the workflow. But a well-designed prompt can take into account the various problems an agent may run into and thus guide the model more efficiently. The trick is to write prompts that anticipate such problems. To accomplish this, a large language model can analyze an agent’s behavior as it responds to a given prompt, identify associations between the prompt and the outcome (for instance, a failed tool call), and compose a more effective prompt.
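In code, this reflective rewrite step might look like the minimal sketch below. The `llm` helper, the trace format, and the wording of the reflection request are illustrative assumptions, not the authors’ implementation.

```python
# A minimal sketch of the reflective prompt-rewrite step described above, assuming a
# generic llm(text) -> str helper. The trace format and prompt wording are illustrative.

def reflect_and_rewrite(llm, current_prompt: str, traces: list[str]) -> str:
    """Ask an LLM to diagnose failures in the agent's traces and propose a better prompt."""
    reflection_request = (
        "You are improving an instruction given to an AI agent.\n"
        f"Current instruction:\n{current_prompt}\n\n"
        "Execution traces (generated text, tool calls, results):\n"
        + "\n---\n".join(traces)
        + "\n\nIdentify patterns that led to poor outcomes (for example, a failed tool "
        "call) and write a revised instruction that anticipates those problems. "
        "Return only the revised instruction."
    )
    return llm(reflection_request)
```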

How it works: The authors used GEPA to hone the performance of agents based on Alibaba’s Qwen3-8B on specific benchmarks. The method iteratively evolves a pool of candidate prompts, beginning with a simple prompt for each module’s LLM call, such as “Respond to the query” or “Ensure the response is correct and adheres to the given constraints [specified in the benchmark inputs].” In each cycle, GEPA selects a prompt, modifies it, and evaluates whether the revision produces better results (see the sketch after the list below).

  • Given each prompt to be fed to the LLM (initially the default prompts, later revised prompts selected for their effectiveness), the agent responds to a random subset of examples from a benchmark’s training set.
  • GEPA selects which prompt to modify, alternating between the various modules. A separate Qwen3-8B instance examines the agent’s traces (generated text, tool calls, and results) and revises the prompt.
  • GEPA evaluates the revised prompt in a two-step process. First, it feeds the revised prompt to the agent along with the examples used previously and the current prompts of the other modules. If the revised prompt improves the agent’s performance, GEPA adds it to the pool of candidate prompts and then scores its performance on each example in the benchmark’s validation set.
  • From the pool, GEPA identifies prompts that achieved the highest score on at least one example. It selects a set of prompts (one for each module) for the next round of revision, prioritizing prompts that excelled on multiple examples.
  • GEPA repeats the previous steps until it has exhausted a predefined processing budget. It chooses the set of prompts that achieved the highest average score across all examples in the validation set.
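Putting these steps together, the overall loop might look like the following sketch. The helpers `agent_score` and `run_agent_and_collect_traces` are hypothetical stand-ins for running the agent and scoring its outputs, the budget is counted in loop iterations, and `reflect_and_rewrite` is the reflective step sketched earlier; this illustrates the procedure described above, not the authors’ code.

```python
import random

# An illustrative sketch of the GEPA loop, assuming two hypothetical helpers:
# agent_score(prompts, examples) runs the agent with the given per-module prompts and
# returns an average task score; run_agent_and_collect_traces(prompts, examples) returns
# the agent's traces (generated text, tool calls, results) on those examples.

def gepa_loop(llm, default_prompts: dict, train_set: list, val_set: list, budget: int) -> dict:
    pool = [default_prompts]  # candidate sets of per-module prompts
    val_scores = [[agent_score(default_prompts, [ex]) for ex in val_set]]
    modules = list(default_prompts.keys())

    for step in range(budget):
        # Identify candidates that score highest on at least one validation example,
        # then sample among them, weighting by how many examples each one leads on.
        lead_counts = {}
        for i in range(len(val_set)):
            best = max(range(len(pool)), key=lambda c: val_scores[c][i])
            lead_counts[best] = lead_counts.get(best, 0) + 1
        indices, weights = zip(*lead_counts.items())
        parent_idx = random.choices(indices, weights=weights, k=1)[0]
        candidate = dict(pool[parent_idx])

        # Alternate over modules, revising one module's prompt per iteration based on
        # traces collected from a small random subset of the training set.
        module = modules[step % len(modules)]
        minibatch = random.sample(train_set, k=min(4, len(train_set)))
        traces = run_agent_and_collect_traces(candidate, minibatch)
        candidate[module] = reflect_and_rewrite(llm, candidate[module], traces)

        # Keep the revision only if it improves performance on the same minibatch,
        # then record its per-example scores on the validation set.
        if agent_score(candidate, minibatch) > agent_score(pool[parent_idx], minibatch):
            pool.append(candidate)
            val_scores.append([agent_score(candidate, [ex]) for ex in val_set])

    # Return the candidate set of prompts with the best average validation score.
    best = max(range(len(pool)), key=lambda c: sum(val_scores[c]) / len(val_set))
    return pool[best]
```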

Results: The authors pitted custom and open-source agents that used GEPA against versions for which Qwen3-8B was fine-tuned on a given benchmark via Group Relative Policy Optimization (GRPO). They measured both the agents’ performance and the number of agent executions required. 

  • Across HotpotQA (questions that require reasoning over multiple paragraphs), IFBench (following instructions), HoVer (verifying facts), and PUPA (which gauges the balance between helpfulness and unwanted sharing of personal information), agents that used GEPA achieved better performance on all four benchmarks.
  • Moreover, they did this with far greater efficiency, requiring up to 35 times fewer agent executions. 

Yes, but: The authors compared GEPA to fine-tuning via reinforcement learning using a single, relatively small model. Questions remain regarding how the results would scale to larger models or generalize to other models, and how GEPA would compare to supervised fine-tuning.

Why it matters: Methodically revising prompts can improve an agent’s performance more than fine-tuning its underlying model via reinforcement learning, while requiring far fewer examples and executions.

We’re thinking: While it’s unclear how this method compares to supervised fine-tuning, the ability to boost agentic performance without reinforcement learning may be especially valuable in low-data situations or where agent executions are expensive.