Improve Agentic Performance with Evals and Error Analysis, Part 2: Best practices for error analysis in agentic AI development, and how LLMs make them easier


Dear friends,

In last week’s letter, I explained how effective agentic AI development needs a disciplined evals and error analysis process, and described an approach to performing evals. This week, I’d like to summarize the core ideas behind error analysis and describe some best practices. Given the rapid pace of improvement in LLMs, when error analysis points to a problem, your options for how to address it are greater than before. Let me explain.

Take the problem of building a basic Deep Research agent that searches the web to write a detailed report on a topic like “recent developments in black-hole science.” An agent might take a sequence of steps to generate the final report, such as (i) use an LLM to generate a handful of web search queries related to the topic, (ii) call a web-search API to get lists of results, (iii) use an LLM to identify the most promising sources to fetch, and (iv) ask the LLM to use these sources to write the report.
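To make the structure concrete, here is a minimal sketch of such a pipeline in Python. The `llm` and `web_search` helpers are hypothetical placeholders for whichever model and search API you use, not calls to any particular library.

```python
# Minimal sketch of a Deep Research pipeline. The helpers `llm` and
# `web_search` are hypothetical placeholders for your model and search API.

def llm(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text output."""
    raise NotImplementedError

def web_search(query: str) -> list[dict]:
    """Placeholder: call a web-search API and return a list of results."""
    raise NotImplementedError

def deep_research(topic: str) -> str:
    # (i) Use an LLM to generate a handful of search queries for the topic.
    queries = llm(f"Write 5 web search queries about: {topic}").splitlines()

    # (ii) Call a web-search API to get lists of results.
    results = [r for q in queries for r in web_search(q)]

    # (iii) Use an LLM to identify the most promising sources to fetch.
    candidates = "\n".join(r["title"] + " " + r["url"] for r in results)
    sources = llm(f"Pick the 10 most promising sources:\n{candidates}")

    # (iv) Ask the LLM to write the report from those sources.
    return llm(f"Write a detailed report on {topic} using:\n{sources}")
```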

If the final report is subpar compared to the work of a human researcher following the same steps, the gap in performance could come from any of the steps. A basic error analysis procedure might involve gathering a sample set of topics where the output is subpar, and reading the results of every step of the workflow — called the traces — to see which step most frequently generated results materially worse than a human would have. This is very valuable for deciding which step to focus on improving.
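One lightweight way to support this is to record every intermediate output as the agent runs, so the traces can be read and annotated later. Below is a minimal sketch; the trace and annotation formats are illustrative assumptions, not a standard.

```python
import json

# Capture traces by recording each step's output as the pipeline runs,
# then read them later and note which step fell materially short of what
# a careful human would have produced.

def run_with_trace(topic: str, steps: dict) -> dict:
    """Run the pipeline step by step, recording every intermediate output.
    `steps` maps step names to functions; each takes the trace so far."""
    trace = {"topic": topic}
    for name, fn in steps.items():
        trace[name] = fn(trace)
    return trace

def save_traces(traces: list[dict], path: str = "traces.jsonl") -> None:
    """Write traces to a JSONL file so they can be read during error analysis."""
    with open(path, "w") as f:
        for t in traces:
            f.write(json.dumps(t) + "\n")

# While reading each trace, annotate which step(s) produced results worse
# than a human would have, e.g.:
# annotation = {"topic": ..., "bad_steps": ["generate_queries"], "notes": "..."}
```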

A common misconception about error analysis is that it takes a lot of work to get started. The key principle is to look at the steps of the workflow and see which steps did a bad job on a given input, often by benchmarking against human-level performance (HLP). Assuming we are automating a task where matching HLP is desirable, the most important thing is to systematically examine traces to understand when the agent is falling short of HLP. And just as we can get started with evals using a quick-and-dirty initial cut (maybe with just a handful of examples) and then iterate to improve, so too with error analysis.

Specifically, it is fine to start by reading one or just a handful of traces informally to get a sense of what might be going wrong. For example, if you see that the web search query terms in your Deep Researcher — step (i) above — frequently make no sense, that points you to an initial area to focus your efforts on improving. As the system matures, you can move incrementally toward more rigorous error analysis. For example, you might eventually end up with a regularly refreshed dataset of thousands of examples where the performance is poor, and carry out rigorous evaluations that show exactly what percentage of the time each of the steps (i) - (iv) contributed to problems with the final output, and also in what specific ways those steps fell short.
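Once you have a larger set of annotated traces, tallying how often each step was at fault is straightforward. The sketch below assumes the illustrative annotation format from above, where each record lists the steps judged responsible for a bad output.

```python
from collections import Counter

# Tally how often each step contributed to a bad final report, given
# annotations like {"topic": ..., "bad_steps": ["generate_queries"]}.
# This annotation format is an assumption for illustration, not a standard.

def step_failure_rates(annotations: list[dict]) -> dict[str, float]:
    counts = Counter(step for a in annotations for step in a["bad_steps"])
    n = len(annotations)
    return {step: round(count / n, 2) for step, count in counts.most_common()}

example = [
    {"topic": "black holes", "bad_steps": ["generate_queries"]},
    {"topic": "fusion energy", "bad_steps": ["select_sources", "write_report"]},
    {"topic": "quantum error correction", "bad_steps": ["generate_queries"]},
]
print(step_failure_rates(example))
# -> {'generate_queries': 0.67, 'select_sources': 0.33, 'write_report': 0.33}
```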

This type of analysis is extremely useful for deciding where to focus your efforts to improve the overall agentic workflow’s performance!

In addition to improving the execution of individual steps, we can change how we decompose a complex task into steps. When it came to pipelines built using machine learning or deep learning rather than LLMs, I found that the structure of the workflow — that is, how you decompose an overall task into a sequence of steps to be carried out — changed rarely. It was a big deal to rearchitect this! But in the past couple of years, because LLMs are improving so rapidly, I see much more rapid iteration on the design of workflows.

For example, one very common pattern is ripping out scaffolding and letting the LLM do more. This is often a good move when you have access to a smarter LLM than you did when you first built the workflow. Perhaps you once used an LLM to clean up downloaded web pages by removing navigational links, ads, extraneous HTML, and the like, before a separate LLM used the cleaned-up pages to write a report. Since LLMs have become smarter, you might decide to skip the clean-up step, which can introduce errors of its own, and feed the messier HTML directly to the final LLM.
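To illustrate, the change can be as small as deleting one stage of the pipeline. The sketch below contrasts the two designs; `llm` is again a hypothetical helper for your model of choice.

```python
# Sketch of "ripping out scaffolding": the old pipeline cleaned each page with
# one LLM call before a second call wrote the report; the new one passes the
# raw HTML straight to a stronger model. `llm` is a hypothetical helper.

def write_report_old(pages: list[str], topic: str, llm) -> str:
    cleaned = [llm(f"Remove ads, navigation links, and extraneous HTML:\n{p}")
               for p in pages]
    return llm(f"Write a report on {topic} using:\n" + "\n\n".join(cleaned))

def write_report_new(pages: list[str], topic: str, llm) -> str:
    # A smarter model can often handle the messy HTML directly, and skipping
    # the clean-up step removes one source of errors.
    return llm(f"Write a report on {topic} using:\n" + "\n\n".join(pages))
```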

Another example: Perhaps a year ago, we used hard-coded rules to decide which web pages to fetch and when to fetch more, but today we might let an LLM-based agent make this decision more autonomously. As LLMs get smarter, I see many teams rearchitecting workflows to remove hard-coded steps or constraints that were previously needed to keep the system from going off the rails. One way to spot opportunities for this is when error analysis shows that a sequence of steps collectively underperforms compared to what a human might do, even though each individual step performs well. This might indicate that the way those steps are carried out is too rigid.
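For instance, a rule like "always fetch the top three results" might give way to letting the model decide which pages to fetch and when it has enough. The sketch below is illustrative; `llm` and `fetch` are hypothetical helpers for your model and HTTP client.

```python
# Sketch of replacing a hard-coded control rule with an LLM decision.
# `llm` and `fetch` are hypothetical helpers for your model and HTTP client.

def gather_sources_old(results: list[dict], fetch) -> list[str]:
    # Hard-coded rule: always fetch the top 3 results, never more.
    return [fetch(r["url"]) for r in results[:3]]

def gather_sources_new(results: list[dict], fetch, llm,
                       max_fetches: int = 10) -> list[str]:
    # Let the model decide which pages to fetch and when it has enough,
    # with a loose cap instead of a rigid rule.
    pages = []
    for r in results[:max_fetches]:
        decision = llm(f"So far I have {len(pages)} pages. Should I also fetch "
                       f"'{r['title']}' ({r['url']})? Answer FETCH, SKIP, or DONE.")
        if decision.strip().startswith("DONE"):
            break
        if decision.strip().startswith("FETCH"):
            pages.append(fetch(r["url"]))
    return pages
```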

I go through many more examples in the Agentic AI course. Check it out if you want to learn more about evals and error analysis.

Keep building!

Andrew