In Defense of AI Evals, for Everyone
— with support from Hamel Husain and Bryan Bischof
Recently I’ve seen a wave of “anti-evals” posts in my feed, e.g., this post, this other post, this other post, and the many comments and quotes branching from them. I don’t want to assume that folks are attacking the idea of evaluation itself; many of these posts come from people I respect, so I take them seriously. In this post I want to lay out what I mean by “evals,” when it actually makes sense to dial rigor up or down, and why I think anti-eval sentiment is harmful for the community.
What do I mean by evals? Hamel and I define it simply as the systematic measurement of application quality. Notice that this doesn’t imply any particular metric or method, nor does it have to be a single number. It doesn’t have to be perfectly accurate either—it can be an estimate. The point is that it is systematic rather than ad hoc: you’re checking quality in a continuous, deliberate way.
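To make “systematic” concrete, here is a minimal sketch in Python of about the smallest thing that still counts: a fixed set of cases, run the same way on every change, producing a number you can track over time. The `run_app` callable and the example checks are hypothetical placeholders, not a prescribed API.

```python
# Minimal sketch of a systematic eval: a fixed set of cases, run identically
# on every change, yielding an estimate you can track over time.
# `run_app` is a hypothetical stand-in for your application, not a real API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def pass_rate(run_app: Callable[[str], str], cases: List[Case]) -> float:
    """Run every case through the app and return the fraction that pass."""
    passed = sum(case.check(run_app(case.prompt)) for case in cases)
    return passed / len(cases)


# Deliberately simple example checks; even a rough estimate like this,
# computed the same way every time, counts as an eval.
CASES = [
    Case("Summarize: the launch moved from Tuesday to Friday.",
         check=lambda out: "Friday" in out),
    Case("Extract the total from: 'Total due: $1,240.50'",
         check=lambda out: "1,240.50" in out or "1240.50" in out),
]
```

None of this requires special tooling; the point is only that the same checks run every time, so changes in the number mean something.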
When people say they “don’t do evals,” they are usually lying to themselves. Every successful product does evals somewhere in the lifecycle. To illustrate, first I’ll describe the lifecycle. At the highest level, there are two phases: pretraining and posttraining. Pretraining is mostly unsupervised; the model is trained to predict the next token over a massive corpus. Posttraining, in contrast, is the supervised phase where the base model is adapted using supervised fine-tuning, reinforcement learning from human feedback, and preference data. The goal is to make the model more useful and aligned for specific applications. Here, task-specific evals come to the forefront. Providers optimize for accuracy across math, science, multiple coding domains (e.g., SWE, web development), instruction following, tool use, long-context retrieval, and more—and report these numbers publicly. They also compete in environments like LMArena, where response quality across chat, image, and web contexts is publicly evaluated. Model providers also have access to large amounts of private data from applications built on their APIs, and perhaps even incorporate this data into posttraining. At the very least, we know they carefully analyze these traces and use them to inform the design of new evals.
So if you’re building something like a coding agent, you’re already benefiting from all the upstream rigor. Coding evals are so heavily represented in posttraining that, in practice, someone else has already done a large part of the evals for you. This is one reason some teams feel like they can get away without doing much evaluation themselves. And when you look closer, you realize that even those teams are evaluating all the time, just not in ways they label as “evals.” Looking at outputs, noticing what feels off, dogfooding your own product, making changes: that is evaluation. It’s error analysis, and it happens continuously and systematically.
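If you want to make that informal loop a bit more legible, one lightweight option (purely illustrative; the JSONL format and field names below are assumptions, not a standard) is to jot a short free-form note on each bad output while you dogfood, then tally the notes into failure modes:

```python
# Illustrative sketch of lightweight error analysis: read traces you have
# annotated by hand while dogfooding, then count the recurring failure modes.
# The JSONL format and the "failure_mode" field are assumptions for this sketch.
import json
from collections import Counter


def tally_failure_modes(path: str) -> Counter:
    """Count hand-written failure labels across annotated traces.

    Expects a JSONL file where each line looks like:
    {"input": "...", "output": "...", "failure_mode": "missed the deadline field"}
    with failure_mode empty or absent for traces that looked fine.
    """
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            trace = json.loads(line)
            label = trace.get("failure_mode", "").strip()
            if label:
                counts[label] += 1
    return counts


# for mode, n in tally_failure_modes("traces_annotated.jsonl").most_common():
#     print(f"{n:4d}  {mode}")
```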
The more interesting question, then, is not whether you do evals, but when you can afford to be less rigorous and when you cannot. In practice, there are two main situations where you can get away with lighter processes. The first is when your task is already well represented in posttraining, as in the coding-agent example above. The second is when you and your team have enough domain expertise and taste to rely on your own dogfooding, and are religious about doing it continuously. If you can look at outputs, reason clearly about what is off, and iterate effectively, you may not need much more than that to make progress.
This is why many successful products look like they are built without evals. They are either in domains where the evals are already baked in by posttraining, or they are driven by teams who can steer by feel because they know the space inside out. Foundation model providers themselves pour enormous amounts of money into evaluation for every new capability area, because they know performance won’t improve without systematic measurement. The companies that specialize in helping them do this—Scale, Snorkel, Mercor, and others—are each valued in the billions of dollars. In my own experience with applications built on top of foundation models (with much less money, lol), evals are especially critical in complex document processing and analysis. Just because a document fits in the context window does not mean the model will complete the task correctly; we have to carefully decompose the task into smaller pieces the model can handle, and then design evals for each of those pieces (see the sketch after this paragraph). One striking example of where rigorous evals matter is in my colleagues’ work on building a database of police use of force and misconduct in California.
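To give a flavor of that decomposition (the step functions and gold labels below are hypothetical, not our actual pipeline), each sub-task gets its own small labeled set and its own score, so a regression is attributable to a specific piece rather than buried in one end-to-end number:

```python
# Hypothetical sketch of per-piece evals for a decomposed document task.
# The step functions and gold labels are placeholders; the point is the shape:
# one small labeled set and one score per sub-task, not a single opaque number.
from typing import Callable, Dict, List, Tuple


def exact_match_accuracy(step: Callable[[str], str],
                         gold: List[Tuple[str, str]]) -> float:
    """Accuracy of a single extraction step against (document, expected) pairs."""
    correct = sum(step(doc) == expected for doc, expected in gold)
    return correct / len(gold)


def eval_pipeline(steps: Dict[str, Callable[[str], str]],
                  gold_sets: Dict[str, List[Tuple[str, str]]]) -> Dict[str, float]:
    """Score each sub-task separately so failures point at a specific piece."""
    return {name: exact_match_accuracy(step, gold_sets[name])
            for name, step in steps.items()}


# report = eval_pipeline(
#     steps={"extract_incident_date": extract_incident_date,
#            "extract_officer_name": extract_officer_name},
#     gold_sets={"extract_incident_date": date_gold,
#                "extract_officer_name": name_gold},
# )
```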
Overall, anti-eval sentiment is damaging. Many people in this community are new to building AI products, and most are looking to build on top of foundation models. They may not fall into the two categories I described earlier—their tasks may not be well represented in posttraining, and their teams may not yet have the experience or instincts to rely on dogfooding alone. Most don’t have backgrounds in data analysis, error analysis, or building processes for continual iteration. For these teams in particular, dismissing evals is harmful because it removes the very tools that would help them understand what is working, what is not, and how to make progress.
I don’t want to assume that people are upset about the existence of an evals course, or that this is what has spurred the recent anti-eval sentiment. Perhaps people are allergic to the idea of an AI evals course, and I can empathize with that. Maybe the allergy comes from a misconception that the course is meant to hand down a rigid, one-size-fits-all prescription for how evaluation should be done. That’s silly—evals live on a spectrum, the same way there isn’t just one way to do ML or data analysis. What a course offers is a suite of techniques that you can adapt to your own problems: e.g., how to tell whether you are bottlenecked by specification or by model capability, how to do error analysis in a way that produces actionable insights, and how to scale that analysis with tools like LLM-as-Judge (sketched briefly below). Some students use just one or two of these techniques, others adopt more, but nearly everyone finds something that changes the way they see their data and their product (you can learn more in this video). If you feel you already have these techniques and your products are doing well, don’t sign up for the course. Our goal isn’t to have a monopoly on evals! We want you to teach these techniques to your colleagues. Teach your own courses! Build evals tools for others! The point is simply to grow the field together by giving more people the vocabulary and tools to evaluate systematically, whatever their setting.
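For the LLM-as-Judge piece mentioned above, here is a hedged sketch of what that scaling step can look like: a binary judge prompt, plus a check of how often the judge agrees with a small set of human labels before you rely on it. The `call_model` callable is a placeholder for whatever client you use, not a real library function.

```python
# Sketch of LLM-as-Judge: a binary judge prompt, plus an agreement check
# against human labels before relying on the judge at scale.
# `call_model` is a hypothetical placeholder for your model client.
from typing import Callable, List, Tuple

JUDGE_PROMPT = """You are reviewing an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer fully address the question without fabricating details?
Reply with exactly one word: PASS or FAIL."""


def judge(call_model: Callable[[str], str], question: str, answer: str) -> bool:
    """Ask the judge model for a binary verdict on one (question, answer) pair."""
    verdict = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")


def judge_agreement(call_model: Callable[[str], str],
                    labeled: List[Tuple[str, str, bool]]) -> float:
    """Fraction of human-labeled (question, answer, is_good) examples
    where the judge's verdict matches the human label."""
    agree = sum(judge(call_model, q, a) == label for q, a, label in labeled)
    return agree / len(labeled)
```

The agreement check is the important part: the judge only replaces manual review to the extent that it matches the labels you would have assigned yourself.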
In the end, evals aren’t about following a rigid philosophy. Sometimes it’s fine to be less rigorous, and sometimes the task demands much more rigor. Anti-eval sentiment misses this point. Whether you call it evals or not, if you are serious about building something that lasts, you are already doing it—and doing it well is what separates products that endure from those that disappear. The more of us who share techniques, teach each other, and build better tools, the stronger the whole community becomes.