May 16, 2026 · The CompletionKit team

What is prompt evaluation?

Prompt evaluation is the practice of scoring a prompt against a fixed dataset of example inputs before you ship it. You gather a set of representative inputs, write down metrics that define what a good output looks like, run the prompt over every input, and grade each result. The output is a score, one you can hold up against last week's score to see whether a change helped or hurt. That is the whole idea: it turns "this feels better" into a number you can actually compare.

It exists because the obvious alternative (reading the outputs yourself) doesn't scale. Eyeballing five responses after a prompt tweak is fine. Eyeballing them after every tweak, across fifty inputs, while staying honest about the cases you didn't think to check, is not. People stop doing it, or they do it badly, and either way the prompt drifts. A fixed dataset and a defined metric make scoring cheap enough to run on every change, which is the only way scoring actually keeps happening.

Evaluation is not observability

The two get conflated constantly, so it's worth being precise. LLM observability watches live production traffic. It tells you what already happened: how long requests took, what they cost, which ones errored, and a trace of each one so you can debug it. It's a rear-view mirror, and a good one; you need it.

Prompt evaluation happens before any of that. It runs against a fixed dataset you control, not live traffic, and it answers a different question: is this version of the prompt safe to ship? It's a test drive, not a rear-view mirror. Observability tells you a regression went out yesterday and is annoying real users. Evaluation tells you the regression exists this morning, while it's still a diff on your screen. Most serious teams run both, but they are different tools solving different problems, and a dashboard of production latency graphs is no substitute for a scored test run.

Evaluation runs before release, against a fixed dataset you control. Observability runs after, against live traffic. Different tools, different questions.

Where the idea came from

Prompt evaluation didn't appear with large language models. It inherits its core move, the held-out test set, from decades of classical machine learning. The rule there is old and unbending: never measure a model on the same data it learned from, because it can score near-perfectly simply by memorising, and that score tells you nothing about the unseen case. So you hold a portion of your data back, untouched, and judge the model only on that. A prompt's evaluation dataset is this held-out set, repurposed: a fixed bank of inputs the prompt has to perform on, kept stable so the score means something across versions.

The specifically prompt-shaped version of the idea crystallised on March 14, 2023, the day OpenAI launched GPT-4 and, alongside it, open-sourced a framework called Evals. Evals let anyone build a dataset of prompts, run a model over it, and measure the quality of the completions, and it gave the activity its now-ubiquitous short name, "evals". There was a telling detail in the launch: OpenAI offered GPT-4 access to people who contributed high-quality evals. They were, in effect, paying for test cases. That's a clear signal of what the field already valued: not the model on its own, but a trustworthy way to measure it. For a deeper look at the grading half of this, see our guide to LLM-as-a-judge.

The anatomy: datasets, runs, and metrics

Three pieces do the work, and the vocabulary is worth getting straight because every prompt-eval tool uses some version of it.

A dataset is your held-out set: a collection of example inputs, each row one case the prompt has to handle. Good datasets over-represent the awkward cases (the empty input, the hostile user, the edge case that caused last quarter's incident) because those are what a prompt change is most likely to break.
A run is one execution of a prompt over every row of the dataset. It produces one output per row, all under the same conditions, so the run as a whole is a single comparable snapshot of how that prompt version behaves.
A metric is one named criterion the outputs are graded against: "stays faithful to the source", "returns valid JSON", "never promises a refund". A run carries several metrics, each scored independently. That independence is the point: a single blended "is this good?" number can tell you something dropped but never what. Separate metrics turn a regression into a pointer straight at the broken behaviour.
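To make the vocabulary concrete, here is a minimal sketch of a dataset, a run, and two independently scored metrics. It is an illustration under stated assumptions, not CompletionKit's API: `Metric`, `run_prompt`, and `fake_model` are hypothetical names, the model is a stub so the example runs offline, and the refund check is a deliberately crude keyword test.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not CompletionKit's API.
@dataclass
class Metric:
    name: str
    check: Callable[[str, str], bool]  # (input, output) -> pass or fail

def is_valid_json(_inp: str, out: str) -> bool:
    try:
        json.loads(out)
        return True
    except json.JSONDecodeError:
        return False

def no_refund_mention(_inp: str, out: str) -> bool:
    # Crude keyword proxy for "never promises a refund".
    return "refund" not in out.lower()

def fake_model(row: str) -> str:
    # Stand-in for a real model call (assumption): breaks on empty input.
    if not row.strip():
        return "Sorry, I did not catch that."
    return json.dumps({"reply": f"Thanks for writing about: {row}"})

def run_prompt(model: Callable[[str], str],
               dataset: List[str],
               metrics: List[Metric]) -> Dict[str, float]:
    """One run: one output per row, then each metric scored on its own."""
    outputs = [model(row) for row in dataset]
    scores = {}
    for metric in metrics:
        passed = sum(metric.check(row, out) for row, out in zip(dataset, outputs))
        scores[metric.name] = passed / len(dataset)
    return scores

# The dataset deliberately includes an awkward case: the empty input.
dataset = ["Where is my order?", "", "I want my money back NOW"]
metrics = [
    Metric("returns valid JSON", is_valid_json),
    Metric("never promises a refund", no_refund_mention),
]
scores = run_prompt(fake_model, dataset, metrics)
print(scores)
```

Because the metrics are scored separately, the result shows which behaviour failed (valid JSON on the empty input) rather than only that something dropped.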

Grading open-ended outputs (summaries, answers, rewrites) is itself a hard problem, since two good responses can be worded completely differently and exact-match comparison falls apart. The common solution is to use a model as the grader, which is the subject of its own article. The docs walk through setting up datasets, runs, and metrics in CompletionKit step by step.
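A small sketch shows why exact-match comparison breaks down on open-ended outputs. The grader below is a deterministic stand-in, not a model: it only checks that every number in the output also appears in the source. In a real setup, a model call would sit behind the same hypothetical `Grader` signature. All names here are assumptions made for illustration.

```python
import re
from typing import Callable, List

source = "The refund window is 30 days."
good_a = "Customers have 30 days to request a refund."
good_b = "Refunds can be requested within 30 days."
bad = "Customers have 90 days to request a refund."
outputs = [good_a, good_b, bad]

def exact_match(output: str, reference: str) -> bool:
    return output.strip() == reference.strip()

# Hypothetical grader signature: (source, output) -> verdict.
Grader = Callable[[str, str], bool]

def stub_grader(src: str, output: str) -> bool:
    # Stand-in for a model grader (assumption): passes when the output
    # introduces no number that is absent from the source.
    return set(re.findall(r"\d+", output)) <= set(re.findall(r"\d+", src))

def grade(outs: List[str], src: str, grader: Grader) -> List[bool]:
    return [grader(src, out) for out in outs]

exact = [exact_match(out, source) for out in outputs]
graded = grade(outputs, source, stub_grader)
print(exact)   # exact match rejects both good paraphrases
print(graded)  # the grader accepts them and rejects the wrong figure
```

Keeping the grader behind a plain callable means the stub can be swapped for a model-backed grader without changing the code that runs the evaluation.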

The build loop and the maintenance loop

The reason prompt evaluation pays off isn't really the first version of a prompt. It's everything after.

Shipping a prompt the first time is the easy part: you wrote it, you tried it, it looked good, it went out. The hard part is the year that follows: the model gets deprecated and you swap it, a customer hits a case you never imagined, you tighten the tone, you add a new instruction. Every one of those changes is a chance to quietly break a case that used to work. Without a dataset to run against, you simply won't know: the new behaviour looks fine, and the regression sits in two rows you didn't re-check.

This is the distinction between a build loop and a maintenance loop. The build loop is getting to the first good version. The maintenance loop is keeping it good through every later change. Prompt evaluation matters in both, but it's indispensable in the second, because that's where regressions hide and where eyeballing fails most quietly. A prompt without an eval dataset isn't finished; it's just untested. The maintenance loop is also the half you can hand off: point a coding agent at the MCP server and it runs the evaluate-revise-rescore cycle for you.
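The maintenance loop comes down to comparing a new run against the previous one, row by row and metric by metric. The sketch below assumes each run is stored as a mapping from a row id to per-metric pass or fail results. That data shape, the row ids, and the function names are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Run = Dict[str, Dict[str, bool]]  # row id -> metric name -> passed

def find_regressions(baseline: Run, candidate: Run) -> List[Tuple[str, str]]:
    """Return (row id, metric) pairs that passed before and fail now."""
    regressions = []
    for row_id, base_metrics in baseline.items():
        cand_metrics = candidate.get(row_id, {})
        for metric, passed in base_metrics.items():
            # A missing result in the candidate run counts as a failure.
            if passed and not cand_metrics.get(metric, False):
                regressions.append((row_id, metric))
    return sorted(regressions)

def pass_rate(run: Run) -> float:
    results = [p for metrics in run.values() for p in metrics.values()]
    return sum(results) / len(results)

baseline: Run = {
    "empty-input": {"valid_json": True, "faithful": True},
    "hostile-user": {"valid_json": True, "faithful": True},
    "long-thread": {"valid_json": True, "faithful": False},
}
candidate: Run = {
    "empty-input": {"valid_json": False, "faithful": True},
    "hostile-user": {"valid_json": True, "faithful": True},
    "long-thread": {"valid_json": True, "faithful": True},
}

regressions = find_regressions(baseline, candidate)
print(regressions)
print(pass_rate(baseline), pass_rate(candidate))
```

In this made-up data the blended pass rate is identical across the two runs, because one fix and one break cancel out. Only the per-row, per-metric comparison surfaces the case that used to work and no longer does.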

FAQ

What is prompt evaluation?

Scoring a prompt against a fixed dataset of example inputs before you ship it. You collect representative inputs, define metrics for what a good output looks like, run the prompt over every input, and grade the results, producing a score you can compare against the previous version.

What's the difference between prompt evaluation and LLM observability?

Observability watches live production traffic and reports what already happened: latency, cost, errors, traces. Prompt evaluation runs before you ship, against a fixed dataset you control, and tells you whether a change is safe to release. A rear-view mirror versus a test drive. Most teams need both.

What is a run and what is a metric?

A run is one execution of a prompt over every row of a dataset, producing one output per row. A metric is a single named criterion the outputs are scored against. A run carries several metrics, each scored independently, so a drop in one of them points straight at what regressed.

Do I need prompt evaluation if I only have one prompt?

Yes. A single prompt is rarely edited only once, and every later tweak, model swap, or new edge case can quietly break a case that already worked. Evaluation earns its keep across the year of changes after the first version, not on the first version itself.


Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
