May 18, 2026 · The CompletionKit team

How to build a prompt evaluation dataset

A prompt evaluation dataset is the fixed set of inputs your prompt is graded against, and the bar any change has to clear before it ships. Build one by collecting real inputs from your production traffic and support backlog, picking enough of them to cover the shapes of request your prompt actually handles, adding the edge cases that have already burned you on purpose, deciding row by row whether you have an expected output or a rubric, and then holding the rows fixed so the score next week is comparable to the score today.

That last part is the one people get wrong. A "dataset" that drifts every time someone adds rows is a vibe, not a benchmark. The whole point of the exercise is that v8 of your prompt and v7 of your prompt are scored on the same inputs, so when the number moves you know it was the prompt that moved it. Lose that and you've spent a week building a system that tells you nothing.

This post walks through how to actually build one. If you want the wider context (what evaluation is, how it sits next to observability, why the loop matters), start with what is prompt evaluation?. If you want the scoring half, read LLM-as-a-judge. This is the dataset half.

Source real inputs, not imagined ones

The single most important property of a dataset is that it looks like the traffic your prompt has to survive. The temptation, on day one, is to sit at a whiteboard and brainstorm twenty inputs that "feel representative". Resist it. Brainstormed inputs are tidy, well-spelled, on-topic, and roughly the length you imagine a user types. Real inputs are none of those things. If you grade on the tidy version and ship to the messy version, your score is a lie.

Three places to look first:

Production logs. If your prompt is live, the inputs it has already been called with are the ground truth for what it has to handle. Pull a few hundred at random from the last month and you have the basis of a dataset in an hour. If you don't log inputs in production yet, start now; even sampling a few percent gives you something to work with quickly.
Support tickets and complaints. Every email to support that includes "the AI said something weird" is a free, hand-labelled hard case. The customer has already done the work of telling you which input broke the prompt and what good would have looked like. Walk the support backlog and copy the inputs that triggered complaints straight into the dataset.
Exports from the system the AI feature replaced. If this prompt is automating something a human or an older tool used to do, the records that human or tool produced are gold. They give you both the inputs and, often, a defensible expected output for the rows where one exists.

Synthetic inputs (ones you or a model generated to fill an obvious gap) have a place, but a small one. Use them to plug holes you've spotted in the coverage of your real inputs, not as the spine of the dataset. A dataset made mostly of synthetic rows tells you how your prompt handles the inputs you imagined, which is exactly the question you don't need answered.

How many rows is enough?

Fewer than you think. Thirty to fifty real, well-chosen rows is a perfectly respectable starting dataset, and most teams overshoot here on the way to never finishing. The goal is not statistical significance on a paper; the goal is to catch regressions before they hit customers. A small dataset that covers the shapes of input you actually see, including the awkward ones, will catch more real bugs than a thousand rows of near-duplicate happy paths.

The right question is not "how many rows do I have?" but "what kinds of input am I missing?". Look across your rows and ask: are all the languages represented that my users speak? All the lengths: one-word queries and three-paragraph ones? All the tones: polite, terse, frustrated? All the formats: questions, commands, fragments, pasted JSON? Coverage of kinds matters more than raw row count. A dataset of 40 rows that hits every kind beats a dataset of 400 rows where 380 of them are mild variations of the same kind.

Grow the dataset deliberately, not opportunistically. The right time to add rows is when something has just gone wrong and you want to make sure it stops going wrong: a customer hit a case the dataset didn't cover, a teammate found a failure during review, the model started saying something odd in production. Every bug becomes a row. That's how the dataset earns its way to a thousand entries over a year, and every one of those entries is load-bearing.

Cover the edge cases on purpose

If you only sample randomly from production, your dataset will look exactly like your production traffic, which means the rare-but-important cases will be rare in the dataset too. That's the wrong distribution for an evaluation set. You want the edge cases overrepresented, because they are exactly the cases a prompt change is most likely to break.

A useful frame: think of your dataset as a portfolio that deliberately overweights the awkward. For a customer-support reply prompt, the portfolio might be:

The boring middle. The standard, on-topic requests that make up the bulk of traffic; enough rows to be confident the prompt still handles the easy case.
Adversarial and off-topic. Prompt-injection attempts, requests for refunds you can't grant, abusive messages, jailbreak attempts. These need their own rows because they are exactly the inputs where a prompt change can silently regress safety.
Out-of-distribution but real. The customer who typed in the wrong language, the one who pasted a JSON blob, the one whose entire message is two characters. These rows are rare in traffic and devastating when they fail.
Things you have already gotten wrong. Every row in this bucket has a story behind it: a complaint, an incident, a "the model said something weird" Slack message. These are the rows that justify the whole exercise.

If you have to pick one bucket to invest in, make it the last one. The cases that have already broken you are the cases most likely to break you again, and they are the cheapest to collect because someone has already done the work of finding them.

You don't sample an eval dataset from traffic, you assemble it like a portfolio. The boring middle stays small on purpose; the awkward buckets are overweighted, with the most rows going to inputs that have already broken you. The split shown is illustrative.

Expected outputs vs. judge-only rows

Some tasks have a right answer. Some don't. The schema of your dataset should reflect that, row by row.

For closed-ended tasks (classification, structured extraction, valid-JSON output, did-the-model-call-the-right-tool) there is an expected output, and you check it exactly. The dataset has an expected column, the run produces an actual column, and the metric is "do they match" or "does the actual fit the schema". This is the cheapest, most reliable kind of scoring and you should use it wherever it applies. A prompt that emits structured data should be graded against expected outputs, not against a judge's vibes.

For open-ended tasks (summaries, replies, rewrites, anything where two good responses can be worded completely differently) there is usually no single canonical answer, and pretending there is one is worse than admitting there isn't. Leave the expected-output column blank and grade with an LLM judge against a written rubric. We covered the mechanics of that in the LLM-as-a-judge post; the relevant point here is that the dataset doesn't owe you an expected output for every row. It owes you inputs that exercise the prompt's job.

A real dataset is usually mixed. Some rows have expected outputs and are graded exactly. Some rows don't and are graded by a judge. Some rows have both: an expected output for the part that's checkable, plus judge-scored metrics for the parts that aren't. Treat the columns as independent and let each row use the ones it needs.

Hold the dataset fixed

This is the rule that turns a pile of rows into a benchmark.

The dataset has one job: to be the same measuring stick this week as last week. If you change which rows are in it between runs, you have no way of knowing whether a score change reflects a better prompt or just an easier set of questions. Scores stop being comparable, the baseline becomes a lie, and the whole loop collapses into a feeling.

In practice, fixing the dataset means three things:

Version it. Treat additions and removals like commits. v3 of the dataset has these 120 rows; v4 has those 120 plus 18 new ones. When you compare run scores, compare them within a dataset version, not across.
Add rows; resist editing or deleting them. A row that exposes a real failure is exactly the row you most want to keep, even (especially) once the prompt has learned to handle it. The temptation to "clean up" the dataset by removing rows that pass is the temptation to delete your safety net. Don't.
Re-baseline when the dataset changes. When you do add a batch of rows, re-run the current prompt over the new version to set a fresh baseline. The number you measure future changes against has to be the number the current prompt scores on the dataset as it stands today, not as it stood three revisions ago.

The way to think about it: the prompt is one versioned artifact, the dataset is another, and a run is the meeting of the two. Both are first-class. Both have history. Both deserve to be inspectable months from now when somebody asks why a number moved.

A starter recipe

If you have nothing today and you want a dataset by the end of the afternoon, this is the shortest path:

Pull 50 real inputs from your production logs or the closest analogue you have. Random sample, no curation yet.
Walk the support backlog for the last quarter and grab every input that triggered a complaint about the AI feature. Add them as rows. Tag them so you know which ones came from incidents.
For each row, decide: is there a defensible expected output? If yes, write it in. If no, leave the column empty and plan to grade with a judge.
Skim what you've got and look for gaps in kinds of input: languages, lengths, tones, formats, adversarial cases. Add a handful of rows per gap, from real sources where you can, synthetic where you must, labelled as synthetic so future-you knows.
Run the current prompt over the dataset and write down the score. That's your baseline. Every change from here on has to clear it.
From now on, every customer-reported bug, every weird production output, every "the model said something dumb" Slack message becomes a row. The dataset grows by incidents, not by sprint planning.

That's the whole job. Real inputs, deliberate edge cases, mixed expected/judge scoring, held fixed across runs, growing only by incidents. A dataset like that is small, alive, and load-bearing, which is the only kind worth maintaining.

FAQ

How many rows should an LLM evaluation dataset have?

Start with 30–50 rows that cover the shapes of input your prompt actually sees, then grow it deliberately as you find new failure modes. A small dataset that is fixed, representative, and includes the edge cases that have already burned you is more useful than a thousand rows of near-duplicate happy paths.

Where do the inputs come from?

Production logs, support tickets, recordings of agent conversations, exports from the system the AI feature replaced. Synthetic inputs fill obvious gaps in coverage but they should not be the foundation. The dataset's job is to reflect how customers actually use the feature, including the awkward ways.

Should every row have an expected output?

Only when one exists. Closed-ended tasks (classification, extraction, valid-JSON checks) get expected outputs and you grade with exact-match or schema validation. Open-ended tasks (summaries, replies, rewrites) usually don't have one canonical answer, so you grade them with an LLM judge against a written rubric and skip the expected-output column.

Why does the dataset have to stay fixed?

Because the baseline is only meaningful relative to a stable measuring stick. If you swap rows in and out between runs, you can't tell whether a score change came from your prompt or from your dataset. Add rows in deliberate, named revisions; treat the dataset as a versioned artifact, the same way you treat the prompt.

Build a dataset free Read about LLM-as-a-judge

Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.

← All posts