June 1, 2026 · The CompletionKit team

What is a judge-only run?

A judge-only run is a scored evaluation of outputs you already have, with no generation step. You drop the outputs into a column of a dataset, point the run at that column, define your metrics, and an LLM judge scores every row against your rubric. Nothing is generated, no prompt is invoked; the run does the second half of an evaluation (the scoring half) on responses produced somewhere else. In the CompletionKit UI this is the "Score existing outputs" option when you set up a run.

The point is that scoring and generating are separable. A normal run does both: it executes a prompt over each input in a dataset and then scores the output. A judge-only run does only the second step, on outputs that came from anywhere: a sample of last week's production traffic, a hand-curated gold set, a competitor's API, an older model you're about to retire, a teammate's spreadsheet of "responses I want a number on". Same rubric, same per-row rationale, same comparable score. No generation.

For the wider picture of what a "run" and a "metric" mean, start with "What is prompt evaluation?". For the scoring half (how an LLM judge actually grades a response), the LLM-as-a-judge guide is the companion piece. This post is about the run shape that drops the generation step entirely.

An idea older than the LLM

The move underneath a judge-only run is about sixty years older than the LLM. In the early 1960s, Cyril Cleverdon and his team at the College of Aeronautics in Cranfield, England, set out to settle which library indexing system was actually better, and they did something quietly radical: rather than pit the systems against each other live, they froze a set of documents and questions, collected what every system returned, and handed the pile to assessors who graded each result for relevance without operating a single system. Recall and precision, the two numbers search has been judged on ever since, fell out of that setup. A judge-only run is the same move with the parts swapped: the LLM judge takes the assessor's chair, your rubric stands in for their relevance scale, and a dataset column holds the pile of outputs already waiting to be scored.

Figure: A normal run executes a prompt, then scores what comes out. A judge-only run skips generation: the input and the output already sit in the dataset as columns, and both feed the judge, which scores each row.

What you need to set one up

Three things, and they map exactly onto a normal run minus the prompt.

A dataset with an output column. Each row holds at least the response you want graded. If your rubric needs context (the source document for a faithfulness check, the user's message for a tone check, the question for an answer check), include those columns too. The dataset is just the table of stuff the judge is going to read.
A rubric, written as metrics. Same shape as for a normal run: one named criterion per metric, each scored independently on a 1–5 scale, each anchored with what specifically earns each point. The rubric is the entire instrument; vague in, vague out. LLM-as-a-judge covers how to write one that produces signal instead of noise.
A judge model with headroom. A capable frontier model, with enough token budget to finish reasoning and still emit a full response. The biases (position, verbosity, self-preference, score compression, silent failure on a truncated call) apply to judge-only runs exactly the same way they apply to normal runs. The judge is the judge regardless of where the outputs came from.
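One way to keep a rubric honest about "each anchored with what specifically earns each point" is to write it down as data and check it before any run. The sketch below is an illustration only: the `Metric` class, its fields, and the `missing_anchors` helper are hypothetical names, not CompletionKit's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One named criterion, scored independently on a 1-5 scale (hypothetical shape)."""
    name: str
    question: str
    anchors: dict  # maps a score (1-5) to what specifically earns that score


def missing_anchors(metric):
    """Return the scale points that have no anchor text written for them."""
    return [s for s in range(1, 6) if not metric.anchors.get(s, "").strip()]


faithfulness = Metric(
    name="faithfulness",
    question="Is every claim in the response supported by the source document?",
    anchors={
        1: "Most claims contradict or are absent from the source.",
        2: "Several unsupported claims.",
        3: "One clearly unsupported claim.",
        4: "All claims supported; minor imprecise wording.",
        5: "Every claim is directly supported by the source.",
    },
)
vague = Metric(name="quality", question="Is it good?", anchors={1: "Bad", 5: "Great"})

print(missing_anchors(faithfulness))  # []
print(missing_anchors(vague))         # [2, 3, 4]
```

A metric with gaps in the middle of the scale tends to be where a judge's scores bunch up, so this is a cheap check to run before spending judge tokens.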

Setting one up in CompletionKit is four moves, all in the UI. First, create a dataset and paste in your CSV, with the responses you want graded in a column (CompletionKit reads one named actual_output by default) next to any input or context columns the rubric needs to grade fairly. Second, create a run and tick "Score existing outputs": the prompt selector drops away and an "Output column" field takes its place, where you name the column holding your responses. Third, pick a judge model and check off the metrics that make up your rubric. Fourth, start it. The responses ride in as a column of the dataset, so there is no separate prompt and no generation step; the run reads that column as the response and scores it. From there it behaves like any other run: each row gets a score and a per-row rationale, and the lowest-scoring rows are the ones to read first. The docs cover the same path in more detail.
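Before pasting a CSV in, it can help to check locally that the output column exists and that no row is blank. This is a minimal sketch using only the Python standard library; the `check_dataset` function and the `question` column are illustrative assumptions, and only the `actual_output` default comes from the text above.

```python
import csv
import io


def check_dataset(csv_text, output_column="actual_output", context_columns=()):
    """Return a list of problems found in a CSV before it is used for a run."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in (output_column, *context_columns)
        if name not in header
    ]
    if problems:
        return problems
    # Row numbers count data rows only; the header line is not counted.
    for row_number, row in enumerate(reader, start=1):
        if not (row[output_column] or "").strip():
            problems.append(f"row {row_number}: empty {output_column}")
    return problems


sample = "question,actual_output\nWhat is 2+2?,4\nCapital of France?,\n"
print(check_dataset(sample, context_columns=["question"]))
# ['row 2: empty actual_output']
```

Catching a blank response here is cheaper than discovering it later as a suspiciously low score.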

The four shapes a judge-only run actually takes

In practice almost every judge-only run is one of four jobs.

1. Grading a sample of live production traffic

You already have outputs flowing every day; the question is whether the average quality is steady, climbing, or quietly falling. Sample a few hundred production responses a week, write them into a dataset, and run the judge over them. The score lands in a row alongside the response, and the lowest-scoring rows are the ones a human should actually look at. This is the version most teams reach for first because production traffic is the cheapest source of real outputs in the world, and most of it is otherwise scrolling past unscored.

The thing to remember is that production grading is not the same exercise as prompt regression testing. Regression testing holds the inputs fixed across runs so the score is comparable; production grading holds the prompt fixed and varies the inputs. Both are useful and they answer different questions. A drop in your fixed-dataset regression score means a prompt got worse. A drop in your production-sample judge-only score means something in the real world got harder, which is also worth knowing.
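The weekly sampling step can be sketched as a uniform random draw with a fixed seed, so the same week's sample can be reproduced later. The function name, the record shape, and the seed choice below are assumptions for illustration; how you export production responses is up to your own logging.

```python
import random


def weekly_sample(responses, k, seed):
    """Draw up to k responses uniformly at random, without replacement.

    A fixed seed makes the draw reproducible for a given list of responses.
    """
    rng = random.Random(seed)
    return rng.sample(responses, min(k, len(responses)))


# Hypothetical production log: one dict per response.
log = [{"id": i, "actual_output": f"response {i}"} for i in range(1000)]

sample = weekly_sample(log, k=300, seed=20260601)
print(len(sample))  # 300
```

A uniform draw is what lets you read the resulting score as an estimate of average quality; a sample built from flagged or escalated cases answers a different question.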

2. Putting a number on a hand-curated gold set

If you have a small set of outputs you consider exemplary (the responses you would ship as the bar for "great"), a judge-only run is how you confirm the rubric agrees with you. If the rubric scores a gold-set response a 5, it agrees with you on that example; if it does so across the whole set, you have good evidence the rubric is calibrated. If it scores it a 3, either the rubric is wrong or the response isn't as gold as you thought; both are useful findings, and the time to discover them is before the rubric is grading any real prompt versions. Treat the gold set as a sanity check that travels alongside the dataset.
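Once the gold-set run has finished, the check itself is small: list every gold row the judge scored below the bar you expected. A minimal sketch, assuming you have exported the scores as a mapping from row id to score; the function name and the `tolerance` parameter are hypothetical.

```python
def gold_set_disagreements(scores, expected=5, tolerance=0):
    """Return the gold rows the judge scored below (expected - tolerance)."""
    return {
        row_id: score
        for row_id, score in scores.items()
        if score < expected - tolerance
    }


judge_scores = {"gold-1": 5, "gold-2": 3, "gold-3": 4}

print(gold_set_disagreements(judge_scores))               # {'gold-2': 3, 'gold-3': 4}
print(gold_set_disagreements(judge_scores, tolerance=1))  # {'gold-2': 3}
```

Each row this returns is worth reading alongside the judge's rationale, since it points at either a rubric problem or a response that was less exemplary than assumed.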

3. Benchmarking something you didn't build

A competitor's API, an open-source model, the in-house system you're evaluating before adopting it. Run the inputs through them, paste the outputs into your dataset, kick off the judge-only run. You now have a number on a system you have no other handle on. This is also how to evaluate a third-party tool against your real workload: the system did not have to be built by you for its outputs to be gradeable by your rubric.

4. Baselining an older model before you replace it

You're swapping from one model to another, or one provider to another, and you want to know what you are giving up and gaining. Score the old model's existing outputs as a judge-only run; then run the new model on the same inputs and score those. The same rubric, scored by the same judge, on outputs from both. The difference is your migration's expected impact, in numbers, before you ship anything. This is the cleanest answer there is to "did the model swap help or hurt".
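Because both runs score the same inputs, the comparison can be done row by row rather than on two unrelated averages. The sketch below assumes each run's scores are available as a mapping from input id to score; `paired_delta` and the ids are illustrative names, not part of any tool.

```python
from statistics import mean


def paired_delta(old_scores, new_scores):
    """Compare two runs on the inputs they share, row by row."""
    shared = sorted(old_scores.keys() & new_scores.keys())
    if not shared:
        raise ValueError("the two runs share no inputs")
    deltas = [new_scores[i] - old_scores[i] for i in shared]
    return {
        "rows": len(shared),
        "mean_old": mean(old_scores[i] for i in shared),
        "mean_new": mean(new_scores[i] for i in shared),
        "mean_delta": mean(deltas),
        # Inputs where the new model scored lower: the rows to read first.
        "regressed": [i for i in shared if new_scores[i] < old_scores[i]],
    }


old = {"q1": 4, "q2": 3, "q3": 5}
new = {"q1": 5, "q2": 4, "q3": 4}

result = paired_delta(old, new)
print(result["rows"], result["regressed"])  # 3 ['q3']
```

The list of regressed inputs is often more useful than the mean: a swap can raise the average while still breaking a specific kind of request.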

Where it gets tricky

A judge-only run inherits every pitfall of the judge that drives it, and adds two of its own.

Sampling bias becomes a result, not a footnote. If the outputs in your dataset are a biased sample (the responses your team noticed, the support tickets that escalated, the chat sessions a customer flagged), the score reflects that bias, not your overall quality. Sampling on purpose is fine. Sampling by accident, and then reading the score as though it represents the average, is the most common way to mislead yourself with a judge-only run. Pin down what your sample is supposed to represent before you read the number.
You can't disentangle generator and judge anymore. When you run your own prompt with your own judge you can iterate on either side. When you grade outputs from a third-party system you cannot fix the system's generator; you can only narrow your rubric or change the judge. That's a different kind of conversation, and it's worth being explicit that what you're producing is a measurement, not a diagnostic. Save the diagnostic work for systems you can change.
The judge still needs context to be fair. An LLM judge with only the output and no input is grading text in a vacuum. It will catch tone, fluency, internal consistency, JSON validity. It will miss "this answer is wrong for what was asked", because it never saw what was asked. The fix is to give the judge whatever the rubric depends on. If your rubric checks faithfulness to a source document, the source document goes in a column. If it checks relevance to a user's message, the user's message goes in a column. The judge is only as fair as the context it gets.
An empty output is a failed review, not a 1. This one bites hardest because it doesn't look like a failure: the judge call truncates, or the reasoning model burns its whole token budget on hidden reasoning and emits an empty visible completion, and a naive parser coerces that empty string to the floor score. This is the same trap covered in the LLM-as-a-judge piece, and the fix is the same: give the judge real token headroom and treat truncation or empty content as a failed review (logged, surfaced, excluded from the average), never as a score.
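The last pitfall is the easiest to show in code. The sketch below assumes, for simplicity, that the judge replies with a bare integer from 1 to 5 and that each call reports a `finish_reason` of `"stop"` when it completed normally; both are assumptions for illustration, and a real judge reply would also carry a rationale. The point is only that anything unparseable becomes a failed review rather than a low score.

```python
from statistics import mean

VALID_SCORES = {"1", "2", "3", "4", "5"}


def parse_score(raw, finish_reason="stop"):
    """Return an int from 1 to 5, or None if the review failed."""
    if finish_reason != "stop":
        return None  # truncated or interrupted call: not a score
    text = (raw or "").strip()
    if text not in VALID_SCORES:
        return None  # empty or malformed content: not a score
    return int(text)


def summarize(reviews):
    """Average only the successful reviews and list the failed ones."""
    scores, failed = [], []
    for row_id, raw, finish_reason in reviews:
        score = parse_score(raw, finish_reason)
        if score is None:
            failed.append(row_id)
        else:
            scores.append(score)
    return {
        "mean": mean(scores) if scores else None,
        "scored": len(scores),
        "failed": failed,
    }


reviews = [
    ("r1", "4", "stop"),
    ("r2", "", "length"),  # ran out of token budget
    ("r3", "2", "stop"),
    ("r4", "", "stop"),    # empty visible completion
]
print(summarize(reviews))
# {'mean': 3, 'scored': 2, 'failed': ['r2', 'r4']}
```

Had the two empty replies been coerced to 1, the average over the same four rows would have been 2 instead of 3, with nothing in the number to show that half the reviews never happened.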

The shape of the question it answers

A normal run answers: how does this prompt do on this dataset? A judge-only run answers: how do these outputs do, against this rubric? The shift is from "did my change help" to "is what I'm looking at any good", and a surprising number of real product questions are the second kind.

Is the model my agent picked up last quarter still pulling its weight? Are the responses we generated in our beta period acceptable to ship as examples? Is the open-source baseline as close to ours as the leaderboard says? Did the new vendor's demo actually answer the questions we sent? None of those need a prompt of your own to be in the loop. All of them need a rubric, a judge, and a column of outputs.

That is the whole job. Put a column of outputs in front of a judge, write down what good means, and read the score.

FAQ

What is a judge-only run?

A run that scores outputs you already have, with no generation step. You drop the outputs into a column of a dataset, point the run at that column, define your metrics, and an LLM judge scores every row against your rubric. The inputs to the prompt that produced those outputs can be in the dataset too, so the judge sees what was asked as well as what was answered.

When should I use a judge-only run instead of a normal run?

Use a judge-only run when the outputs already exist and you only want a number on them. Common cases: scoring a sample of live production traffic, grading a hand-curated gold set, benchmarking another system you didn't build, or putting a baseline number on an older model's results before you replace it. Use a normal run when you want to score a specific prompt version against the dataset, because that's the loop a normal run is built for.

Do judge-only runs need the original input, or just the output?

They need whatever the rubric needs to grade fairly. For a faithfulness check the judge has to see the source document; for a tone check the original user message; for a valid-JSON check the output alone is enough. The safe default is to include both the input and the output in the dataset and let the rubric ignore what it doesn't need.

Can a judge-only run replace human review of production traffic?

It can replace the routine grading, not the spot check. A judge-only run scales sampling: it lets you put a number on a thousand production responses a week instead of a hand-graded dozen. Humans are still the calibration set behind the judge, the people who write the rubric, and the people who look at the judge's lowest-scoring rows to confirm the failure is real.


Built by Homemade Software. Disagree with any of this? support@completionkit.com.
