June 1, 2026
What is a judge-only run?
A judge-only run is a scored evaluation of outputs you already have, with no generation step. You drop the outputs into a column of a dataset, point the run at that column, define your metrics, and an LLM judge scores every row against your rubric. Nothing is generated, no prompt is invoked; the run does the second half of an evaluation (the scoring half) on responses produced somewhere else.
The point is that scoring and generating are separable. A normal run does both: it executes a prompt over each input in a dataset and then scores the output. A judge-only run does only the second step, on outputs that came from anywhere: a sample of last week's production traffic, a hand-curated gold set, a competitor's API, an older model you're about to retire, a teammate's spreadsheet of "responses I want a number on". Same rubric, same per-row rationale, same comparable score. No generation.
For the wider picture of what a "run" and a "metric" mean, start with "What is prompt evaluation?". For the scoring half (how an LLM judge actually grades a response), the LLM-as-a-judge guide is the companion piece. This post is about the run shape that drops the generation step entirely.
Where the idea came from
Separating "score the outputs" from "produce the outputs" is older than LLMs by half a century, and the lineage is worth tracing because every prompt-eval move available today inherits from it.
The first systematic version is the Cranfield experiments, run by Cyril Cleverdon and his team at the College of Aeronautics at Cranfield, England, between 1958 and 1966. Cleverdon's group was trying to compare indexing systems for libraries, and they did something that looks obvious in hindsight and was novel at the time: they built a fixed test collection (Cranfield 2 ended up with 1,400 documents and 225 queries), gathered the results of every indexing method on those queries, and then had assessors judge the relevance of each retrieved document, separately from the systems that retrieved them. The assessors were not running any indexing system. They were doing the judging half, on a pile of outputs the systems had already produced. Recall and precision, the two metrics every search engine has been graded on since, fell out of that setup. The methodology has a name, the Cranfield paradigm, and it is the direct ancestor of every offline eval you have ever run.
The pattern got industrial-scale infrastructure in November 1992, when NIST and the Defense Advanced Research Projects Agency (DARPA) launched the Text Retrieval Conference (TREC). TREC-1 attracted 25 groups, each running their own retrieval system on a shared corpus and submitting their top-ranked documents back to NIST. NIST then used a method called pooled relevance judgments: collect the top-N documents from every participating system, deduplicate, hand the pool to human assessors, and grade every document in the pool once. Every system's score then comes out of those same judgments. The assessors at NIST never ran any of the search engines. They judged outputs.
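The pooling step is mechanical enough to sketch. What follows is a minimal illustration of the idea, not NIST's actual tooling: the system names, document IDs, the `depth` parameter, and the relevance judgments are all made up for the example.

```python
def build_pool(runs, depth):
    """Union of every system's top-`depth` documents, deduplicated."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def precision_at(ranked_docs, relevant, depth):
    """Fraction of a system's top-`depth` documents that were judged relevant."""
    top = ranked_docs[:depth]
    if not top:
        return 0.0
    return sum(doc in relevant for doc in top) / len(top)


# Hypothetical ranked submissions from two systems.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
}

pool = build_pool(runs, depth=2)

# Assessors grade each pooled document once (hypothetical judgments).
relevant = {"d2", "d4"}

# Every system is then scored from the same set of judgments.
scores = {name: precision_at(docs, relevant, 2) for name, docs in runs.items()}
```

The assessors only ever see `pool`. Nothing in the scoring step needs to know how any system produced its ranking.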
The same move shows up in machine translation. In July 2002 at ACL in Philadelphia, Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu of IBM Research published BLEU, a method for scoring a translation by comparing its n-grams against one or more reference translations. BLEU doesn't translate anything. It takes a candidate output, takes references that already exist, and returns a number. Twenty-four years on, with close to twenty thousand citations, BLEU is one of the most-cited "score outputs you already have" tools in the history of language technology.
A judge-only LLM run is the latest expression of the same shape. The judge replaces the human assessor (or the n-gram comparator), the rubric replaces the relevance scale, the dataset is the column of outputs and (often) their inputs. What's new is the speed and the open-endedness: the judge can grade a summary or an open chat reply on a written rubric in seconds, where Cleverdon's assessors took months and BLEU only worked on tasks with reference translations. The shape underneath is the same one Cleverdon used in 1962.
What you need to set one up
Three things, and they map exactly onto a normal run minus the prompt.
- A dataset with an output column. Each row holds at least the response you want graded. If your rubric needs context (the source document for a faithfulness check, the user's message for a tone check, the question for an answer check), include those columns too. The dataset is just the table of stuff the judge is going to read.
- A rubric, written as metrics. Same shape as for a normal run: one named criterion per metric, each scored independently on a 1–5 scale, each anchored with what specifically earns each point. The rubric is the entire instrument; vague in, vague out. LLM-as-a-judge covers how to write one that produces signal instead of noise.
- A judge model with headroom. A capable frontier model, with enough token budget to finish reasoning and still emit a full response. The biases (position, verbosity, self-preference, score compression, silent failure on a truncated call) apply to judge-only runs exactly the same way they apply to normal runs. The judge is the judge regardless of where the outputs came from.
The CompletionKit docs walk through pointing a run at an output column and configuring its rubric. Once that's set, the run shape is the same as any other: kick it off, watch each row get scored, get a per-row rationale alongside the number.
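To make the three pieces concrete, here is a minimal sketch of the run shape in plain Python. It is not CompletionKit's API: `Metric`, `Review`, `judge_only_run`, the column names, and `stub_judge` are all hypothetical, and the stub stands in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Metric:
    name: str
    rubric: str  # what earns each point on the 1-5 scale


@dataclass
class Review:
    metric: str
    score: int
    rationale: str


# A judge takes one metric and one dataset row and returns (score, rationale).
Judge = Callable[[Metric, Dict[str, str]], Tuple[int, str]]


def judge_only_run(rows, output_column, metrics, judge) -> List[List[Review]]:
    """Score every row on every metric. There is no generation step anywhere."""
    results = []
    for row in rows:
        if output_column not in row:
            raise KeyError(f"row is missing the output column {output_column!r}")
        results.append([Review(m.name, *judge(m, row)) for m in metrics])
    return results


def stub_judge(metric, row):
    """Stand-in for a real LLM judge: checks whether the response quotes the source."""
    if row["source"] in row["response"]:
        return 5, "response quotes the source"
    return 2, "no support found in the source"


rows = [
    {"source": "refunds take 5 days", "response": "Per policy, refunds take 5 days."},
    {"source": "refunds take 5 days", "response": "Refunds are instant."},
]
metrics = [Metric("faithfulness", "5 = every claim is supported by the source; 1 = contradicts it")]

results = judge_only_run(rows, "response", metrics, stub_judge)
```

Each row comes back with one `Review` per metric, so the rationale travels with the number.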
The four shapes a judge-only run actually takes
In practice almost every judge-only run is one of four jobs.
1. Grading a sample of live production traffic
You already have outputs flowing every day; the question is whether the average quality is steady, climbing, or quietly falling. Sample a few hundred production responses a week, write them into a dataset, and run the judge over them. The score lands in a row alongside the response, and the lowest-scoring rows are the ones a human should actually look at. This is the version most teams reach for first because production traffic is the cheapest source of real outputs in the world, and most of it is otherwise scrolling past unscored.
The thing to remember is that production grading is not the same exercise as prompt regression testing. Regression testing holds the inputs fixed across runs so the score is comparable; production grading holds the prompt fixed and varies the inputs. Both are useful and they answer different questions. A drop in your fixed-dataset regression score means a prompt got worse. A drop in your production-sample judge-only score means something in the real world got harder, which is also worth knowing.
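The weekly loop can be sketched in a few lines. This assumes responses arrive as dictionaries and that the judge has already written a `score` field onto each row; the function names, the sample size, and the stand-in scores are illustrative only.

```python
import random


def weekly_sample(responses, k, seed):
    """Uniform random sample of up to k responses; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(responses, min(k, len(responses)))


def rows_for_human_review(scored_rows, n):
    """The n lowest-scoring rows, worst first."""
    return sorted(scored_rows, key=lambda row: row["score"])[:n]


# Hypothetical week of production traffic.
responses = [{"id": i, "response": f"reply {i}"} for i in range(1000)]
sample = weekly_sample(responses, 300, seed=2026)

# Stand-in scores; in a real run these come from the judge.
scored = [dict(row, score=1 + row["id"] % 5) for row in sample]
worst = rows_for_human_review(scored, 10)
```

Sampling uniformly at random is a deliberate choice here. It keeps the weekly score an estimate of average quality rather than a score for whichever responses someone happened to notice.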
2. Putting a number on a hand-curated gold set
If you have a small set of outputs you consider exemplary (the responses you would ship as the bar for "great"), a judge-only run is how you confirm the rubric agrees with you. If the rubric scores a gold-set response a 5, your rubric is calibrated. If it scores it a 3, either the rubric is wrong or the response isn't as gold as you thought; both are useful findings, and the time to discover them is before the rubric is grading any real prompt versions. Treat the gold set as a sanity check that travels alongside the dataset.
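As a rough sketch, the sanity check amounts to flagging any gold-set row the judge scored lower than you expected. The `floor` of 4, the row IDs, and the metric names below are assumptions for illustration.

```python
def gold_set_disagreements(gold_scores, floor=4):
    """Return gold-set rows the judge scored below `floor` on any metric."""
    flagged = {}
    for row_id, per_metric in gold_scores.items():
        low = {metric: score for metric, score in per_metric.items() if score < floor}
        if low:
            flagged[row_id] = low
    return flagged


# Hypothetical judge scores for two responses you consider exemplary.
gold_scores = {
    "gold-1": {"faithfulness": 5, "tone": 5},
    "gold-2": {"faithfulness": 5, "tone": 3},
}

flagged = gold_set_disagreements(gold_scores)
```

Every flagged row is a prompt to read the judge's rationale and decide which of the two is off: the rubric or the response.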
3. Benchmarking something you didn't build
A competitor's API, an open-source model, a vendor's system you're evaluating before adopting it. Run the inputs through them, paste the outputs into your dataset, kick off the judge-only run. You now have a number on a system you have no other handle on. This is also how to evaluate a third-party tool against your real workload: the system did not have to be built by you for its outputs to be gradeable by your rubric.
4. Baselining an older model before you replace it
You're swapping from one model to another, or one provider to another, and you want to know what you are giving up and gaining. Score the old model's existing outputs as a judge-only run; then run the new model on the same inputs and score those. The same rubric, scored by the same judge, on outputs from both. The difference is your migration's expected impact, in numbers, before you ship anything. This is the cleanest answer there is to "did the model swap help or hurt".
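One way to compute that difference is as a paired comparison: the same input scored under both models, with inputs present in only one run left out. This is a minimal sketch with hypothetical input IDs and scores; `migration_delta` is an illustrative name.

```python
from statistics import mean


def migration_delta(old_scores, new_scores):
    """Mean per-input score change (new minus old) over inputs scored in both runs."""
    shared = sorted(old_scores.keys() & new_scores.keys())
    if not shared:
        raise ValueError("no inputs in common; the two runs are not comparable")
    deltas = [new_scores[i] - old_scores[i] for i in shared]
    return mean(deltas), len(shared)


# Hypothetical scores on one metric, keyed by input ID.
old = {"q1": 4, "q2": 3, "q3": 5}
new = {"q1": 5, "q2": 3, "q3": 5, "q4": 5}

delta, compared = migration_delta(old, new)
```

Pairing on the input keeps the comparison honest: `q4` has no old-model score, so it does not count toward the delta.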
Where it gets tricky
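A judge-only run inherits every pitfall of the judge that drives it, and adds two of its own. The first two items below are the new ones; the last two are inherited, and they bite often enough to be worth restating.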
- Sampling bias becomes a result, not a footnote. If the outputs in your dataset are a biased sample (the responses your team noticed, the support tickets that escalated, the chat sessions a customer flagged), the score reflects that bias, not your overall quality. Sampling on purpose is fine. Sampling by accident, and then reading the score as though it represents the average, is the most common way to mislead yourself with a judge-only run. Pin down what your sample is supposed to represent before you read the number.
- You can't disentangle generator and judge anymore. When you run your own prompt with your own judge you can iterate on either side. When you grade outputs from a third-party system you cannot fix the system's generator; you can only narrow your rubric or change the judge. That's a different kind of conversation, and it's worth being explicit that what you're producing is a measurement, not a diagnostic. Save the diagnostic work for systems you can change.
- The judge still needs context to be fair. An LLM judge with only the output and no input is grading text in a vacuum. It will catch tone, fluency, internal consistency, JSON validity. It will miss "this answer is wrong for what was asked", because it never saw what was asked. The fix is to give the judge whatever the rubric depends on. If your rubric checks faithfulness to a source document, the source document goes in a column. If it checks relevance to a user's message, the user's message goes in a column. The judge is only as fair as the context it gets.
- An empty output is a failed review, not a 1. This one bites hardest because it doesn't look like a failure: the judge call truncates, or the reasoning model burns its whole token budget on hidden reasoning and emits an empty visible completion, and a naive parser coerces that empty string to the floor. It is the same trap covered in the LLM-as-a-judge piece, and the fix is the same: give the judge real token headroom and treat truncation or empty content as a failed review (logged, surfaced, excluded from the average), never as a score.
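The last item is the easiest to get wrong in code, so here is a minimal sketch of a parser that refuses to turn a failed review into a score. The `Score: N` reply format and the `truncated` flag (however your client reports a token-limit cutoff) are assumptions, not any particular provider's API.

```python
import re
from statistics import mean
from typing import Optional


def parse_score(text: str, truncated: bool) -> Optional[int]:
    """Return a 1-5 score, or None when the review failed."""
    if truncated or not text.strip():
        return None  # a failed review, never the floor score
    match = re.search(r"\bscore:\s*([1-5])\b", text, re.IGNORECASE)
    return int(match.group(1)) if match else None


def summarize(scores):
    """Average only the reviews that succeeded; report failures separately."""
    valid = [s for s in scores if s is not None]
    return {
        "mean": mean(valid) if valid else None,
        "scored": len(valid),
        "failed": len(scores) - len(valid),
    }


# Hypothetical judge replies as (visible text, was the call truncated?).
replies = [
    ("Score: 4. Clear and accurate.", False),
    ("", False),                     # empty visible completion
    ("Score: 2. Misses th", True),   # cut off mid-rationale
]

summary = summarize([parse_score(text, cut) for text, cut in replies])
```

Reporting `failed` next to the mean keeps a run that lost most of its reviews from passing as a clean result.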
The shape of the question it answers
A normal run answers: how does this prompt do on this dataset? A judge-only run answers: how do these outputs do, against this rubric? The shift is from "did my change help" to "is what I'm looking at any good", and a surprising number of real product questions are the second kind.
Is the model my agent picked up last quarter still pulling its weight? Are the responses we generated in our beta period acceptable to ship as examples? Is the open-source baseline as close to ours as the leaderboard says? Did the new vendor's demo actually answer the questions we sent? None of those need a prompt of your own to be in the loop. All of them need a rubric, a judge, and a column of outputs.
That is the whole job. Put a column of outputs in front of a judge, write down what good means, and read the score.
FAQ
What is a judge-only run?
A run that scores outputs you already have, with no generation step. You drop the outputs into a column of a dataset, point the run at that column, define your metrics, and an LLM judge scores every row against your rubric. The inputs to the prompt that produced those outputs can be in the dataset too, so the judge sees what was asked as well as what was answered.
When should I use a judge-only run instead of a normal run?
Use a judge-only run when the outputs already exist and you only want a number on them. Common cases: scoring a sample of live production traffic, grading a hand-curated gold set, benchmarking another system you didn't build, or putting a baseline number on an older model's results before you replace it. Use a normal run when you want to score a specific prompt version against the dataset, because that's the loop a normal run is built for.
Do judge-only runs need the original input, or just the output?
They need whatever the rubric needs to grade fairly. For a faithfulness check the judge has to see the source document; for a tone check the original user message; for a valid-JSON check the output alone is enough. The safe default is to include both the input and the output in the dataset and let the rubric ignore what it doesn't need.
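If you want to check this before spending judge calls, a small pre-flight pass can look for missing context. The metric names, column names, and the `REQUIRED_COLUMNS` mapping below are hypothetical; the real mapping is whatever your own rubric depends on.

```python
# Hypothetical mapping from each metric to the columns its rubric reads.
REQUIRED_COLUMNS = {
    "faithfulness": {"source_document", "response"},
    "tone": {"user_message", "response"},
    "valid_json": {"response"},
}


def missing_context(rows, metric_names, required=REQUIRED_COLUMNS):
    """Map row index -> columns that are absent or empty for the chosen metrics."""
    needed = set().union(*(required[name] for name in metric_names))
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(c for c in needed if not str(row.get(c) or "").strip())
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"response": "{}", "user_message": "hi"},
    {"response": "ok", "user_message": ""},
]

problems = missing_context(rows, ["tone", "valid_json"])
```

Rows that show up in `problems` are the ones the judge would otherwise grade in a vacuum.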
Can a judge-only run replace human review of production traffic?
It can replace the routine grading, not the spot check. A judge-only run scales sampling: it lets you put a number on a thousand production responses a week instead of a hand-graded dozen. Humans are still the calibration set behind the judge, the people who write the rubric, and the people who look at the judge's lowest-scoring rows to confirm the failure is real.
Built by Homemade Software. Disagree with any of this? support@completionkit.com.