May 15, 2026 · The CompletionKit team

LLM-as-a-judge: a practical guide to scoring prompt outputs

LLM-as-a-judge is the practice of using one language model to score the output of another. Instead of a human reading every response and rating it by hand, you hand the response (together with a rubric) to a model and ask it to grade. The judge returns a score and, if you ask for it, a sentence of rationale. That is the whole idea. Everything interesting is in the details of doing it well.

It exists because the obvious alternatives don't work for the outputs people actually ship. Exact-match comparison works when there's one correct string; it falls apart the moment the output is a summary, an answer, an extraction, a rewrite, anything where two good responses can be worded completely differently. Human grading works, and it's the gold standard, but it doesn't scale: nobody is going to read 200 responses every time you tweak a prompt, and if scoring is slow then scoring stops happening. LLM-as-a-judge is the thing in the middle. It's cheap enough to run on every change and consistent enough to be useful, as long as you respect what it is and isn't.

Where the idea came from

The practice got its name from a June 2023 paper, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", out of the Berkeley group behind Chatbot Arena and the Vicuna model. They had a concrete problem: they were ranking chatbots head-to-head and needed to grade open-ended, multi-turn answers, and human voting couldn't keep up with the volume. So they asked whether a strong model could stand in for the crowd.

The clever part was the bar they set. They didn't ask "is the model judge correct?", an unanswerable question for open-ended text. They asked: does the judge agree with a human as often as two humans agree with each other? On this kind of task, human graders only agree with each other around 80% of the time; that's the ceiling, and it's nowhere near 100%. GPT-4, used as a judge, agreed with human graders over 80% of the time, matching the human-to-human rate. That single result is what made LLM-as-a-judge credible: not that the judge is right in some absolute sense, but that it disagrees with you about as much as your own colleague would.

The same paper named the failure modes in the same breath (position bias, verbosity bias, and self-enhancement bias), so the method and its known flaws have been on the table together from the start. That's the right way to hold it: a useful instrument with a documented margin of error, not an oracle.

How it works

A judge is just another prompt. You give a model three things: the input that produced the response, the response itself, and a rubric that says what "good" means. You ask it to return a score on a fixed scale. In CompletionKit the scale is 1–5, where 5 is the best and 1 is the worst, and each run can carry several metrics so you score independent things independently.

That last point matters more than it looks. A single global "is this good?" score blends together everything you care about into one mushy number, and a mushy number can't tell you what regressed. Score the things you actually care about as separate criteria ("stays faithful to the source", "valid JSON", "acknowledges the order number", "never promises a refund"), and a drop in any one of them points straight at the problem. The metrics you judge on should be as specific as the things you put in the prompt, because those are exactly the things a prompt change can break.
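To make the shape concrete, here is a minimal sketch of assembling one judge prompt per metric. The Metric class, the build_judge_prompt function, the prompt wording, and the refund rubric text are all illustrative assumptions, not CompletionKit's actual prompt or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One independently scored criterion (hypothetical structure)."""
    name: str
    rubric: str


def build_judge_prompt(metric: Metric, task_input: str, response: str) -> str:
    """Combine the three ingredients (input, response, rubric) into one prompt."""
    return (
        f"You are grading a response on one criterion: {metric.name}.\n\n"
        f"Rubric:\n{metric.rubric}\n\n"
        f"Input:\n{task_input}\n\n"
        f"Response:\n{response}\n\n"
        "Reply with 'Score: N' where N is 1 (worst) to 5 (best), "
        "then one sentence of rationale."
    )


# Illustrative metrics; the rubric wording here is made up for the example.
metrics = [
    Metric("acknowledges the order number",
           "5 if the reply quotes the order number verbatim; 1 if it ignores the order."),
    Metric("never promises a refund",
           "5 if the reply makes no refund commitment; 1 if it promises a refund."),
]

task_input = "My order arrived with the box crushed."
response = "I'm so sorry to hear that. We'll look into it."

# One prompt per metric, so each criterion gets its own score.
prompts = {m.name: build_judge_prompt(m, task_input, response) for m in metrics}
for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Each prompt goes to the judge as a separate call, which is what lets a drop in one metric point at one specific problem.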

[Figure: a judge takes the input, the response, and a rubric, and returns a score with a short rationale.]

Where it goes wrong

An LLM judge is a model, and it brings a model's failure modes with it. None of these are reasons to avoid LLM-as-a-judge. They're reasons to set it up deliberately.

Position bias. Asked to compare two responses, judges favour whichever one came first (or sometimes whichever came last) independent of quality. If you do pairwise comparison, run each pair both ways and average, or you're measuring order as much as quality.
Verbosity bias. Longer, more confident-sounding answers score higher even when they say nothing more. If brevity matters for your product, you have to name it in the rubric explicitly, because the judge's default instinct runs the other way.
Self-preference. This is what the MT-Bench paper calls self-enhancement bias: a model tends to rate outputs from its own family more highly. Grading a GPT response with a GPT judge, or a Claude response with a Claude judge, quietly tilts the score. Where you can, judge with a different family than the one you're evaluating.
Score compression. Left to its own devices a judge clusters everything around 4 out of 5. A scale where nothing ever scores 2 isn't a scale. Anchoring each point of the rubric with a concrete description (what specifically earns a 2, what earns a 5) spreads the distribution back out.
Silent failure. This is the one that bites hardest, because it doesn't look like a failure. If the judge call truncates, or the model burns its whole token budget on hidden reasoning and returns an empty visible completion, a naive parser reads that empty string and coerces it to a score, usually the floor. Now a broken call looks like a genuine 1-star review. A wrong number is far worse than an error, because you would investigate an error but you would trust a number. We hit exactly this with reasoning models: their reasoning tokens count against the same budget as the answer, so a too-small limit produces empty content and a quiet floor of 1. The fix is to give the judge real token headroom and to treat truncation or an empty completion as a failed review (recorded, surfaced, and excluded from the average), never as a score.
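Two of the fixes above are mechanical enough to sketch in code: treating a truncated or empty judge call as a failed review, and averaging a pairwise comparison over both presentation orders. This is a rough sketch under stated assumptions, not CompletionKit's implementation. It assumes the judge was told to answer in a "Score: N" format, that your provider reports truncation with a finish reason of "length" (check what yours actually returns), and that the pairwise judge is a hypothetical callable returning a preference for the first-shown response between -1.0 and 1.0.

```python
import re
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass(frozen=True)
class Review:
    score: Optional[int]    # None means the review failed
    failure: Optional[str]  # why it failed, if it did


def parse_review(text: str, finish_reason: str) -> Review:
    """Turn raw judge output into a Review without ever inventing a score."""
    if finish_reason == "length":  # assumed truncation marker; provider-specific
        return Review(None, "truncated")
    if not text.strip():
        return Review(None, "empty completion")
    match = re.search(r"score\s*[:=]\s*([1-5])\b", text, re.IGNORECASE)
    if match is None:
        return Review(None, "no score found")
    return Review(int(match.group(1)), None)


def average_score(reviews: list[Review]) -> Optional[float]:
    """Mean over successful reviews only; None if every review failed."""
    scores = [r.score for r in reviews if r.score is not None]
    return mean(scores) if scores else None


def pairwise_margin(judge: Callable[[str, str], float], a: str, b: str) -> float:
    """Preference for `a` over `b`, averaged over both presentation orders."""
    forward = judge(a, b)     # a shown first
    backward = -judge(b, a)   # b shown first, sign flipped back to a's point of view
    return (forward + backward) / 2


reviews = [
    parse_review("Score: 5. Quotes the order number.", "stop"),
    parse_review("", "stop"),                      # empty completion: failed, not a 1
    parse_review("Score: 3. Mentions an", "length"),  # truncated: failed, not a 3
]
print(average_score(reviews))
print([r.failure for r in reviews if r.score is None])
```

The failed reviews stay in the list with their reason attached, so they can be counted and surfaced instead of quietly dragging the average toward the floor.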

Making it reliable

"Reliable" here doesn't mean the judge is correct in some absolute sense. It means it's consistent enough to rank prompt versions and catch regressions, which is what you actually need. A few things get you there:

Write a specific rubric. "Rate the quality 1–5" gives you noise. "Score 5 if every action item names an explicit owner; 3 if owners are implied; 1 if ownership is missing" gives you a signal. The rubric is the whole instrument: vague in, vague out.
Anchor every point of the scale. Describe what earns each score, not just the top and bottom. This is the single most effective fix for score compression.
Give a few worked examples. One or two responses with their correct scores and a sentence of why, included in the judge prompt, calibrate it far better than instructions alone.
Calibrate against humans once. Hand-label 20–50 responses, run the judge over the same set, and look at where they disagree. You're not chasing perfect agreement; you're checking the judge isn't systematically wrong somewhere. If it is, the disagreements tell you which rubric line to rewrite. In CompletionKit this is the judge agreement surface: tag rows agree, disagree, or borderline, and the trust panel shows you where the judge and your own labels part ways.
Use a capable judge and give it headroom. A judge weaker than the model it grades misses subtle errors. And whatever model you pick, leave enough token budget for it to finish, especially with reasoning models, where hidden reasoning eats the same budget as the answer.
Hold the rubric fixed. The rubric is your measuring stick. If you change it, every prior score is on a different scale and your baseline is meaningless. Version the rubric the way you version the prompt.
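The calibration step above boils down to a small comparison. The sketch below assumes you hold human labels and judge scores as dictionaries keyed by row id; the calibration_report function, the gap thresholds behind the agree, borderline, and disagree tags, and the sample labels are assumptions made for illustration, not how CompletionKit computes its trust panel.

```python
from statistics import mean


def calibration_report(human: dict[str, int], judge: dict[str, int]) -> dict:
    """Compare judge scores to hand labels on the rows that have both."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no rows have both a human label and a judge score")
    tags = {}
    for row_id in shared:
        gap = abs(judge[row_id] - human[row_id])
        # Assumed thresholds: same score agrees, one point apart is borderline.
        tags[row_id] = "agree" if gap == 0 else "borderline" if gap == 1 else "disagree"
    return {
        "tags": tags,
        "exact_agreement": sum(t == "agree" for t in tags.values()) / len(shared),
        # Positive means the judge scores higher than the humans on average.
        "mean_signed_gap": mean(judge[r] - human[r] for r in shared),
    }


# Made-up labels for illustration only.
human_labels = {"row-1": 5, "row-2": 3, "row-3": 1, "row-4": 2}
judge_scores = {"row-1": 5, "row-2": 4, "row-3": 4, "row-4": 2}

report = calibration_report(human_labels, judge_scores)
print(report["tags"])
print(report["exact_agreement"], report["mean_signed_gap"])
```

The rows tagged disagree are the ones worth reading by hand, and a mean signed gap far from zero suggests the judge is consistently lenient or harsh rather than merely noisy.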

A worked example, from a real run

Here's the whole instrument assembled and fired, from a run in our own workspace. The metric is acknowledges-order-number, and the rubric anchors every point of the scale: 5, the reply quotes the customer's order number verbatim; 4, it references the specific order but doesn't quote the number; 3, it acknowledges an order exists; 2, generic acknowledgment with no reference to the order; 1, it ignores the order entirely.
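Written down as data, that rubric might look like the sketch below. The anchor text is the rubric above; the render_rubric and rubric_fingerprint helpers, the rendered layout, and the idea of using a content hash as a rubric version are illustrative assumptions, not CompletionKit's storage format.

```python
import hashlib

SCALE = range(1, 6)  # the 1-5 scale


def render_rubric(metric: str, anchors: dict[int, str]) -> str:
    """Render an anchored rubric as prompt text, refusing any unanchored point."""
    missing = [point for point in SCALE if point not in anchors]
    if missing:
        raise ValueError(f"rubric for {metric!r} has no anchor for: {missing}")
    lines = [f"Metric: {metric}"]
    # Best score first, matching the order used in the prose.
    lines += [f"{point}: {anchors[point]}" for point in sorted(SCALE, reverse=True)]
    return "\n".join(lines)


def rubric_fingerprint(rubric_text: str) -> str:
    """Short content hash, so a score can record which rubric text produced it."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


anchors = {
    5: "the reply quotes the customer's order number verbatim",
    4: "it references the specific order but doesn't quote the number",
    3: "it acknowledges an order exists",
    2: "generic acknowledgment with no reference to the order",
    1: "it ignores the order entirely",
}

rubric_text = render_rubric("acknowledges-order-number", anchors)
print(rubric_text)
print(rubric_fingerprint(rubric_text))
```

Refusing to render a rubric with a missing anchor enforces the "anchor every point" rule mechanically, and storing the fingerprint next to each score makes it obvious when two runs were graded against different rubric text.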

One row's input was a two-line complaint about order #58219 arriving with the box crushed and the mug inside shattered. The generated reply opened with "I'm so sorry to hear about the delay and the damaged mug" and never mentioned the order number. The judge returned a 1, with this rationale: "The reply never references the customer's order or order number #58219, instead giving a completely generic response that ignores the specific order details provided."

That rationale line is the part you can't get from an assert. It doesn't just dock the score; it names the failure precisely enough that the fix writes itself. Across the eight-row dataset the prompt averaged 3.88, dragged down by exactly the rows where the reply went generic.

(Real run, June 12, 2026. Generation by gpt-4.1-mini, judged by claude-sonnet-4-6.)

The half of the loop you can run on outputs you already have

You don't always need to generate anything to use a judge. If you already have outputs (production logs, a competitor's responses, hand-curated examples, results from an older model) you can score them directly. In CompletionKit that's a judge-only run: drop the outputs into a column of a dataset, point the run at that column, define your metrics, and the judge scores every row against your rubric. Same scoring, same per-row rationale, no generation step. It's a fast way to benchmark something you didn't build, or to put a number on how your live traffic is actually doing.
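Stripped of the product, a judge-only run is a loop over a column. The sketch below is a rough illustration and not CompletionKit's API: judge_only_run and stub_judge are hypothetical names, the rows are made up, and the stub stands in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical judge signature: existing output text -> (score, rationale).
JudgeFn = Callable[[str], tuple[int, str]]


def judge_only_run(rows: list[dict[str, str]], output_column: str,
                   judge: JudgeFn) -> list[dict]:
    """Score outputs that already exist, row by row; nothing is generated."""
    scored = []
    for row in rows:
        score, rationale = judge(row[output_column])
        # Copy the row so the source dataset is left untouched.
        scored.append({**row, "score": score, "rationale": rationale})
    return scored


def stub_judge(output: str) -> tuple[int, str]:
    """Placeholder for a real model call: only checks for a quoted order number."""
    if "#" in output:
        return 5, "Quotes an order number."
    return 1, "Never references an order."


# Made-up rows standing in for production logs or hand-curated examples.
rows = [
    {"id": "a", "reply": "Sorry about order #58219, a replacement mug is on its way."},
    {"id": "b", "reply": "Sorry to hear that. We value your feedback."},
]

for result in judge_only_run(rows, "reply", stub_judge):
    print(result["id"], result["score"], result["rationale"])
```

In a real run the stub would be replaced by a call to the judge model with the rubric attached, and failed calls would be recorded as failures instead of scores.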

However you use it, the point of a judge is the same: turn "this output feels fine" into a number you can compare to last week's number. That's the difference between hoping a prompt change helped and knowing it did. And once the judge is one you trust, you don't have to drive the loop by hand either: point a coding agent at the MCP server and let it climb the score for you.

FAQ

What is LLM-as-a-judge?

Using one language model to score the output of another against a written rubric. The response and the grading criteria go to a model, which returns a score and a short rationale. It's used for open-ended outputs (summaries, answers, extractions) where exact-match checks don't work.

Is LLM-as-a-judge reliable?

Reliable enough to rank prompt versions and catch regressions when it's set up carefully: a specific rubric, a 1–5 scale with anchored points, a capable judge model, and a one-time calibration against human labels. It is not a reliable absolute measure of quality, and it carries known biases you design around rather than remove.

Which model should I use as the judge?

A capable frontier model with enough token headroom to finish reasoning and still emit a full response. A judge weaker than the model being graded misses subtle errors. Avoid grading a model with a judge from its own family, because of self-preference bias.

What is a judge-only run?

A run that scores outputs you already have (production logs, hand-curated examples, another system's results) with no generation step. You point the run at the column holding the outputs, define your metrics, and the judge scores each row.


Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
