June 13, 2026 · The CompletionKit team

Let your coding agent improve the judge itself

If a coding agent already runs the prompt loop for you, it can run the loop on the judge too: read where the judge disagrees with humans, draft rubric rewrites aimed at those disagreements, score the rewrites against the same human verdicts, and promote a rewrite only when it agrees with those verdicts more often than the current version does. Three tools on the CompletionKit MCP server close that loop: agreements_list surfaces the disagreements, metrics_suggest_variants drafts candidate metric versions targeted at them, and judges_compare returns a recommend or hold verdict by comparing the two versions' agreement statistics on your stored verdicts.

The earlier post on the automated eval loop covered the agent improving the prompt. This one is its sibling. The agent edits a different artifact (the rubric, not the prompt) and climbs a different score (agreement with humans, not the judge's own number), but the shape of the work is the same.

Why the judge needs a loop of its own

The prompt loop climbs whatever hill the judge defined. If the judge is a little off, marking down a row a human would mark up, or shrugging at a row a human would fail, the agent will happily optimize the prompt away from that case. The score on your dashboard goes up. The product gets worse. We have seen exactly this in our own runs: a prompt that scores well on a verbosity-blind judge will lengthen until every answer is a paragraph.

The fix is not to throw out LLM-as-a-judge. It is to do for the judge what you already do for the prompt: keep a fixed measuring stick (in this case, recorded human verdicts), make small edits, and only ship an edit that does better on the measuring stick. The agreement number is the measuring stick the judge optimizes against, the same way the eval score is the measuring stick the prompt optimizes against. Two loops, one shape.

Grading the graders

The science of grading the graders is old. The reference point is Jacob Cohen's 1960 paper "A Coefficient of Agreement for Nominal Scales", published in Educational and Psychological Measurement, which introduced what we now call Cohen's kappa: a number from -1 to 1 that says how much two raters agree on a categorical judgement, corrected for how often they would have agreed by chance. The point was that "we both said 5" is not impressive if both raters say 5 most of the time anyway. Cohen followed up in 1968 with a weighted version, so that a 5-versus-4 disagreement on an ordinal scale is treated as a smaller miss than a 5-versus-1 disagreement. CompletionKit uses the quadratic-weighted form when it summarises judge-versus-human verdicts on the 1–5 rubric scale.
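To make the weighting concrete, here is a minimal sketch of quadratic-weighted kappa on a 1–5 scale. It follows the textbook definition, not CompletionKit's source; the function name, the argument names, and the convention for the degenerate case are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(human, judge, min_score=1, max_score=5):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale.

    Illustrative sketch of the textbook formula; names are hypothetical.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")

    n = len(human)
    span = max_score - min_score  # largest possible gap between two scores
    observed = Counter(zip(human, judge))
    human_marginal = Counter(human)
    judge_marginal = Counter(judge)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for a in range(min_score, max_score + 1):
        for b in range(min_score, max_score + 1):
            # A 5-versus-4 miss costs 1/16; a 5-versus-1 miss costs 16/16.
            weight = ((a - b) ** 2) / (span ** 2)
            observed_disagreement += weight * observed[(a, b)] / n
            expected_disagreement += (
                weight * (human_marginal[a] / n) * (judge_marginal[b] / n)
            )

    if expected_disagreement == 0:
        # Both raters gave one identical score throughout; kappa is undefined.
        # Returning 1.0 is a convention chosen for this sketch only.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


human = [5, 5, 4, 3, 2, 1]
near_miss = quadratic_weighted_kappa(human, [4, 5, 4, 3, 2, 1])
far_miss = quadratic_weighted_kappa(human, [1, 5, 4, 3, 2, 1])
print(near_miss > far_miss)  # the one-point miss is penalised less
```

The only difference between the two judge lists is the first score, so the gap between the two kappa values comes entirely from the weight.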

The Berkeley MT-Bench paper from June 2023 ported that whole idea to language models, with one sharp move: a judge is credible to the extent it agrees with a human as often as two humans do. The earlier post on LLM-as-a-judge walks through the result if it is new to you. What this loop does is take that same agreement number and turn it into a feedback signal for the rubric itself. Write a new version, score it against the recorded verdicts, keep it only if the agreement and kappa improve over the current version. That is what judges_compare automates.

The three tools by name

The MCP server exposes the full toolbox (the setup guide lists every resource family and links to the canonical API reference). Three of those tools are the ones that close this loop, and they are the ones an agent calls in order.

agreements_list returns the per-row verdicts a human recorded against the judge: agree, disagree, or borderline, optionally with a corrected score and a note. Filter by metric_id and the result is every row where you, or another operator on the org, said the judge was right or wrong on that metric. This is the input the rewrite has to address. A rewrite that does not look at the disagreements is just a reword.
metrics_suggest_variants asks a model to draft up to three variants of the metric's judge instruction, targeted at the recent disagreements. Each variant is saved as a draft MetricVersion with source: "suggestion", so it is addressable, reviewable, and easy to dismiss. The tool documents this directly: "One focused rewrite beats five reworded copies."
judges_compare takes two metric versions and returns the agreement statistics for each (agreement rate with confidence interval, mean absolute error, quadratic-weighted kappa, sample size), the deltas between them, and a structured recommendation: recommend, hold, no_change, or need_more_data. The thresholds are conservative on purpose: combined sample size below 30 returns need_more_data, and the lift in agreement has to clear 3 percentage points before it is called a recommend.
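The two thresholds above can be sketched as a small decision function. This is a guess at the shape of the rule, not CompletionKit's implementation. The 30-verdict floor and the 3-point lift come from the description above; the split between hold and no_change, the treatment of a lift of exactly 3 points as a recommend, and every name in the code are assumptions. The sketch also ignores kappa, mean absolute error, and the confidence interval, which the real tool reports.

```python
from dataclasses import dataclass

MIN_COMBINED_SAMPLE = 30    # stated in the post
MIN_AGREEMENT_LIFT = 0.03   # 3 percentage points, stated in the post


@dataclass
class VersionStats:
    """Hypothetical summary of one metric version's agreement with humans."""
    agreement_rate: float  # fraction of verdicts marked agree, in [0, 1]
    sample_size: int       # number of human verdicts behind that rate


def recommendation(current: VersionStats, candidate: VersionStats) -> str:
    # Too few verdicts across both versions: the rates are mostly noise.
    if current.sample_size + candidate.sample_size < MIN_COMBINED_SAMPLE:
        return "need_more_data"

    lift = candidate.agreement_rate - current.agreement_rate
    if lift >= MIN_AGREEMENT_LIFT:
        return "recommend"
    # Assumption: identical rates are reported as no_change, anything else
    # short of the bar as hold. The post does not spell out this split.
    if abs(lift) < 1e-9:
        return "no_change"
    return "hold"


print(recommendation(VersionStats(0.70, 20), VersionStats(0.80, 20)))
```

Checking the sample size before the lift mirrors the ordering in the post: a large apparent lift on a thin sample never reaches the recommend branch.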

A recommend from judges_compare is the loop's stop condition. When the agent gets one, it calls metric_versions_publish on the candidate and hands you the result; a hold or no_change sends it back to draft another variant or, after a few tries, to give up and ask. need_more_data means stop and grade more rows: there is no shortcut around that one.

A brief you can hand the agent

The brief is short on purpose. The point is to name the bar and the stop condition, then walk away.

Improve the judge on metric "acknowledges-order-number" (metric_id 42). Read the recent disagreements with agreements_list. Use metrics_suggest_variants to draft up to three candidate metric versions targeted at them. For each candidate, call judges_compare against the current version. If any candidate returns recommend, publish it with metric_versions_publish and stop. If all candidates return hold or no_change, stop and report which one came closest. If any candidate returns need_more_data, stop and tell me how many more verdicts are needed.

Three things are doing the work in that brief. The metric is named, so the agent cannot wander. The stop condition is built out of judges_compare's own verdict types, so there is no fuzzy "good enough" the agent gets to interpret. And publishing only happens on recommend, so the worst case is an unpublished draft you can read, not a regression in production.

Four moves in order. The agent only publishes on recommend, and only once it has 30 combined verdicts to compare against.
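Read as control flow, the brief looks roughly like the sketch below. The four tool names are the ones named in this post; everything else is hypothetical, including the call_tool helper (standing in for however your agent invokes MCP tools), the argument names other than metric_id, and the field names in the responses. Check the API reference for the real shapes.

```python
def run_judge_loop(call_tool, metric_id, current_version_id, max_candidates=3):
    """Sketch of the brief as code. `call_tool(name, args)` is a hypothetical
    helper that invokes one MCP tool and returns its parsed result."""
    # Move 1: read the human verdicts and keep the rows the judge got wrong.
    rows = call_tool("agreements_list", {"metric_id": metric_id})
    misses = [row for row in rows if row["verdict"] in ("disagree", "borderline")]
    if not misses:
        return {"status": "nothing_to_fix"}

    # Move 2: draft candidate metric versions aimed at those rows.
    candidates = call_tool("metrics_suggest_variants", {"metric_id": metric_id})

    held = []
    for candidate in candidates[:max_candidates]:
        # Move 3: score the candidate against the current version.
        comparison = call_tool("judges_compare", {
            "current_version_id": current_version_id,  # assumed argument name
            "candidate_version_id": candidate["id"],   # assumed argument name
        })
        verdict = comparison["recommendation"]         # assumed field name
        if verdict == "need_more_data":
            return {"status": "need_more_data"}
        if verdict == "recommend":
            # Move 4: publish only on recommend, then stop.
            call_tool("metric_versions_publish", {"version_id": candidate["id"]})
            return {"status": "published", "version_id": candidate["id"]}
        held.append((comparison.get("agreement_delta", 0.0), candidate["id"]))

    if not held:
        return {"status": "no_candidates"}
    best_delta, best_id = max(held)  # hold or no_change: report the closest
    return {"status": "held", "closest_version_id": best_id,
            "agreement_delta": best_delta}
```

Every exit path other than recommend returns without calling metric_versions_publish, which is the property the brief relies on.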

What this is not

It is worth being precise. This loop is not "the judge tunes itself". The judge is graded against verdicts a human recorded, not against its own outputs, so a rewrite cannot win by drifting toward the model's preferred style. The candidate has to match the labels in agreements_list, and those labels do not move when the judge does.

It is also not a substitute for keeping the rubric versioned and small. judges_compare's recommend is a green light at the rubric level; it does not stop you from publishing a rubric that is internally inconsistent, or one that drifts so far from the original metric that prior scores stop being comparable. The same discipline from prompt versioning ports across: keep the change small enough that the diff is readable, and keep the addressable history so reverting is one publish away.

Where it goes wrong

Thin sample sizes. Below 30 combined verdicts, judges_compare returns need_more_data on purpose, because below that the agreement rate is mostly noise. The fix is not to lower the threshold; it is to record more verdicts. A scheduled fifteen-minute grading session against the most recent run beats a heroic backfill nobody does.
Drifting human verdicts. If the people grading change their mind about what good means (the rubric in their heads moves), the measuring stick moves with them, and a recommend from judges_compare just means the judge tracked the drift. Re-grading a small fixed set of rows every few weeks against the current rubric is how you catch this.
Treating need_more_data as a failure. It is the most useful verdict the tool returns. It says the work is real, and it tells you where the loop is actually bottlenecked: not on the judge, on the labels.
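To see why thin samples are mostly noise, it helps to put a confidence interval around the agreement rate. The sketch below uses the Wilson score interval, a standard choice for proportions; the post does not say which interval judges_compare reports, so the method, the function name, and the example counts are all assumptions.

```python
import math


def wilson_interval(agreements, total, z=1.96):
    """Wilson score interval for an agreement rate (z=1.96 is roughly 95%).

    Illustrative only; not necessarily the interval judges_compare uses.
    """
    if total <= 0 or not 0 <= agreements <= total:
        raise ValueError("need 0 <= agreements <= total and total > 0")
    p = agreements / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# The same 80% agreement rate at two hypothetical sample sizes.
for agreements, total in [(24, 30), (240, 300)]:
    low, high = wilson_interval(agreements, total)
    print(f"n={total}: {low:.3f} to {high:.3f}")
```

At 30 verdicts the interval is many times wider than a 3-point lift, so a difference that small cannot be told apart from noise; at 300 it is far narrower.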

The point

Two loops, one shape. One climbs the score by editing the prompt. The other climbs the agreement number by editing the rubric. Run the second one occasionally and the first one stays honest. Skip it for long enough and the dashboard goes up while the product goes the other way.

FAQ

How does a coding agent improve an LLM judge?

Over MCP, the agent reads the rows where the judge disagrees with stored human verdicts, asks a model to draft rubric rewrites aimed at those disagreements, and then checks whether the rewrite agrees with the same verdicts more than the current version does. In CompletionKit the three tools are agreements_list (read disagreements), metrics_suggest_variants (draft candidate versions), and judges_compare (recommend or hold). The agent promotes a candidate only when judges_compare returns recommend.

Why bother improving the judge if I have a good prompt loop?

Because the prompt loop climbs whatever hill the judge defined. If the judge marks down a row that a human would mark up, the agent will optimize the prompt away from that case, and the score will rise while the product gets worse. Rewriting the judge against stored verdicts is how you confirm that the hill you are climbing is the one you wanted.

How many human verdicts do I need before this is worth running?

judges_compare needs at least 30 combined verdicts across the two versions before it will return recommend or hold; below that it returns need_more_data. The same threshold is a sensible floor in practice. Twenty borderline or disagree rows from a single run are usually enough to start; a hundred verdicts across a few weeks of grading is comfortable.

What stops the agent from optimizing the judge to please itself?

The verdicts. judges_compare scores each candidate against the human-recorded agree, disagree, and borderline labels, not against the model's own behaviour. A rewrite that scores well only because the new judge agrees with the old judge cannot move the agreement number; only matching the human verdicts can. Keep recording verdicts and the measuring stick stays honest.

Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
