Glossary

Definitions for the concepts CompletionKit is built on. Each one is covered at greater length in a blog post; this page is the short version, in one place, for when you want a quick anchor.

Prompt evaluation: Scoring a prompt against a fixed dataset of inputs before you ship it, to know whether a change improved the prompt or regressed it. Prompt evaluation is the practice of running a prompt against a held-out set of real inputs, scoring each output against the criteria you actually care about, and comparing the scores to a previous run. The point is the comparison: any change to the prompt either clears the previous baseline or doesn't. Without it, you ship a change and find out from the next customer screenshot. The closest analogue from traditional software is unit testing, except the things you're checking are tone, factual accuracy, instruction-following, and format compliance rather than equality.
LLM judge: A language model used to score another language model's output against a rubric. The fastest way to get a number for an open-ended response. An LLM judge is a language model that grades responses from another model. You give it a rubric (a 1-5 scale with verbal descriptions of what each score looks like) and a response, and it returns the score plus a written rationale. LLM judges are reliable on tone, format compliance, and instruction-following; weaker on precise factual recall and math. The pattern was formalised in the June 2023 Berkeley paper 'Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena' (arxiv.org/abs/2306.05685), which documented ~80% agreement with human raters on the chat-quality task it studied.
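
The rubric-in, score-plus-rationale-out shape described above can be sketched roughly as follows. This is a minimal illustration, not CompletionKit's implementation: `RUBRIC`, `build_judge_prompt`, `judge_response`, and the JSON reply format are all assumptions, and `fake_model` is a stand-in for whatever model client you actually call.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical rubric: one verbal description per score on a 1-5 scale.
RUBRIC: Dict[int, str] = {
    1: "Ignores the customer's question entirely.",
    2: "Addresses the question but with the wrong tone.",
    3: "Answers the question; tone is acceptable.",
    4: "Answers the question clearly and politely.",
    5: "Answers clearly, politely, and cites the order number.",
}


def build_judge_prompt(rubric: Dict[int, str], response: str) -> str:
    """Assemble the text sent to the judge model (format is an assumption)."""
    levels = "\n".join(f"{score}: {text}" for score, text in sorted(rubric.items()))
    return (
        "Score the response against this rubric.\n"
        f"{levels}\n\nResponse:\n{response}\n\n"
        'Reply as JSON: {"score": <int>, "rationale": "<why>"}'
    )


def judge_response(
    call_model: Callable[[str], str], rubric: Dict[int, str], response: str
) -> Tuple[int, str]:
    """Ask the judge for a verdict and validate it against the rubric."""
    verdict = json.loads(call_model(build_judge_prompt(rubric, response)))
    score = int(verdict["score"])
    if score not in rubric:
        raise ValueError(f"judge returned a score outside the rubric: {score}")
    return score, str(verdict["rationale"])


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return json.dumps({"score": 4, "rationale": "Clear and polite, no order number."})


score, rationale = judge_response(fake_model, RUBRIC, "Your parcel ships tomorrow.")
print(score, rationale)
```

Validating the score against the rubric's keys is a cheap guard: a judge that replies with a 0 or a 7 fails loudly instead of skewing the average.
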
Prompt versioning: Treating each change to a prompt as a new immutable revision with a stable identifier, a parent, and a readable diff, the same way you treat source code. Prompt versioning is treating a prompt the way you treat source code: every change creates a new immutable revision with a stable identifier, a parent revision it descended from, and a readable diff. You can fetch a specific version, compare two versions, and roll back without rewriting history. The mechanics come from a 50-year lineage in source control (SCCS in 1972, RCS in 1982, Git in 2005); the move is to apply the same discipline to the prompt string your model reads.
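
As a rough illustration of "immutable revision, stable identifier, parent, readable diff", here is a minimal sketch using only the Python standard library. The `PromptRevision` class and the hash-based identifier are assumptions made for the example; they are not how CompletionKit stores versions.

```python
import difflib
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PromptRevision:
    text: str
    parent_id: Optional[str]

    @property
    def revision_id(self) -> str:
        # Stable identifier derived from the parent id and the prompt text,
        # so the same content under the same parent always gets the same id.
        payload = f"{self.parent_id or ''}\n{self.text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def diff_revisions(old: PromptRevision, new: PromptRevision) -> str:
    """Return a unified diff between the prompt text of two revisions."""
    lines = difflib.unified_diff(
        old.text.splitlines(),
        new.text.splitlines(),
        fromfile=old.revision_id,
        tofile=new.revision_id,
        lineterm="",
    )
    return "\n".join(lines)


v1 = PromptRevision("Answer politely.\nCite the order number.", parent_id=None)
v2 = PromptRevision(
    "Answer politely.\nCite the order number and the date.",
    parent_id=v1.revision_id,
)
print(diff_revisions(v1, v2))
```

Rolling back in this model means creating a new revision whose text matches an older one, which keeps history append-only.
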
Prompt regression testing: Re-running a fixed evaluation dataset against every prompt change and rejecting any change that drops the average below a baseline. Prompt regression testing is the LLM-era equivalent of a regression test suite. You hold a dataset of real inputs fixed, you re-run the prompt against it on every edit, and you reject any change that drops the per-row scores below the previous best. The dataset is the bar; the eval is the gate. Without it, fixing one input silently breaks twelve others, and you find out from customer screenshots over the next quarter.
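
The gate can be sketched as a comparison between two score tables keyed by dataset row. This is a minimal sketch under stated assumptions: `regression_gate`, its `tolerance` parameter, and the row ids are hypothetical. It gates on the average and separately lists the rows that got worse, since the entry above mentions both views.

```python
from statistics import mean
from typing import Dict, List, Tuple


def regression_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Tuple[bool, List[str]]:
    """Return (passed, regressed_rows) for a candidate run versus a baseline."""
    # The dataset must be identical across runs, or the comparison means nothing.
    if baseline.keys() != candidate.keys():
        raise ValueError("dataset changed between runs; baseline is not comparable")
    # Rows whose score dropped, even if the average still holds.
    regressed = sorted(row for row in baseline if candidate[row] < baseline[row])
    # Gate on the average: the candidate must not fall below the baseline mean.
    passed = mean(candidate.values()) >= mean(baseline.values()) - tolerance
    return passed, regressed


baseline_scores = {"row-1": 4, "row-2": 5, "row-3": 3}
candidate_scores = {"row-1": 5, "row-2": 3, "row-3": 3}
passed, regressed_rows = regression_gate(baseline_scores, candidate_scores)
print(passed, regressed_rows)
```

Here the edit improved one row but broke another badly enough to pull the average down, so the gate rejects it and names the row to inspect.
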
Evaluation dataset: A fixed set of real inputs (and optionally expected outputs) you score a prompt against. The bar a change has to clear. An evaluation dataset is a fixed list of inputs you run a prompt against, optionally paired with expected outputs or a judge column. 'Fixed' matters: if the dataset moves under you, the baseline means nothing. Real inputs from production logs and support tickets beat synthetic ones because regressions tend to live in the long tail. Size matters less than coverage; 30 well-chosen rows hitting the edge cases you actually care about beat 1,000 generic ones.
Metric: A scoring criterion you define for one quality of the output. Either a 1-5 rubric the judge applies to every row, or a deterministic check that passes or fails with no model call. A metric in CompletionKit is one quality you care about. Most are a rubric: you describe what each 1-5 score looks like and the judge applies it to every response in a run, returning a score plus a rationale. Some are deterministic checks (valid JSON, contains a required token, matches a regex, no refusal) that pass or fail exactly, with no LLM call. You'll typically have three to five metrics per prompt, each catching a different failure mode. Specific beats general: 'Output acknowledges the customer's order number' is a better metric than 'is the response helpful'.
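
The deterministic checks named above are simple enough to sketch directly. The function names, the order-number pattern, and especially the `REFUSAL_PHRASES` list are assumptions for illustration; a real refusal check would need phrases tuned to your own model's output.

```python
import json
import re
from typing import Callable, Dict

# Hypothetical phrase list; tune it to the refusals your model actually produces.
REFUSAL_PHRASES = ("i can't help", "i cannot help", "i'm unable to")


def is_valid_json(output: str) -> bool:
    """Pass if the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def contains_token(output: str, token: str) -> bool:
    """Pass if a required token appears verbatim in the output."""
    return token in output


def matches_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def no_refusal(output: str) -> bool:
    """Pass if none of the known refusal phrases appear (case-insensitive)."""
    lowered = output.lower()
    return not any(phrase in lowered for phrase in REFUSAL_PHRASES)


def run_checks(output: str, checks: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Apply every named check to one output; no model call is involved."""
    return {name: check(output) for name, check in checks.items()}


sample = '{"reply": "Your order #48213 has shipped."}'
results = run_checks(
    sample,
    {
        "valid_json": is_valid_json,
        "has_order_number": lambda out: matches_regex(out, r"#\d{5}"),
        "no_refusal": no_refusal,
    },
)
print(results)
```
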
Judge calibration: The practice of tagging the judge's scores as agree / disagree / borderline so you can measure whether the judge is grading the way you actually want. Judge calibration is how you measure whether the LLM judge's scoring matches yours. You review the judge's per-row verdicts and mark them agree, disagree, or borderline. A trust panel surfaces the resulting Wilson confidence interval and the quadratic-weighted kappa between you and the judge. Disagreements drive concrete rewrites that tighten the rubric; the trust score tells you whether to act on the judge's numbers yet.
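
For reference, both statistics named above have standard textbook forms, sketched here in plain Python. How CompletionKit feeds its agree / disagree / borderline tags into them is not stated on this page, so the inputs are assumptions: the Wilson interval is taken over a count of "agree" tags, and the kappa over paired 1-5 scores from a human and the judge.

```python
import math
from typing import List, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def quadratic_weighted_kappa(
    human: List[int], judge: List[int], min_score: int = 1, max_score: int = 5
) -> float:
    """Agreement between two raters, penalising disagreements by squared distance."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    k = max_score - min_score + 1
    n = len(human)
    # Observed counts: rows are human scores, columns are judge scores.
    observed = [[0] * k for _ in range(k)]
    for h, j in zip(human, judge):
        observed[h - min_score][j - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_j = [sum(observed[i][c] for i in range(k)) for c in range(k)]
    num = den = 0.0
    for i in range(k):
        for c in range(k):
            weight = (i - c) ** 2 / (k - 1) ** 2
            num += weight * observed[i][c]
            den += weight * hist_h[i] * hist_j[c] / n  # expected count under chance
    if den == 0:
        # Both raters used one identical score throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement by convention.
        return 1.0
    return 1 - num / den


low, high = wilson_interval(successes=8, total=10)
kappa = quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 5, 5])
print(round(low, 3), round(high, 3), round(kappa, 3))
```

The Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters when you have only tagged a few dozen rows.
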
Automated prompt optimization loop: Letting an agent (Claude Code, Cursor, anything MCP-speaking) drive the eval-revise-rerun loop on its own until the score clears a bar you set. An automated prompt optimization loop is what you get when a coding agent drives the eval loop instead of you: draft a prompt revision, run it against the dataset, read the scores and per-row rationales, revise, repeat. You bring the rubric and the target score; the agent does the iteration. The shape is hill climbing over a landscape of prompt scores; the connector that made it tractable is the Model Context Protocol (open-sourced by Anthropic on November 25, 2024).
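
The hill-climbing shape described above can be sketched as a loop that keeps a revision only when it scores higher. Everything here is hypothetical: `optimize_prompt` is an illustrative name, and `toy_evaluate` and `toy_revise` are deterministic stand-ins for what would really be a dataset run and an agent's rewrite.

```python
from typing import Callable, Tuple


def optimize_prompt(
    prompt: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    target: float,
    max_rounds: int = 10,
) -> Tuple[str, float]:
    """Greedy hill climbing: accept a revision only if its score improves."""
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_rounds):
        if best_score >= target:
            break  # the bar is cleared; stop iterating
        candidate = revise(best_prompt, best_score)
        score = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Toy stand-ins so the sketch runs offline.
REQUIRED = ["order number", "refund policy", "apology"]


def toy_evaluate(prompt: str) -> float:
    # Maps "how many required topics the prompt mentions" onto a 1-5 scale.
    return 1 + 4 * sum(topic in prompt for topic in REQUIRED) / len(REQUIRED)


def toy_revise(prompt: str, score: float) -> str:
    # Adds an instruction for the first topic the prompt does not mention yet.
    for topic in REQUIRED:
        if topic not in prompt:
            return f"{prompt} Mention the {topic}."
    return prompt


final_prompt, final_score = optimize_prompt(
    "You are a support agent.", toy_evaluate, toy_revise, target=5.0
)
print(final_score)
print(final_prompt)
```

The `max_rounds` cap is there because greedy hill climbing can stall on a plateau; without it, a loop whose revisions never improve the score would run forever.
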