00 Overview
Tests for prompts. The actual ones, with scores.
A quick orientation: what CompletionKit does, the four moves of the loop, and where to start.
CompletionKit runs every prompt against real data, scores every output with an LLM judge against criteria you define, lets you calibrate that judge against your own taste, and proposes improvements grounded in the cases that scored low. Change anything (the prompt, the model, the temperature, the dataset), re-run, and see exactly what got better and what broke.
The four moves: build the prompt and the rubric, run it against your dataset, trust the judge by calibrating it against your own verdicts, and improve the prompt with AI-suggested edits grounded in the cases that scored low. You set what good means; the loop does the bench-pressing.
01 Build
Write the prompt. Build the dataset. Define what good means.
The three artifacts you need before you can run anything.
Write a prompt template with {{placeholders}} for the variable inputs. Pick a model. Upload a real dataset where the column headers match the placeholders. CompletionKit merges them, one prompt per row.
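The merge step can be pictured with a minimal sketch. It assumes placeholders are plain `{{name}}` tokens that map one-to-one to CSV column headers; the function names (`render`, `merge`) and the sample data are hypothetical and are not CompletionKit's own code.

```python
import csv
import io
import re

# Matches {{name}} with optional inner whitespace (an assumption about the syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict[str, str]) -> str:
    """Replace each {{name}} with the value from the row's matching column."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"no dataset column for placeholder {name!r}")
        return row[name]

    return PLACEHOLDER.sub(substitute, template)


def merge(template: str, csv_text: str) -> list[str]:
    """Produce one rendered prompt per dataset row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [render(template, row) for row in reader]


# Illustrative template and dataset, invented for this sketch.
template = "Reply to {{customer_name}} about: {{issue}}"
dataset = "customer_name,issue\nAda,late delivery\nLin,double charge\n"
prompts = merge(template, dataset)
```

Raising on a missing column, rather than leaving the token in place, is a design choice for the sketch: a header that does not match a placeholder is caught before any generation is paid for.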
Pick the metrics that catch what matters for your output. The five most universal ones (Correctness, Instruction following, Format compliance, Tone, Conciseness) ship as a metric starter pack on the metrics page: one click adopts it, and you can edit the 1-to-5 rubric bands before it goes live. The rubric is what an LLM judge will apply to every response, so band wording matters: "1: factually wrong" is a better cue than "1: bad".
Already have outputs from somewhere else (production logs, a different model, hand-written examples)? Skip the prompt and point the judge at the column that holds them. Same scoring, no generation step. (See judge-only run in Concepts.)
02 Run
Execute. Score. Drill into the rationale.
Three run shapes for three situations.
A regular run generates a response for every dataset row and scores each one with the LLM judge against every metric on the run. You see the per-row scores, the per-metric averages, and a written rationale on every score. Click any response to see the input, the output, and the judge's reasoning side by side.
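As a rough illustration of how per-row scores roll up, the sketch below computes per-metric and per-row averages from a flat list of reviews. The tuple layout and the sample scores are assumptions made for the example; the product's actual storage format is not described here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical review records: (row index, metric name, score on the 1-to-5 scale).
reviews = [
    (0, "Correctness", 5),
    (0, "Tone", 3),
    (1, "Correctness", 2),
    (1, "Tone", 4),
]


def per_metric_average(reviews):
    """Average score for each metric across all rows."""
    scores = defaultdict(list)
    for _row, metric, score in reviews:
        scores[metric].append(score)
    return {metric: mean(values) for metric, values in scores.items()}


def per_row_average(reviews):
    """Average score for each row across all metrics on the run."""
    scores = defaultdict(list)
    for row, _metric, score in reviews:
        scores[row].append(score)
    return {row: mean(values) for row, values in scores.items()}


metric_averages = per_metric_average(reviews)
row_averages = per_row_average(reviews)
```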
A judge-only run grades outputs that already exist without generating anything. Useful for production logs, another model's results, or hand-curated examples.
A regrade-only run re-judges an existing run's responses against the current metric version. Useful when you've tightened the rubric and want to see what scores change without re-spending on generation.
Every run is reproducible: prompt version, dataset, model, and metrics are pinned. Comparing two runs is the same as diffing two sets of per-row scores. The numbers tell you what got better; the rationales tell you why.
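Diffing two runs can be sketched as a comparison of per-row scores keyed by dataset row. This is an illustration under assumptions: the dictionaries, the `diff_runs` name, and the `tolerance` parameter are hypothetical, and both runs are assumed to use the same dataset so row indices line up.

```python
def diff_runs(before: dict[int, float], after: dict[int, float], tolerance: float = 0.0):
    """Return (improved, regressed) lists of (row, delta) for rows present in both runs."""
    improved, regressed = [], []
    for row in sorted(before.keys() & after.keys()):
        delta = after[row] - before[row]
        if delta > tolerance:
            improved.append((row, delta))
        elif delta < -tolerance:
            regressed.append((row, delta))
    return improved, regressed


# Invented per-row averages for two prompt versions on the same dataset.
v1_scores = {0: 3.0, 1: 4.0, 2: 2.0}
v2_scores = {0: 4.0, 1: 3.5, 2: 2.0}
improved, regressed = diff_runs(v1_scores, v2_scores)
```

The regressed list is the one to read first: those are the rows whose rationales explain what the change broke.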
03 Trust
An LLM judge is only useful if you trust its scores.
Calibration is how you measure that.
Click into any review and mark it agree, disagree, or borderline. The verdicts roll up into the trust panel on the metric page: a Wilson confidence interval that tells you how much signal you have, and a quadratic-weighted kappa that tells you how aligned you and the judge are on a 1-to-5 ordinal scale.
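For readers who want to see the two statistics concretely, here is a sketch using the standard formulas. How CompletionKit maps agree, disagree, and borderline verdicts onto these inputs is not specified above, so the example assumes the Wilson interval is taken over the share of "agree" verdicts and that kappa is computed from paired 1-to-5 scores (yours and the judge's). Both function names are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no verdicts yet, so no signal
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


def quadratic_weighted_kappa(a: list[int], b: list[int], k: int = 5) -> float:
    """Agreement between two raters on a 1..k ordinal scale, penalizing
    disagreements by the squared distance between the two scores."""
    n = len(a)
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - 1][y - 1] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_a[i] * hist_b[j] / n  # expected count under independence
    return 1.0 - num / den if den else 1.0


# Invented example: 8 "agree" verdicts out of 10 reviews.
low, high = wilson_interval(8, 10)
```

With only ten verdicts the interval is wide, which is the point of reporting it: a high agreement rate on a handful of reviews is weak evidence.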
Cases you marked disagree get promoted to Cases to learn from, a queue you address by tightening the rubric on the metric page so future runs catch the same thing. The auto-suggest improvement loop reads those cases when it drafts a rubric rewrite.
Calibration runs against the metric's current version. Edit the rubric, publish a new metric version, and the calibration signal scopes to that new version. Old reviews carry a stale-version chip so you can tell which scores were judged under the old rubric and which under the new.
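The version scoping described above can be sketched as a simple filter: each review records the metric version it was judged under, and anything not on the current version is treated as stale. The `Review` fields and the `split_by_staleness` helper are hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    response_id: int
    metric_version: int  # rubric version the judge applied (assumed integer)
    score: int           # 1-to-5


def split_by_staleness(reviews: list[Review], current_version: int):
    """Separate reviews judged under the current rubric from stale ones."""
    current = [r for r in reviews if r.metric_version == current_version]
    stale = [r for r in reviews if r.metric_version != current_version]
    return current, stale


# Invented reviews: two judged under version 1, one under version 2.
reviews_by_version = [Review(1, 1, 4), Review(2, 1, 2), Review(3, 2, 5)]
current, stale = split_by_staleness(reviews_by_version, current_version=2)
```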
04 Improve
Move the score. AI suggestion, or hand it to an agent.
Two paths to improvement, both grounded in your real data.
When the trust panel shows alignment but the score is low, the prompt has work to do. CompletionKit looks at the responses that scored low and proposes a concrete edit, grounded in the judge's actual rationale on your data rather than generic "be more specific" advice. The suggestion lands on the metric page as an inline draft with a per-band side-by-side diff; you accept what you want and edit what you don't.
When the same job feels mechanical, hand it to a coding agent. Every move in the loop (create a prompt revision, kick off a run, read the scores and per-row rationales, propose a new version, run again) is also an HTTP call and an MCP tool. Point Claude Code or Cursor at your org's MCP endpoint with the bearer key from Settings → API and the agent drives the loop on its own. A typical run looks like v1 3.4 → v2 4.3 → v3 4.7: three iterations in twelve minutes, zero humans in the inner loop.
Because each iteration is grounded in your real data and the judge's actual rationale, improvements compound. You set what good means; you approve the diff that ships; the agent does the bench-pressing.
No SDK to install. No glue code. The agent reads the same REST + MCP surface you'd use yourself.
05 Three ways to run it
Same product. Pick the deployment that fits.
Cloud is the fastest start. The standalone app is for self-hosting on your own infra. The engine is a gem you mount inside an existing Rails app. Same code underneath: runs, prompts, metrics, and the REST + MCP surface are identical across all three.
Cloud (hosted)
We host it for you. Sign up, paste your provider keys, run your first eval in minutes. No infra to run.
Best for: getting started fast or skipping ops.
Standalone app (self-host)
Clone the bundled Rails app and run it on your own server. Web + worker process, Postgres, your provider keys.
Best for: teams that need their data and infra in-house.
Rails engine (embed)
Already have a Rails 8 app? Add gem "completion-kit", run the generator, and it mounts alongside your code.
Best for: embedding the loop in a product you already own.
06 Concepts
The pieces that make up a run.
A short glossary so you can read the rest of the docs (and the UI) without guessing what's what. The full glossary has longer entries plus origin facts for each concept.
- Prompt
- A versioned template plus model selection. The template is the text you send to an LLM, with {{placeholders}} for variable inputs. Editing a published prompt forks a new version automatically. Old versions stay around, fully runnable. Each published version has a stable URL your app fetches at runtime.
- Dataset
- A CSV of inputs you run a prompt against. Placeholder names in the template match column headers in the CSV; CompletionKit merges them, one prompt per row. Datasets are reusable across runs and versions, which is what makes regression testing across prompt changes possible at all.
- Run
- One execution of a prompt against a dataset, producing one response per row. A run is the unit of work, reproducible by its prompt version, dataset, model, and metrics. Each run gets an average score across all responses; you compare runs by diffing their scores.
- Judge-only run
- A run that grades an existing dataset column instead of generating anything. Use it when you've already got outputs (production logs, results from another system, hand-curated examples) and you want the LLM judge to score them against your metrics. Same scoring, same per-row review, no generation step.
- Regrade-only run
- A run that re-judges an existing run's responses against the current metric version, without re-generating any output. Useful when you've tightened the rubric and want to see how the scores change without re-spending on generation.
- Response
- One LLM output for one row of the dataset. Each response is judged against every metric on the run and gets per-metric scores from the LLM judge plus a per-response average. Drill into any response to see the input, the output, and the judge's reasoning.
- Metric
- A scoring criterion you define: whatever good means for your product (empathy, clarity, accuracy, format compliance, policy adherence), with a 1-to-5 rubric describing what each score looks like. The LLM judge applies your rubric, not a generic "is this good?" prompt.
- Starter metric
- One of five preconfigured rubrics on the metrics page (Correctness, Instruction following, Format compliance, Tone, Conciseness). One click adopts it as a real metric in your org; you can edit the 1-to-5 bands before it goes live.
- Metric version
- Every change to a metric's rubric publishes a new metric version. Reviews carry the metric version they were judged under, so when you tighten the rubric, old reviews are flagged stale and the run page surfaces a "Re-run with current judge" action.
- Review
- The judge's evaluation of one response against one metric version. It has a score (1 to 5) and a written rationale. Reviews are what make scores defensible: click into any score and you see why the judge landed there. Disagree, and that's a signal to calibrate the metric.
- Verdict
- Your call on a review: agree, disagree, or borderline. Verdicts roll up into the trust panel as signal that the judge is (or isn't) scoring the way you would.
- Trust panel
- The metric page's calibration surface. Shows a Wilson confidence interval (how much signal you have) and a quadratic-weighted kappa (how aligned you and the judge are on the 1-to-5 ordinal scale) computed from your verdicts on the current metric version.
- Version
- Every change to a published prompt forks a new version. Old versions remain runnable and addressable: you can diff v3 vs v4, replay an old run, or revert. Promoting a version makes it the one your app picks up at the prompt's stable URL on the next request.
- Tag
- A label you attach to prompts, datasets, runs, or metrics. Useful for grouping across the org: production prompts vs experiments, model-specific datasets, support-category metrics. Tag filters work across every index page; one tag scopes the whole list.
- API key
- A bearer token that authenticates calls to your org's REST API and MCP server. Create one under Settings → API in-app and send it as Authorization: Bearer <key>. Revocable at any time; we never see your provider keys (you bring those).
- MCP server
- A built-in Model Context Protocol endpoint exposing 35+ tools (runs_create, prompts_update, metric_versions_publish, …) over streamable HTTP. Connect Claude Code, Cursor, or any MCP client and drive the whole eval loop from a chat or coding agent.
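The API key entry above says the key travels as an `Authorization: Bearer <key>` header. A minimal sketch of building such a request with Python's standard library follows. Only the header format comes from this page; the host, the `/runs` path, and the payload fields are hypothetical placeholders, so check the API reference for the real routes.

```python
import json
import urllib.request

# Hypothetical values: the host, path, and payload fields are placeholders,
# not CompletionKit's documented routes.
BASE_URL = "https://completionkit.example/api"


def build_request(api_key: str, path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # header format from the API key entry
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("YOUR_API_KEY", "/runs", {"prompt_id": 1, "dataset_id": 1})
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```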
07 Next
Pick a place to start.
The fastest path depends on whether you want to host it yourself or skip that step.
08 Learn more
Read deeper.
Long-form pieces on the concepts the docs introduce.
Concepts
What is prompt evaluation? →
Definition, why it isn't observability, and what a run and a metric really are.
Build
How to build an eval dataset →
How to source real inputs, how many rows you need, when to use expected outputs vs an LLM judge, and why holding the dataset fixed is what makes the baseline mean anything.
Trust
LLM-as-a-judge →
How a model can score another model's output, the biases it carries, and how to set it up so the number is one you can trust.
Run
Prompt regression testing →
Why a fixed eval dataset re-run on every change is the LLM-era equivalent of a regression test suite. How to set one up.
Versioning
Prompt versioning →
Immutable revisions, diffs, parent versions. Why treating a prompt like code is the move.
Improve
The automated eval loop →
Plug Claude Code or Cursor into the MCP server and let the agent climb the score on its own.