Why we built CompletionKit

You ship an AI feature. It works. The demo lands, the launch goes fine, customers start using it. Three weeks later someone forwards you a screenshot: the model said something weird on an input you'd never seen. You open the prompt, add a sentence to handle it, push.

And then the question that should be keeping you up at night: did that one-sentence change just break twelve cases that were working fine?

You don't know. You can't know — not without re-running the prompt against a representative set of inputs and checking whether the outputs got better, worse, or sideways. Most teams don't do that, because doing it by hand is miserable and doing it well requires infrastructure most teams never set up. So they ship the change, hope, and find out from the next screenshot.

That gap — the distance between "I changed the prompt" and "I know the change was safe" — is why we built CompletionKit.

Two different questions

There's a question you ask while you're building a prompt: is this good enough to ship? You iterate, you eyeball outputs, you tweak, and eventually it's good and you ship it. Plenty of tools help with this. It's the fun part.

There's a different question you ask for the entire rest of the product's life: it's live and I need to change it — is it still good after I changed it? A customer hit an edge case. A new model came out and you want to switch. Someone on the team rephrased a section for clarity. Every one of those is a change to a thing that's in production, and every one of them can silently regress behaviour that was fine. This is the maintenance question, and over a product's lifetime you'll ask it a hundred times for every once you asked the build question.

The build question is a sprint. The maintenance question is the marathon. We think the marathon is underserved.

What it actually takes to answer the maintenance question

To know that a change didn't regress, you need four things, and they have to work together:

  1. A versioned record of the prompt. Not "the current string in the config file" — every revision, with a stable identity, so you can say "v7 behaved like this" and mean it next month.
  2. A baseline. A fixed set of real inputs and the scores the current version gets on them. That's the bar. Any change has to clear it.
  3. Fast re-evaluation. Cheap and quick enough to run on every edit, not a quarterly ritual. If it's slow, nobody does it, and the baseline rots.
  4. A serving layer. The prompt your app runs in production should be the prompt you tested — same versioned artifact, fetched by your app, not a copy that drifts. Otherwise "I tested the change" and "I shipped the change" are two separate things that can quietly diverge.
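
To make the four concrete, here's a minimal sketch of the regression gate they add up to. Every name and shape in it is illustrative rather than CompletionKit's actual API; the scores are the ones from the order-number example further down.

```typescript
// Hypothetical sketch: comparing a candidate prompt's scores against the baseline.
type Scores = Record<string, number>; // criterion name -> average 1–5 score on the fixed input set

// The baseline: what the live prompt version currently earns on real inputs.
const baseline: Scores = { "acknowledges order number": 4.8, "valid JSON": 5.0 };

// What the edited prompt earned on the same inputs.
const candidate: Scores = { "acknowledges order number": 3.1, "valid JSON": 5.0 };

// The gate: a change ships only if no criterion drops below the bar.
function regressions(before: Scores, after: Scores): string[] {
  return Object.entries(before)
    .filter(([criterion, bar]) => (after[criterion] ?? 0) < bar)
    .map(([criterion]) => criterion);
}

const failing = regressions(baseline, candidate);
if (failing.length > 0) {
  console.error(`Blocked: the change regresses ${failing.join(", ")}`);
}
```
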

Now look at the feature matrix on our home page through that lens. Every row is an answer to "what does it take to maintain a prompt, not just write one?"

Multi-provider, and local models too

Over a product's life you will change models. A cheaper one ships, a better one ships, a provider deprecates the one you're on. Each switch is a maintenance event: you need to re-baseline the prompt against the new model and confirm it still clears the bar — sometimes the same prompt that scored 4.6 on one model scores 3.9 on another, and you'd rather find out before your users do. A tool locked to one provider can't help you with the switch; it is the thing you're switching away from. Local models (Ollama, or anything OpenAI-compatible) matter for the same reason plus two more: your dev loop is faster when you're not waiting on a remote API, and you're not paying per token to run your test suite a hundred times a day.
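
To show why "OpenAI-compatible" is the useful property here, this is what calling a local Ollama model over that API looks like. It's plain Ollama usage, not CompletionKit configuration; the model name and prompt are placeholders.

```typescript
// Ollama exposes an OpenAI-compatible endpoint on localhost, so the request
// shape is the same one a hosted provider would accept.
const res = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3.1", // any model you've pulled locally
    messages: [{ role: "user", content: "Summarise this support ticket: ..." }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content); // same response shape a hosted provider returns
```

Running the test suite against that endpoint costs nothing per token, which is what makes "on every edit" a realistic cadence.
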

Custom scoring criteria

Generic metrics — "is it coherent", "is it relevant" — won't catch your regressions, because your regressions are specific to your product. Maybe what matters is "does the reply acknowledge the customer's order number" or "is the JSON valid" or "does it never promise a refund." The things that break when you change a prompt are the things you cared about enough to put in the prompt. Your metrics have to be that specific. CompletionKit's scoring is built around exactly this: you write the criterion and a 1–5 rubric, and an LLM judge applies it to every response in a run. When a change drops "acknowledges order number" from 4.8 to 3.1, you see it immediately, with the failing responses right there.
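
As a sketch of what "the criterion and a 1–5 rubric" can mean in practice, here is one way the order-number check might be written down. The field names are hypothetical, not CompletionKit's schema; the rubric text is the part that matters.

```typescript
// Illustrative only: a product-specific criterion and its 1–5 rubric.
const criterion = {
  name: "acknowledges order number",
  description: "The reply must repeat the customer's order number back to them.",
  rubric: {
    1: "No mention of the order number at all.",
    2: "Refers to 'your order' but never states the number.",
    3: "States a number, but the wrong one or garbled.",
    4: "States the correct number, buried late in the reply.",
    5: "Opens by confirming the correct order number.",
  },
};
// The LLM judge sees the input, the response, and this rubric,
// and returns a 1–5 score for every response in a run.
```
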

AI suggestions from your data

When something is regressing, the next move shouldn't be "scroll through 50 outputs and squint." CompletionKit looks at what's failing on your actual data and proposes concrete prompt changes — not generic "be more specific" advice, but edits grounded in the responses that scored low. It's the difference between a diagnostic and a wall of logs. As far as we know, nobody else does this from your own evaluation data.

Test prompts

Table stakes. Everyone in this space lets you run a prompt against inputs and look at the results. We do too. The differentiation isn't here; it's in everything around it.

Serve prompts to your app

This is the maintenance multiplier. In CompletionKit, every published prompt version has a URL. Your app fetches the template and model from that URL and runs the LLM the way it already does — no SDK, just JSON over HTTP. Which means "make a change to the prompt" is "publish a new version," not "edit a string in your codebase and redeploy." The artifact you tested and the artifact in production are the same object. You can roll a version back the way you'd roll back a deploy. You can see exactly which version was live when that weird screenshot happened. The eval tool and the prompt-of-record are one thing — that's the point.
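
A sketch of what "no SDK, just JSON over HTTP" looks like from the app's side. The URL, response fields, and template placeholder below are assumptions for illustration, stand-ins for whatever your published version's endpoint actually returns.

```typescript
// Hypothetical example: fetch the published, versioned prompt instead of
// hard-coding a string in the codebase.
const ticketText = "My order #4821 never arrived.";

const res = await fetch("https://api.completionkit.example/prompts/support-reply/v7");
const { template, model } = await res.json(); // e.g. { template: "You are ...", model: "gpt-4o-mini" }

// Fill the template and call whichever LLM client your app already uses.
const prompt = template.replace("{{ticket}}", ticketText);
console.log(model, prompt);
```

Rolling back is fetching v6 instead of v7; nothing in the codebase changes.
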

MCP server

Each organization gets an MCP endpoint, so an agent — or you, through Claude — can drive the whole loop: create a prompt, kick off a run against a dataset, read the scores, propose a fix. The maintenance loop becomes something you can hand to a tool, not just click through.

Free and self-hostable

A maintenance tool is, by definition, something you depend on for the long haul. It would be a bad joke to build your prompt-change discipline on a thing you can't keep. The engine is source-available — free for any use, including production — and you can run it yourself forever. CompletionKit Cloud is for the people who'd rather not.

The tools we like, and the gap we're filling

We didn't build this in a vacuum. There's good work in this space and we've learned from a lot of it.

OpenAI Evals is a solid evaluation framework — if you live in the OpenAI ecosystem and you're comfortable in code, it does the job. But it's single-provider, it's a dev-time harness rather than a system of record, and there's no serving layer: the eval and the production prompt stay separate.

Anthropic Workbench is genuinely lovely for iterating on a prompt in the Console — fast, tactile, great for the build phase. But it's Anthropic-only, what you do there is ephemeral rather than versioned, and it doesn't serve prompts to your app or maintain a baseline you can regress against.

Braintrust is the closest peer and we have a lot of respect for it — full eval, observability, prompt management, serving, the works. The tradeoffs: it's a hosted product you can't self-host, and it's a heavier commitment than a team that just wants a tight prompt-change loop needs. If you want the full platform and don't mind it being SaaS-only, it's excellent.

Langfuse is a great open-source observability and tracing stack, self-hostable, with prompt management bolted on. It's strongest as a "what happened in production" tool; evaluation is a secondary concern rather than the spine, and LLM-judge-first scoring isn't its center of gravity. If observability is your primary need, it's a strong pick.

Promptfoo is a beloved CLI for evaluating prompts, and it's terrific in CI — fast, scriptable, easy to drop into a pipeline. But it's a test runner, not a system of record: no versioned prompt store, no serving layer, no UI for the non-engineers on your team, no lifecycle past "the test passed." It answers the build question well and doesn't try to answer the maintenance one.

Our bet, in one sentence: the evaluation tool and the prompt-serving layer should be the same thing — versioned, source-available, with an LLM-judge-first scoring model — so that testing a change and shipping a change stop being two separate acts. That's the gap. That's what CompletionKit is.

Put the loop in before you need it

The expensive version of this story is the one where you don't have any of this, the screenshots keep coming, you keep patching the prompt, and six months in nobody on the team can say with confidence what the prompt does or how it got that way. The cheap version is the one where, on day one, the prompt is a versioned artifact with a baseline, your app fetches it by version, and every change runs the baseline before it ships. The cheap version costs an afternoon to set up. The expensive version costs you a quarter, and it shows up in your NPS.

Shipping the prompt was the easy part. CompletionKit is for the year after.


Built by Homemade Software. Questions, disagreements, "you got Promptfoo wrong" — support@completionkit.com.
