Blog
Notes from the team building CompletionKit.
-
June 12, 2026
Your eval tool shouldn't work for the model it's grading
OpenAI agreed to acquire Promptfoo in March 2026. Nothing nefarious has to happen for that to matter. Self-preference bias is a measured property of LLM judges, and an eval vendor owned by a model lab is the same conflict one level up, on the org chart.
-
June 1, 2026
What is a judge-only run?
A judge-only run scores outputs you already have, with no generation step: production logs, hand-curated examples, a competitor's API, the older model you're about to replace. Drop the outputs into a column, point the run at it, define your metrics, and the judge scores each row. The move is older than it looks (library assessors were grading retrieval output at Cranfield in the early 1960s), and it takes four shapes in practice.
-
May 31, 2026
MCP server for prompt evals: setting up Claude Code and Cursor
Point your coding agent's MCP client at your CompletionKit organization's endpoint and it gains a toolbox for prompt evals: list and revise prompts, kick off runs, read per-row scores, replay the judge. Here's the Claude Code config, the Cursor config, the toolbox by resource, a first task that exercises three of the tools end-to-end, and the four mistakes that bite on the first connection.
-
May 25, 2026
Let your coding agent improve your prompts: the automated eval loop
Plug Claude Code or Cursor into a CompletionKit MCP server and the agent runs the full loop on its own: draft a prompt, run it on your dataset, read the scores and the failing rows, revise, re-run, until the score clears the bar you set. You define what good means; the agent does the iterating. Here is the shape of the loop, where the idea came from (hill climbing, DSPy, MCP), and how to set it up so the version it hands you back is one you can actually ship.
-
May 22, 2026
Prompt versioning: treating prompts like code
A prompt is a production artifact, and any production artifact deserves a stable identifier per revision, a diff you can read, and a switch you can flip back. Here's where the moves came from (SCCS in 1972, RCS in 1982, Git in 2005), what changes when the artifact is a string a model reads instead of code a compiler runs, and why a Git history alone is only half of what versioning actually does.
-
May 19, 2026
Prompt regression testing: catching the cases that used to work
A one-line prompt edit to fix one case can silently break twelve others. Prompt regression testing is the LLM-era test suite: a baseline of scored runs on a fixed dataset, re-run on every change, diffed against the last version. Here's why it matters, where the idea came from, and how to set one up.
-
May 18, 2026
How to build a prompt evaluation dataset
An evaluation dataset is the bar a prompt change has to clear. Here's how to source real inputs from production logs and support tickets, how many rows you actually need, how to cover edge cases on purpose, when to use expected outputs versus a judge, and why holding the dataset fixed is what makes the baseline mean anything.
-
May 16, 2026
What is prompt evaluation?
Prompt evaluation is scoring a prompt against a fixed dataset before you ship it. The held-out test set, repurposed. Here's a crisp definition, why it's not the same thing as observability, where the idea came from, and what a run and a metric actually are.
-
May 15, 2026
LLM-as-a-judge: a practical guide to scoring prompt outputs
Using one model to grade another is how you score open-ended outputs at a pace that keeps up with your edits. Here's how it works, where it came from, the biases it carries (position, verbosity, self-preference), and how to set it up so the number is one you can actually trust.
-
May 11, 2026
Why we built CompletionKit
Shipping a prompt is the easy part. The hard part is getting through the year afterward (repairs, improvements, model swaps) without quietly breaking the cases that already worked. Here's the feature matrix versus the tools we admire, and why a maintenance loop matters as much as a dev loop.
-
May 11, 2026
CompletionKit Cloud is live
The hosted version of CompletionKit is open for signups. Free tier, no install, and a real discount for early adopters.