Blog

Notes from the team building CompletionKit.

June 25, 2026

CI/CD for LLM applications

CI/CD for an LLM application means gating every change that touches the model the same way you gate a code change: re-run a scored evaluation on the proposed version, compare per-row scores to the previous baseline, and fail the build on any drop you didn't sign off on. The post covers the wiring over the REST API, a worked GitHub Actions job, and thresholds set to survive judge noise rather than fail a green build.
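
As a rough illustration of the per-row comparison described above, here is a minimal sketch in Python. It assumes both runs have already been fetched as plain `row_id -> score` dictionaries covering the same rows; the function names, the `tolerance` default, and the `approved` set are hypothetical and are not part of the CompletionKit API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Regression:
    row_id: str
    baseline: float
    candidate: float


def find_regressions(baseline, candidate, tolerance=0.05, approved=frozenset()):
    """Return rows whose score dropped by more than `tolerance`.

    `baseline` and `candidate` map row ids to scores (assumed to share keys).
    `tolerance` absorbs small judge-to-judge noise; the 0.05 default is an
    arbitrary placeholder, not a recommended value.
    `approved` holds row ids whose drop was explicitly signed off.
    """
    regressions = []
    for row_id, base_score in baseline.items():
        if row_id in approved:
            continue  # a reviewer accepted this drop
        new_score = candidate[row_id]
        if base_score - new_score > tolerance:
            regressions.append(Regression(row_id, base_score, new_score))
    return regressions


def gate(baseline, candidate, tolerance=0.05, approved=frozenset()):
    """Return a process exit code: 0 when the build may pass, 1 otherwise."""
    return 1 if find_regressions(baseline, candidate, tolerance, approved) else 0
```

A CI step could pass the result of `gate(...)` to `sys.exit` so that a non-zero code fails the job.
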
June 13, 2026

Let your coding agent improve the judge itself

Two loops, one shape. The first lets a coding agent climb the eval score by editing the prompt. The second lets it climb the agreement-with-humans number by editing the rubric. Three MCP tools close it: agreements_list, metrics_suggest_variants, judges_compare. The agent reads disagreements, drafts candidate rubrics, and publishes a candidate only when judges_compare returns a recommend verdict.
June 12, 2026

Your eval tool shouldn't work for the model it's grading

OpenAI agreed to acquire Promptfoo in March 2026. Nothing nefarious has to happen for that to matter. Self-preference bias is a measured property of LLM judges, and an eval vendor owned by a model lab is the same conflict one level up, at the org chart.
June 7, 2026

Best prompt evaluation tools in 2026

The credible prompt evaluation tools in 2026, alphabetical and not ranked: Anthropic Workbench, Braintrust, CompletionKit, DeepEval, Galileo, Langfuse, OpenAI Evals, and Promptfoo. An honest read on who each one is for, plus how the category formed out of OpenAI Evals in 2023 and consolidated through three acquisitions in early 2026.
June 4, 2026

CompletionKit vs Braintrust, Langfuse, Promptfoo, Galileo, DeepEval: prompt evaluation tools compared

Eight prompt evaluation tools (OpenAI Evals, Anthropic Workbench, Braintrust, Langfuse, Promptfoo, Galileo, DeepEval, and CompletionKit), side by side on the axes that matter for keeping prompts working in production. An honest read on where each one wins, and the decision rules for picking the one that fits.
June 1, 2026

What is a judge-only run?

A judge-only run scores outputs you already have, with no generation step: production logs, hand-curated examples, a competitor's API, the older model you're about to replace. Drop the outputs into a column, point the run at it, define your metrics, and the judge scores each row. The move is older than it looks (library assessors were grading retrieval output at Cranfield in the early 1960s), and it takes four shapes in practice.
May 31, 2026

MCP server for prompt evals: setting up Claude Code and Cursor

Point your coding agent's MCP client at your CompletionKit organization's endpoint and it gains a toolbox for prompt evals: list and revise prompts, kick off runs, read per-row scores, replay the judge. Here's the Claude Code config, the Cursor config, the toolbox by resource, a first task that exercises three of the tools end-to-end, and the four mistakes that bite on the first connection.
May 25, 2026

Let your coding agent improve your prompts: the automated eval loop

Plug Claude Code or Cursor into a CompletionKit MCP server and the agent runs the full loop on its own: draft a prompt, run it on your dataset, read the scores and the failing rows, revise, re-run, until the score clears the bar you set. You define what good means; the agent does the iterating. Here is the shape of the loop, where the idea came from (hill climbing, DSPy, MCP), and how to set it up so the version it hands you back is one you can actually ship.
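
The loop described above has the shape of simple hill climbing, which can be sketched in a few lines of Python. This is an illustration only: `evaluate` and `revise` are hypothetical callables standing in for "run the prompt on the dataset and return a score" and "have the agent draft a revision", and a real agent would also read the failing rows rather than the prompt alone.

```python
from typing import Callable, Tuple


def improve_prompt(
    prompt: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str], str],
    target: float,
    max_rounds: int = 5,
) -> Tuple[str, float]:
    """Hill-climb on the eval score until it reaches `target` or rounds run out."""
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(max_rounds):
        if best_score >= target:
            break  # the bar you set has been cleared
        candidate = revise(best_prompt)
        score = evaluate(candidate)
        if score > best_score:
            # keep a revision only when it actually scores higher
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

Capping the number of rounds keeps the loop from running indefinitely when no revision reaches the target.
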
May 22, 2026

Prompt versioning: treating prompts like code

A prompt is a production artifact, and any production artifact deserves a stable identifier per revision, a diff you can read, and a switch you can flip back. Here's where the moves came from (SCCS in 1972, RCS in 1982, Git in 2005), what changes when the artifact is a string a model reads instead of code a compiler runs, and why a git history alone is only half of what versioning actually does.
May 19, 2026

Prompt regression testing: catching the cases that used to work

A one-line prompt edit to fix one case can silently break twelve others. Prompt regression testing is the LLM-era test suite: a baseline of scored runs on a fixed dataset, re-run on every change, diffed against the last version. Here's why it matters, where the idea came from, and how to set one up.
May 18, 2026

How to build a prompt evaluation dataset

An evaluation dataset is the bar a prompt change has to clear. Here's how to source real inputs from production logs and support tickets, how many rows you actually need, how to cover edge cases on purpose, when to use expected outputs versus a judge, and why holding the dataset fixed is what makes the baseline mean anything.
May 16, 2026

What is prompt evaluation?

Prompt evaluation is scoring a prompt against a fixed dataset before you ship it. The held-out test set, repurposed. Here's a crisp definition, why it's not the same thing as observability, where the idea came from, and what a run and a metric actually are.
May 15, 2026

LLM-as-a-judge: a practical guide to scoring prompt outputs

Using one model to grade another is how you score open-ended outputs at a pace that keeps up with your edits. Here's how it works, where it came from, the biases it carries (position, verbosity, self-preference), and how to set it up so the number is one you can actually trust.
May 11, 2026

Why we built CompletionKit

Shipping a prompt is the easy part. The hard part is the year afterward (repairs, improvements, model swaps) without quietly breaking the cases that already worked. Here's the feature matrix versus the tools we admire, and why a maintenance loop matters as much as a dev loop.
May 11, 2026

CompletionKit Cloud is live

The hosted version of CompletionKit is open for signups. Free tier, no install, and a real discount for early adopters.