# CompletionKit

> CompletionKit is a prompt evaluation and improvement loop for AI apps. Define what good means with a 1-5 rubric, run prompts against a fixed dataset, score every output with an LLM judge, and see per-row scores plus the judge's rationale before you ship.

CompletionKit Cloud is the hosted multi-tenant version. The same product also ships as a Rails engine (`completion-kit` gem) you can mount in an existing Rails app, and as a bundled standalone Rails app for self-hosters. It is source-available (BSL 1.1), not open source.
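The loop described above (a 1-5 rubric, a fixed dataset, an LLM judge, per-row scores with a rationale) can be sketched in a few lines of Python. This is an illustration of the idea only, not the CompletionKit API: `evaluate`, `RowResult`, `stub_generate`, and `stub_judge` are hypothetical names, and both the generator and the judge are deterministic stand-ins for real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical rubric text; a real rubric would describe each score level.
RUBRIC = "5 = complete summary within the word limit; 1 = unusable"

# Fixed dataset: the same rows are re-run on every prompt change.
DATASET = [
    {"text": "The meeting moved to Tuesday."},
    {"text": "Invoice 42 is overdue by ten days and needs attention."},
]

PROMPT = "Summarize in five words or fewer: {text}"


@dataclass
class RowResult:
    row_input: str
    output: str
    score: int       # 1-5, per the rubric
    rationale: str   # the judge's explanation for the score


def stub_generate(prompt: str) -> str:
    """Stand-in for a model call: keeps the first five words of the input."""
    text = prompt.split(": ", 1)[1]
    return " ".join(text.split()[:5])


def stub_judge(rubric: str, row: dict, output: str) -> tuple[int, str]:
    """Stand-in for an LLM judge: returns a score and a rationale."""
    if output.endswith("."):
        return 5, "Complete sentence within the word limit."
    return 2, "Output is cut off mid-sentence."


def evaluate(
    prompt_template: str,
    dataset: list[dict],
    generate: Callable[[str], str],
    judge: Callable[[str, dict, str], tuple[int, str]],
) -> list[RowResult]:
    """Run the prompt on every row and score each output against the rubric."""
    results = []
    for row in dataset:
        output = generate(prompt_template.format(**row))
        score, rationale = judge(RUBRIC, row, output)
        if not 1 <= score <= 5:
            raise ValueError(f"judge returned out-of-range score: {score}")
        results.append(RowResult(row["text"], output, score, rationale))
    return results


results = evaluate(PROMPT, DATASET, stub_generate, stub_judge)
for r in results:
    print(r.score, "|", r.output, "|", r.rationale)
print("mean score:", mean(r.score for r in results))
```

Because the dataset is held fixed, the mean score from one prompt version can be compared directly with the next; the per-row rationale shows which rows moved and why.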

## Start

- [Sign up](https://completionkit.com/registration/new): create an account and run your first evaluation. Self-serve, no sales call.
- [Pricing](https://completionkit.com/pricing): the plan tiers and what each unlocks.
- [About](https://completionkit.com/about): who makes CompletionKit and why it exists.

## Docs

- [Overview and concepts](https://completionkit.com/docs): the four moves of the loop, the glossary, and where to start.
- [REST and MCP reference](https://completionkit.com/docs/api): every API endpoint, every MCP tool, with copy-paste examples. Point a coding agent (Claude Code, Cursor) at the MCP server and it can drive the whole evaluate-and-improve loop.
- [Glossary](https://completionkit.com/glossary): one-page definitions for prompt evaluation, LLM judge, prompt versioning, regression testing, eval dataset, metric, judge calibration, and the automated prompt optimization loop.
- [Changelog](https://completionkit.com/changelog): what shipped, dated.

## Blog (the Prompt Loop campaign)

- [Agent prompt testing: test the prompts inside your agents](https://completionkit.com/blog/agent-prompt-testing): Agent prompt testing is scoring the prompts inside an AI agent against a fixed dataset, the same way you test any prompt. How it differs from agent behavior eval and observability.
- [CI/CD for LLM applications](https://completionkit.com/blog/ci-cd-for-llm-applications): Practical guide to CI/CD for LLM apps: gate prompt and model changes on a scored eval run via the REST API, with a worked GitHub Actions job and two-tier thresholds.
- [Let your coding agent improve the judge itself](https://completionkit.com/blog/improve-the-judge-with-mcp): How to let a coding agent improve the LLM judge itself over MCP: read disagreements, draft rubric rewrites, compare against human verdicts, and publish only on a "recommend" verdict.
- [Your eval tool shouldn't work for the model it's grading](https://completionkit.com/blog/eval-tool-independence): Why evaluation infrastructure should be independent of the model labs it scores, and what to ask any eval vendor about ownership, defaults, and data.
- [Best prompt evaluation tools in 2026](https://completionkit.com/blog/best-prompt-evaluation-tools-2026): An honest roundup of the prompt evaluation tools that matter in 2026: who each is for, the standout strength, and the tradeoff. Alphabetical, not ranked.
- [CompletionKit vs Braintrust, Langfuse, Promptfoo, Galileo, DeepEval: prompt evaluation tools compared](https://completionkit.com/blog/completionkit-vs-galileo-vs-braintrust-vs-promptfoo): Honest comparison of the eight prompt evaluation tools that matter in 2026 (CompletionKit, Galileo, Braintrust, Promptfoo, Langfuse, OpenAI Evals, Anthropic Workbench, DeepEval) on seven axes.
- [What is a judge-only run?](https://completionkit.com/blog/what-is-a-judge-only-run): A judge-only run scores outputs you already have, with no generation step. Use it on production logs, gold sets, or another system's results.
- [MCP server for prompt evals: setting up Claude Code and Cursor](https://completionkit.com/blog/mcp-server-for-prompt-evals): Setup walkthrough: connect Claude Code or Cursor to the CompletionKit MCP server for prompt evals. Real config snippets, the toolbox by resource, troubleshooting.
- [Let your coding agent improve your prompts: the automated eval loop](https://completionkit.com/blog/automated-prompt-optimization-loop): How to point Claude Code or Cursor at the CompletionKit MCP server and let the agent run the prompt eval loop on its own until the score clears your bar.
- [Prompt versioning: treating prompts like code](https://completionkit.com/blog/prompt-versioning): Prompt versioning means treating each change as an immutable revision with a stable id, a parent, and a readable diff. Lineage: SCCS, RCS, Git.
- [Prompt regression testing: catching the cases that used to work](https://completionkit.com/blog/prompt-regression-testing): Prompt regression testing is the LLM-era test suite: a fixed eval dataset re-run on every prompt change so silent regressions get caught before shipping.
- [How to build a prompt evaluation dataset](https://completionkit.com/blog/how-to-build-a-prompt-eval-dataset): Step-by-step: how to pick real inputs, how many rows you need, expected outputs vs LLM judge, and why holding the eval dataset fixed is what makes the baseline meaningful.
- [What is prompt evaluation?](https://completionkit.com/blog/what-is-prompt-evaluation): Prompt evaluation is scoring a prompt against a fixed dataset before you ship it. Definition, why it isn't observability, where the idea came from.
- [LLM-as-a-judge: a practical guide to scoring prompt outputs](https://completionkit.com/blog/llm-as-a-judge): How LLM-as-a-judge works, where it came from, the biases it carries (position, verbosity, self-preference), and how to set it up so the score is one you can trust.
- [Why we built CompletionKit](https://completionkit.com/blog/why-we-built-completionkit): Shipping a prompt is the easy part. The hard part is getting through the year afterward without quietly breaking what already worked. Why we built CompletionKit for that.
- [CompletionKit Cloud is live](https://completionkit.com/blog/completionkit-cloud-is-live): CompletionKit Cloud is open for signups: hosted prompt evaluation with a free tier, no install, and a real discount for early adopters.

## Compare

- [CompletionKit vs Promptfoo](https://completionkit.com/compare/promptfoo): how the integrated evaluate-and-improve loop differs, and where each fits.
- [CompletionKit vs Braintrust](https://completionkit.com/compare/braintrust): feature, focus, and pricing comparison.
- [CompletionKit vs Langfuse](https://completionkit.com/compare/langfuse): feature, focus, and pricing comparison.
- [CompletionKit vs DeepEval](https://completionkit.com/compare/deepeval): feature, focus, and pricing comparison.

## Code

- [Engine on GitHub](https://github.com/homemade-software-inc/completion-kit): the source-available Rails engine.
- [completion-kit on RubyGems](https://rubygems.org/gems/completion-kit): the gem releases.

## Legal

- [Privacy](https://completionkit.com/legal/privacy)
- [Terms](https://completionkit.com/legal/terms)
- [DPA](https://completionkit.com/legal/dpa)
- [Security](https://completionkit.com/legal/security)
- [Accessibility](https://completionkit.com/legal/accessibility)