Run every prompt against real data. Score each output with an LLM judge against criteria you define. Change anything, re-run, and see exactly what got better and what broke.
gem install completion-kit
You change a prompt. You ship it. You think it works better. It doesn't.
You can't test prompts by eyeballing random responses. CompletionKit runs every input through the model, scores every output, and shows you exactly what changed.
Start here. If your prompt uses variable inputs, add {{placeholders}} and upload a CSV. If it doesn't, just run it as-is.
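As a sketch of what that looks like (the prompt, column names, and rows here are illustrative, not from a real project):

```
Prompt:
  Classify the urgency of this support ticket from {{customer}}: {{message}}

inputs.csv:
  customer,message
  Acme Corp,"Our dashboard has been down for an hour"
  Initech,"How do I export my data to CSV?"
```

Each CSV column maps to a placeholder of the same name, and each row becomes one run of the prompt.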
Create metrics with a scoring scale. The LLM judge uses them to score every output.
Pick a model, run it, read the scores. Change the prompt, the model, the temperature. Re-run and see what moved.
| Feature | OpenAI Evals | Anthropic Workbench | Braintrust | Langfuse | Promptfoo | CompletionKit |
|---|---|---|---|---|---|---|
| Multi-provider | No | No | Yes | Yes | Yes | Yes |
| Local models (Ollama) | No | No | No | Yes | Yes | Yes |
| Custom scoring criteria | Partial | Partial | Yes | Yes | Yes | Yes |
| AI suggestions from your data | No | Generic | No | No | No | Yes |
| Versioned prompts via API | No | No | Yes | Yes | No | Yes |
| MCP server | No | No | No | No | No | Yes |
| Free + open source | Partial | Partial | No | Yes | Yes | Yes |
OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter. One tool, all of them.
Define metrics with 1-5 star bands. The LLM-as-judge scores every output against criteria you define, not a generic "is this good?" prompt.
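A metric definition might look something like this (the metric name and band wording are hypothetical examples, not built-in defaults):

```
Metric: Factual accuracy (1-5 stars)
  5 ★  Every claim is supported by the provided input
  4 ★  Accurate, with one minor unsupported detail
  3 ★  Mostly accurate; some details not grounded in the input
  2 ★  Several unsupported or misleading claims
  1 ★  Contains fabricated or contradicted claims
```

The judge scores each output against these bands rather than a generic quality prompt.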
Every prompt is versioned. Edit a published prompt and it forks a new version automatically. History is preserved.
When the scores tell you something's off, ask CompletionKit for an improved prompt. The suggestion is grounded in the LLM judge's actual feedback on your runs.
Every resource is exposed via a bearer-token REST API and a built-in Model Context Protocol server with 36 tools. Drive it from a browser, an HTTP client, or from your IDE.
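A minimal sketch of calling the REST API from Ruby, assuming a locally mounted engine; the endpoint path and token are placeholders, so check the README for the actual routes:

```ruby
require "net/http"
require "uri"

# Hypothetical endpoint and token -- adjust to your deployment.
url   = URI("http://localhost:3000/completion_kit/api/prompts")
token = ENV.fetch("COMPLETION_KIT_TOKEN", "example-token")

# Build an authenticated GET request with a bearer token.
req = Net::HTTP::Get.new(url)
req["Authorization"] = "Bearer #{token}"
req["Accept"]        = "application/json"

# To actually send it:
# res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
puts req["Authorization"]
```

The same bearer token authenticates every resource, so the pattern above applies to runs, metrics, and datasets as well as prompts.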
Open source. Self-hostable. 100% test coverage. No per-seat pricing, no SaaS lock-in, no vendor account required.
Add to your Gemfile:
gem "completion-kit"
Then run:
bin/rails generate completion_kit:install
bin/rails db:migrate
Set your provider keys via environment variables:
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
OPENROUTER_API_KEY=...
Or clone the repo and run the bundled standalone app. See the README for the full walkthrough.
OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter.
CompletionKit ships as a Rails engine, but it includes a bundled standalone Rails app you can deploy as a hosted service without writing any Rails code yourself.
Yes. MIT-licensed, free forever, no usage limits, no telemetry.
Built by Homemade Software.