Your prompts need tests too.

Run every prompt against real data. Score each output with an LLM judge against criteria you define. Change anything, re-run, and see exactly what got better and what broke.

gem install completion-kit
[Screenshot: CompletionKit run showing five outputs scored on Accuracy, Local Relevance, and Engagement by an LLM judge]

The problem

You change a prompt. You ship it. You think it works better. It doesn't.

You can't test prompts by eyeballing random responses. CompletionKit runs every input through the model, scores every output, and shows you exactly what changed.

How it works

  1. Write a prompt.

    Start here. If your prompt uses variable inputs, add {{placeholders}} and upload a CSV. If it doesn't, just run it as-is.

  2. Define what "good" looks like.

    Create metrics with a scoring scale. The LLM judge uses them to score every output.

  3. Run, score, iterate.

    Pick a model, run it, read the scores. Change the prompt, the model, the temperature. Re-run and see what moved.
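The template-plus-CSV flow in step 1 can be sketched in plain Ruby. This is illustrative, not CompletionKit's internal API: it just shows how a prompt with {{placeholders}} expands against each CSV row.

```ruby
require "csv"

# A prompt template using the {{placeholder}} syntax described above.
template = "Write a one-line tagline for {{business}} in {{city}}."

# Each CSV row supplies one set of inputs; the headers match the placeholder names.
rows = CSV.parse(<<~DATA, headers: true)
  business,city
  a vegan bakery,Portland
  a bike repair shop,Amsterdam
DATA

# Substitute every {{name}} with that row's value, yielding one concrete prompt per row.
prompts = rows.map do |row|
  template.gsub(/\{\{(\w+)\}\}/) { row[Regexp.last_match(1)] }
end

puts prompts
# => Write a one-line tagline for a vegan bakery in Portland.
#    Write a one-line tagline for a bike repair shop in Amsterdam.
```

Each expanded prompt is what actually gets sent to the model, so a 50-row CSV means 50 runs scored per metric.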

[Screenshot: CompletionKit prompt detail view]

See it in action

[Animated walkthrough: prompts index, prompt detail, a run with scored results, and an individual response with LLM judge feedback]

How CompletionKit compares

|                               | OpenAI Evals | Anthropic Workbench | Braintrust | Langfuse | Promptfoo | CompletionKit |
|-------------------------------|--------------|---------------------|------------|----------|-----------|---------------|
| Multi-provider                | No           | No                  | Yes        | Yes      | Yes       | Yes           |
| Local models (Ollama)         | No           | No                  | No         | Yes      | Yes       | Yes           |
| Custom scoring criteria       | Partial      | Partial             | Yes        | Yes      | Yes       | Yes           |
| AI suggestions from your data | No           | Generic             | No         | No       | No        | Yes           |
| Versioned prompts via API     | No           | No                  | Yes        | Yes      | No        | Yes           |
| MCP server                    | No           | No                  | No         | No       | No        | Yes           |
| Free + open source            | Partial      | Partial             | No         | Yes      | Yes       | Yes           |

Features

Multi-model

OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter. One tool, all of them.

Custom scoring criteria

Define metrics with 1-5 star bands. The LLM-as-judge scores every output against criteria you define, not a generic "is this good?" prompt.
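A metric with 1-5 star bands might look like the following sketch. The structure and names here are illustrative (CompletionKit's actual schema may differ); the point is that each band carries its own description, which is then rendered into the rubric the judge sees.

```ruby
# Illustrative only: a metric with labeled 1-5 star bands, rendered into
# the kind of rubric an LLM judge could receive alongside each output.
metric = {
  name: "Local Relevance",
  bands: {
    1 => "Ignores the target locale entirely",
    2 => "Mentions the locale but details are generic",
    3 => "Some locale-specific detail, partially accurate",
    4 => "Mostly locale-specific and accurate",
    5 => "Fully grounded in the target locale"
  }
}

# Build the rubric: a header line followed by one line per star band.
rubric = ["Score '#{metric[:name]}' from 1 to 5:"] +
         metric[:bands].map { |stars, desc| "#{stars} - #{desc}" }

puts rubric.join("\n")
```

Because the bands are explicit, two runs scored weeks apart are judged against the same written criteria rather than the judge model's mood that day.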

Versioned prompts

Every prompt is versioned. Edit a published prompt and it forks a new version automatically. History is preserved.

AI-driven suggestions

When the scores tell you something's off, ask CompletionKit for an improved prompt. The suggestion is grounded in the LLM judge's actual feedback on your runs.

REST API + MCP server

Every resource is exposed via a bearer-token REST API and a built-in Model Context Protocol server with 36 tools. Drive it from a browser, an HTTP client, or your IDE.
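Hitting the REST API from Ruby could look like this sketch. The host and the /api/v1/prompts path are assumptions for illustration (check the README for the real routes); the bearer token is whatever you configured for your instance.

```ruby
require "net/http"
require "json"

# Hypothetical endpoint: adjust host and path to match your deployment.
uri = URI("http://localhost:3000/api/v1/prompts")

req = Net::HTTP::Get.new(uri)
# Bearer-token auth as described above; token name/source is an example.
req["Authorization"] = "Bearer #{ENV.fetch('COMPLETION_KIT_TOKEN', 'example-token')}"
req["Accept"] = "application/json"

# Uncomment to actually send the request against a running instance:
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
# prompts = JSON.parse(res.body)
```

The same resources are reachable through the MCP server, so an IDE assistant can list prompts, trigger runs, and read scores without touching HTTP directly.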

Free, MIT-licensed

Open source. Self-hostable. 100% test coverage. No per-seat pricing, no SaaS lock-in, no vendor account required.

Install

Add to your Gemfile:

gem "completion-kit"

Then run:

bin/rails generate completion_kit:install
bin/rails db:migrate

Set your provider keys via environment variables:

OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
OPENROUTER_API_KEY=...

Or clone the repo and run the bundled standalone app. See the README for the full walkthrough.

FAQ

Which providers does it support?

OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter.

Does it work outside Rails?

CompletionKit ships as a Rails engine, but it includes a bundled standalone Rails app you can deploy as a hosted service without writing any Rails code yourself.

Is it free?

Yes. MIT-licensed, free forever, no usage limits, no telemetry.

Who made it?

Built by Homemade Software.