CompletionKit vs Promptfoo
Both test your prompts. Promptfoo is an open-source CLI for local evals and red teaming. CompletionKit is the hosted improvement loop: it scores how far you can trust the judge, suggests the next revision, and serves the winning prompt to your app.
The short version
Promptfoo is an MIT-licensed CLI and library. You describe tests in a promptfooconfig.yaml file and run them locally or in CI. It is best in class at red teaming and security scanning, and it is developer-first: it lives in your terminal and your pipeline. In March 2026 OpenAI agreed to acquire it; it stays MIT-licensed.
CompletionKit is a hosted improvement loop. An LLM judge scores each output, and a trust panel tells you how far to trust that judge against your own labels. CompletionKit then drafts the next prompt revision from the judge's feedback, and once a version proves out you publish it so your app fetches the live prompt. It is independent and source-available, and you can self-host it.
Side by side
| | Promptfoo | CompletionKit |
|---|---|---|
| Interface | CLI + YAML, local-first | Hosted UI + REST API |
| Multi-provider (OpenAI, Anthropic, local) | ✓ | ✓ |
| LLM-as-a-judge scoring | ✓ | ✓ |
| Judge calibration / trust score | Manual | ✓ |
| AI-suggested prompt revisions | – | ✓ |
| Versioned prompts served to your app | – | ✓ |
| Non-coder prompt editing in a UI | – | ✓ |
| Red teaming + security scans | ✓ | – |
| CI/CD out of the box | ✓ | Via API |
| MCP server | ✓ | ✓ |
| Hosted runs + history | Optional | ✓ |
| License | MIT (open source) | BSL 1.1 (source-available) |
Promptfoo documents a judge-calibration workflow but leaves you to compute the agreement statistic yourself, hence "Manual". Both ship an MCP server.
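To make "compute the agreement statistic yourself" concrete, here is a minimal sketch of one common choice, Cohen's kappa, which measures how often the judge matches your labels beyond what chance alone would produce. This is an illustration only: the page does not say which statistic either tool uses, and the function name and the sample labels below are made up for the example.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between an LLM judge and human labels.

    Hypothetical helper for illustration; not part of Promptfoo or
    CompletionKit. Both inputs are equal-length sequences of category
    labels such as "pass" / "fail".
    """
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(judge_labels)

    # Observed agreement: share of rows where judge and human match.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement: chance of a match if each rater labeled
    # independently at their own observed label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up sample: 8 outputs labeled by the judge and by a human reviewer.
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
kappa = cohens_kappa(judge, human)
```

A kappa near 1 means the judge tracks your labels closely; a value near 0 means its agreement is no better than chance, even when the raw match rate looks respectable.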
Which one fits
Reach for Promptfoo if you want a free, MIT-licensed tool that lives in your repo and CI, you are comfortable in YAML and the command line, and red teaming or security testing is high on your list. It is the stronger choice for adversarial security work.
Reach for CompletionKit if you want the improvement loop done for you rather than assembled: a judge you can trust because the agreement is scored, the next revision drafted from the judge's feedback, prompts served to your app so non-coders can edit them safely, and a tool that stays independent of any model vendor.
They are not either-or. Plenty of teams run Promptfoo for red teaming and CompletionKit for the hosted improvement loop.
Questions
- What does CompletionKit do that Promptfoo doesn't?
- Three things. It scores how much you can trust the LLM judge (Promptfoo documents the calibration workflow but leaves you to compute the agreement statistic yourself). It reads the judge's per-row feedback and drafts the next prompt revision for you (Promptfoo has no built-in prompt optimizer). And it serves the published prompt to your app at runtime, so a prompt change is not a code change. Promptfoo keeps prompts in your repo as YAML or files.
- Does CompletionKit do red teaming?
- No. Adversarial red teaming and security scanning are Promptfoo's strengths, and it is genuinely good at them. CompletionKit focuses on the evaluate, improve, and ship loop for the prompts already running in your app. If security testing is your priority, use Promptfoo.
- Is Promptfoo still independent?
- OpenAI agreed to acquire Promptfoo in March 2026, and it remains MIT-licensed. If you would rather your evaluation tool not be owned by a model vendor you test against, CompletionKit is independent and source-available, and you can self-host it so your eval data stays yours.
- Can I use both?
- Yes. They are not mutually exclusive. Run Promptfoo in CI for red teaming and use CompletionKit for the hosted improvement loop, judge calibration, and prompt serving. Pick whichever fits each job.