
CompletionKit vs Braintrust, Langfuse, Promptfoo, Galileo, DeepEval: prompt evaluation tools compared

If you're picking a prompt evaluation tool in 2026, the eight worth a serious look are OpenAI Evals, Anthropic Workbench, Braintrust, Langfuse, Promptfoo, Galileo, DeepEval, and CompletionKit. They aren't aiming at the same job. This post puts them side by side on the seven axes that matter for the maintenance loop, then gives each one an honest paragraph: where it came from, where it wins, what it's not trying to be. The goal is the comparison page that gets cited because the facts hold up, not the one that tilts toward whoever published it.

For background on what prompt evaluation actually is, see "What is prompt evaluation?". The framing this piece leans on (the build loop versus the maintenance loop, the artifact-of-record problem) lives in "Why we built CompletionKit".

The comparison

Seven axes:

  1. Multi-provider means you can grade outputs from more than one provider in the same tool.
  2. Local models (Ollama) means a local or OpenAI-compatible endpoint works without paying per token.
  3. Custom scoring metrics means rubrics you define, not just generic out-of-the-box scores.
  4. AI suggestions grounded in your data means revisions proposed from the rows that scored low, not generic prompt-engineering advice.
  5. Versioned prompts via API means each published version has a stable identifier your app can fetch at runtime.
  6. MCP server means a Model Context Protocol endpoint a coding agent can drive.
  7. Free + self-hostable means you can run it yourself, for free, in production.

See "LLM-as-a-judge" for the scoring half and "MCP server for prompt evals" for the agent half.

| Tool | Multi-provider | Local models | Custom metrics | AI suggestions | Versioned prompts | MCP server | Free + self-host |
|---|---|---|---|---|---|---|---|
| OpenAI Evals | No | No | Yes | No | No | No | Yes |
| Anthropic Workbench | No | No | Partial | No | Partial | No | No |
| Braintrust | Yes | Yes | Yes | Yes | Yes | Yes | Enterprise |
| Langfuse | Yes | Yes | Yes | Partial | Yes | Yes | Yes |
| Promptfoo | Yes | Yes | Yes | No | No | Yes | Yes |
| Galileo | Yes | Partial | Yes | No | Partial | No | No |
| DeepEval | Yes | Yes | Yes | No | No | No | Yes |
| CompletionKit | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Partial marks a capability that exists but isn't the focus, isn't one-click, or has a meaningful caveat (Anthropic-only model coverage, prompt versions that don't serve as a public API, suggestions that aren't grounded in your eval data). Enterprise marks self-hosting reserved for paid enterprise plans.
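
If you'd rather query the comparison than read it, here is a minimal sketch that encodes the rows above as data and shortlists tools by the axes you require. The axis keys and the `shortlist` helper are our own illustrative names, not part of any tool's API. By default only a plain Yes counts; Partial and Enterprise are excluded unless you opt in.

```python
# Column order matches the comparison table above.
AXES = [
    "multi_provider", "local_models", "custom_metrics", "ai_suggestions",
    "versioned_prompts", "mcp_server", "free_self_host",
]

# One row per tool, copied from the table.
ROWS = {
    "OpenAI Evals":        ["No", "No", "Yes", "No", "No", "No", "Yes"],
    "Anthropic Workbench": ["No", "No", "Partial", "No", "Partial", "No", "No"],
    "Braintrust":          ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Enterprise"],
    "Langfuse":            ["Yes", "Yes", "Yes", "Partial", "Yes", "Yes", "Yes"],
    "Promptfoo":           ["Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"],
    "Galileo":             ["Yes", "Partial", "Yes", "No", "Partial", "No", "No"],
    "DeepEval":            ["Yes", "Yes", "Yes", "No", "No", "No", "Yes"],
    "CompletionKit":       ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
}


def shortlist(required, accept=("Yes",)):
    """Return the tools whose cell is in `accept` for every required axis."""
    indexes = [AXES.index(axis) for axis in required]
    return [
        tool for tool, cells in ROWS.items()
        if all(cells[i] in accept for i in indexes)
    ]


# Tools that serve versioned prompts and are free to self-host.
print(shortlist(["versioned_prompts", "free_self_host"]))

# Same question, but counting enterprise-only self-hosting as acceptable.
print(shortlist(["versioned_prompts", "free_self_host"], accept=("Yes", "Enterprise")))
```

Whether a Partial is good enough is a judgment the table can't make for you, which is why the sketch leaves it as an explicit argument.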

The eight, one paragraph each

OpenAI Evals

Open-sourced by OpenAI on March 14, 2023, the day GPT-4 shipped, under an MIT licence. It's the framework the prompt-eval era is named after: a Python harness for registering test sets, running models over them, and grading the completions with rules or other models. OpenAI Evals wins inside the OpenAI ecosystem when you're comfortable in code and you want a dev-time harness, not a system of record. It's single-provider in spirit, has no UI for the non-engineers on your team, and there's no serving layer: the eval and the production prompt stay separate things.

Anthropic Workbench

The prompt iteration surface inside Anthropic's developer Console. Workbench gives you fast in-browser editing, a Prompt Improver that rewrites a prompt using chain-of-thought prompt-engineering techniques, and an Evaluation tab that lets you score outputs on a five-point scale with an optional "ideal output" column. It's a great build-phase tool for Claude. The tradeoffs are visible: it's Anthropic-only, the eval surface isn't designed for non-Anthropic models, and prompts you save live inside the Console rather than as a versioned API your app fetches.

Braintrust

Founded in 2023 by Ankur Goyal (formerly engineering at MemSQL and Impira), Braintrust is the closest hosted-platform peer to CompletionKit in scope: full eval, observability, prompt management, prompt serving, multi-provider model proxy. The "Loop" feature reads failing rows and proposes prompt revisions, which is the part most evaluators don't try to do. The tradeoffs are real: it's hosted-first, with self-hosting reserved for enterprise plans, and the first paid tier is $249/month, so it suits teams that want the full platform and don't mind a SaaS commitment. If that's you, it's excellent.

Langfuse

Started in 2022 by Marc Klingen, Maximilian Deichmann, and Clemens Rawert, Langfuse went through Y Combinator's W23 batch and shipped publicly in August 2023 as the open-source LLM engineering platform. Its centre of gravity is observability and tracing, with prompt management, datasets, and evals layered on top, including native LLM-judge scoring and an explicit human-labels-versus-judge agreement view. It's MIT-licensed, self-hostable, and was acquired by ClickHouse in January 2026, which makes the roadmap a ClickHouse roadmap now. If observability is your primary need, Langfuse is the open-source pick.

Promptfoo

Founded in 2024 by Ian Webster (ex-Discord) and Michael D'Angelo as an open-source, MIT-licensed CLI for evaluating prompts in CI. It's terrific where it lives: fast, scriptable, easy to drop into a pipeline, with strong red-teaming coverage that the others don't match. The shape is a test runner, not a system of record, so there's no versioned prompt store and no serving layer, and the lifecycle stops at "the test passed." OpenAI announced its acquisition of Promptfoo in early 2026. The CLI is still free; whether the OSS edition stays the centre of gravity is the open question.

Galileo

Founded in 2021 by Atindriyo Sanyal, Vikram Chatterji, and Yash Sheth (Apple Siri, Google AI, Uber AI), Galileo launched its LLM Evaluation, Experimentation and Observability platform in September 2023. It's the enterprise-shaped option: SaaS, VPC, and on-prem deployment, a Luna evaluation suite, and a strong focus on governance and trust scoring. Pricing starts free for low-volume usage and climbs to enterprise. Galileo is commercial, not open source, and Cisco announced its intent to acquire it in April 2026, folding the team into Splunk Observability Cloud. It's worth a look if your procurement team requires VPC or on-prem and you're already running real production volume. (To avoid a common mix-up: this is the LLM evaluation platform at galileo.ai, not the unrelated UI-design tool of a similar name.)

DeepEval

Open-sourced by Confident AI (Jeffrey Ip and Kritin Vongthongsri, Y Combinator W25) under Apache 2.0. DeepEval frames LLM evaluation as pytest-style unit tests: you write test files that import metrics like G-Eval, answer relevancy, faithfulness, and hallucination, and run them with the regular pytest runner. That model fits engineering teams already living in pytest, and the metric library is the deepest of any open-source evaluator. The tradeoffs are the same ones every test framework shares: no serving layer, no UI for non-engineers, no version-of-record for the prompt. Pair it with a serving layer if you need one.
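
To make the pytest-style shape concrete, here is a minimal sketch. It deliberately does not use DeepEval's API: `keyword_coverage` is a toy, hypothetical metric standing in for an LLM-judged one such as answer relevancy, and `fake_model` stands in for a real model call so the example runs offline. The point is only the shape: a metric, a threshold, and a plain `assert` that the pytest runner picks up.

```python
from typing import List


def keyword_coverage(output: str, required: List[str]) -> float:
    """Toy metric: fraction of required keywords found in the output (case-insensitive)."""
    if not required:
        return 1.0
    text = output.lower()
    hits = sum(1 for keyword in required if keyword.lower() in text)
    return hits / len(required)


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "Refunds are accepted within 30 days with a receipt."


def test_refund_answer_covers_policy():
    # pytest collects any function named test_*; a failed assert fails the run.
    output = fake_model("What is the refund policy?")
    score = keyword_coverage(output, ["refund", "30 days", "receipt"])
    assert score >= 0.99, f"coverage too low: {score:.2f}"
```

Saved as a file such as `test_refund.py`, this runs with the regular `pytest` command. In a real suite the toy metric would be replaced by one from the framework's metric library.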

CompletionKit

Source-available under BSL 1.1, free for any use including production, mountable into a Rails app or run standalone, with CompletionKit Cloud as the hosted path. The bet is narrow on purpose, and it is the contrast with the code-first frameworks above: a metric is one plain-language rubric scored 1–5, not a catalogue of metrics to wire up, so a PM can write it and an engineer can ship it. The judge carries a trust number: how well its scores agree with your own labels, measured with quadratic-weighted kappa. DeepEval and most of this list leave that cross-check to you. The prompt you serve is the same versioned artifact you tested, fetched by URL at runtime. Multi-provider (OpenAI, Anthropic, Ollama, OpenRouter), MCP-drivable, and priced flat for the whole team rather than per seat. Not the deepest framework on this list; the one a whole team can run and believe. "Why we built CompletionKit" is the long version.
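
For readers who haven't met quadratic-weighted kappa, here is a minimal sketch of the standard formula in plain Python, assuming integer scores on the 1–5 scale. It illustrates the statistic, not CompletionKit's own implementation; the function name and arguments are ours. The idea: kappa = 1 − (weighted observed disagreement) / (weighted disagreement expected by chance), where a miss of distance d between two scores is weighted d² / (k − 1)² for k score levels.

```python
def quadratic_weighted_kappa(human, judge, min_score=1, max_score=5):
    """Agreement between two lists of integer scores, penalising big misses more."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    if any(not (min_score <= s <= max_score) for s in list(human) + list(judge)):
        raise ValueError("scores must lie between min_score and max_score")

    k = max_score - min_score + 1  # number of score levels
    n = len(human)

    # observed[i][j] counts rows where the human gave level i and the judge level j.
    observed = [[0] * k for _ in range(k)]
    for h, j in zip(human, judge):
        observed[h - min_score][j - min_score] += 1

    human_hist = [sum(row) for row in observed]
    judge_hist = [sum(observed[i][c] for i in range(k)) for c in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Chance agreement: counts expected if the two raters were independent.
            weighted_expected += weight * human_hist[i] * judge_hist[j] / n

    if weighted_expected == 0:
        # Both raters used one identical score throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1.0 means the judge matches your labels exactly, 0 means no better than chance, and negative values mean systematic disagreement. Because the weights are squared distances, a 1 scored as a 5 costs far more than a 3 scored as a 4.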

How to pick

  1. You want the lightest possible CI test runner. Promptfoo, or DeepEval if you live in pytest. Both are permissively licensed (MIT and Apache 2.0 respectively), both run locally, and neither tries to be a system of record.
  2. You want a hosted platform with prompt management, observability, and serving. Braintrust if you're fine with SaaS-only hosting below the enterprise tier and a $249/month first paid tier. Langfuse if open source matters and observability is your centre of gravity.
  3. Your procurement team requires VPC or on-prem. Galileo, or self-hosted Langfuse, or the standalone CompletionKit app.
  4. You only ship on Anthropic, and you want the fastest in-Console loop. Anthropic Workbench, with the understanding that you'll outgrow it the moment you want to evaluate a non-Claude model.
  5. You want the eval tool and the prompt your app runs to be the same versioned artifact, source-available, with an MCP endpoint your agent can drive. CompletionKit, which is the shape we kept needing and the reason we built it.

The honest part: every tool on this list is doing real work, and most teams will be fine with two or three of them in combination. The wrong move is to keep treating prompts as configuration and to ship without any of these in the loop. The right move is to pick one and start; the eval discipline is what compounds.

FAQ

What's the best prompt evaluation tool?

There isn't one best for everyone, because they aim at different jobs. Braintrust is the most complete hosted platform. Langfuse leads on open-source observability. Promptfoo is the test runner you drop into CI. Galileo fits enterprises that need VPC or on-prem deployment. DeepEval handles LLM evaluation as pytest-style unit tests. OpenAI Evals and Anthropic Workbench are good fits inside their own ecosystems. CompletionKit is the source-available pick when you want the eval loop and the prompt-serving layer to be the same versioned artifact.

Which prompt evaluation tools are open source or self-hostable?

OpenAI Evals (MIT), Langfuse (MIT), Promptfoo (MIT), and DeepEval (Apache 2.0) are open source and self-hostable. CompletionKit's engine is source-available under BSL 1.1 and free for any use including production, mountable into a Rails app or run standalone. Braintrust offers self-hosting only on enterprise plans. Galileo is commercial with SaaS, VPC, and on-prem deployment options. Anthropic Workbench is hosted by Anthropic with no self-host path.

Which of these tools serve the prompt to my app, not just test it?

Braintrust, Langfuse, and CompletionKit publish each prompt version behind a stable identifier your app fetches at runtime, so the artifact you tested and the artifact in production are the same object. Galileo offers prompt management features. OpenAI Evals, Promptfoo, and DeepEval are evaluators with no serving layer: they grade outputs, but your prompt lives somewhere else. Anthropic Workbench stores prompts inside the Console rather than serving them as an external API.
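
What "the same object" buys you is easiest to see in code. The sketch below is hypothetical throughout: the URL route, the `PromptRef` and `PromptClient` names, and the assumption that a published version never changes are ours for illustration, not any tool's actual API. The fetch function is injected so the example runs without a network.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PromptRef:
    """A pinned prompt: the same (name, version) pair the eval run scored."""
    name: str
    version: int

    def url(self, base_url: str) -> str:
        # Hypothetical route shape, for illustration only.
        return f"{base_url.rstrip('/')}/prompts/{self.name}/versions/{self.version}"


class PromptClient:
    """Fetches pinned prompt versions and caches them in memory."""

    def __init__(self, base_url: str, fetch: Callable[[str], str]):
        self._base_url = base_url
        self._fetch = fetch  # injected; a real app would pass an HTTP GET here
        self._cache: Dict[PromptRef, str] = {}

    def get(self, ref: PromptRef) -> str:
        # Assumption: a published version is immutable, so caching it is safe.
        if ref not in self._cache:
            self._cache[ref] = self._fetch(ref.url(self._base_url))
        return self._cache[ref]


# Usage with a fake fetcher that records which URLs were requested.
calls = []

def fake_fetch(url: str) -> str:
    calls.append(url)
    return "You are a support assistant. Answer only from the policy text."

client = PromptClient("https://prompts.example.com/api", fake_fetch)
pinned = PromptRef("support-reply", 3)
first = client.get(pinned)
second = client.get(pinned)  # served from the cache, no second request
```

The design choice worth noticing is that the app pins a version rather than asking for "latest": promoting a new prompt becomes a deliberate change to the pinned version, made after that version has been scored.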

Did Promptfoo, Langfuse, and Galileo get acquired?

Yes, all three had acquisitions announced in 2026. ClickHouse acquired Langfuse in January 2026, OpenAI announced its acquisition of Promptfoo in early 2026, and Cisco announced its intent to acquire Galileo in April 2026, folding it into Splunk Observability Cloud. All three are still available, but their roadmaps are now tied to their acquirers.

Which of these can a non-engineer use?

It splits by shape. The pure frameworks are code-first: OpenAI Evals and DeepEval are Python, Promptfoo is a Node CLI driven by YAML, so authoring an eval means writing code. The platforms are friendlier to non-engineers: Braintrust can draft a scorer from a plain-language description, Langfuse configures its LLM judge in the dashboard, Galileo ships managed scorers you don't write yourself, and Anthropic Workbench is in-browser (though Claude-only). CompletionKit sits with the platforms, with one difference worth naming: a metric is just a plain-language rubric scored 1 to 5, simple enough for a PM or domain expert to write and review, and every member is included at a flat price, so bringing non-engineers into the loop costs nothing per seat.


Built by Homemade Software. Did we get a fact wrong about one of the tools above? Tell us: support@completionkit.com or r/completionkit.
