June 12, 2026 · The CompletionKit team

Your eval tool shouldn't work for the model it's grading

In March 2026, OpenAI agreed to acquire Promptfoo. Congratulations to the team, sincerely. It's a good tool, we've said so in print, and it stays MIT-licensed. This isn't a complaint about them. It's a question about structure.

If you've run an LLM judge for any length of time you know about self-preference bias. Ask a model to grade outputs and it scores its own family a notch higher than the competition, reliably enough that the standard advice, ours included, is to judge with a different family than the one you're evaluating. Nobody thinks the model is cheating. Its own writing just sits closer to its own distribution. The bias is structural, which is why you design around it rather than asking the model to promise to be fair.
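
The different-family rule is easy to state and easy to forget in a config file, so it helps to make it a guard. Below is a minimal sketch, assuming model names carry a recognizable prefix. The prefix table, the helper names `family_of` and `pick_judge`, and the example model names are all hypothetical, not part of any particular tool's API.

```python
# Hypothetical prefix-to-family table; extend it for the models in your own stack.
FAMILY_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
    "llama": "meta",
}


def family_of(model: str) -> str:
    """Return the model family for a name, or "unknown" if no prefix matches."""
    name = model.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    return "unknown"


def pick_judge(generator: str, candidates: list[str]) -> str:
    """Return the first candidate judge whose family is known and differs from the generator's."""
    gen_family = family_of(generator)
    if gen_family == "unknown":
        # Without a known family we cannot show the judge is independent.
        raise ValueError(f"cannot determine family of generator {generator!r}")
    for judge in candidates:
        judge_family = family_of(judge)
        if judge_family != "unknown" and judge_family != gen_family:
            return judge
    raise ValueError(f"no candidate judge outside family {gen_family!r}")
```

Calling `pick_judge("gpt-example", ["gpt-other", "claude-example"])` skips the same-family candidate and returns `"claude-example"`. The guard fails loudly rather than falling back to a same-family judge, which is the design-around-it posture: the check lives in the pipeline, not in anyone's good intentions.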

An evaluation company owned by a model lab is the same diagram one level up. Nobody at the acquired tool has to do anything wrong. The conflict arrives with the org chart. The roadmap now has an owner with a model to sell. The default judge setting has a natural answer. The "works best with" recommendation has gravity. And your eval traffic, which is to say your prompts, your failure cases, and the exact rows where a competitor's model beats the house model, flows through infrastructure the lab controls. Accounting learned this lesson the expensive way. Arthur Andersen didn't fail at Enron because its auditors couldn't count; it failed because the auditor was also a consultant on the client's payroll, and in 2002 Congress passed the Sarbanes-Oxley Act, which walled off what an auditor is allowed to sell the company it audits. Conflicts of interest don't need villains. That's why they're called structural.

So the questions to ask any eval vendor are boring on purpose. Who owns them, and does the owner sell a model? What's the default judge, and who benefits from the default? Where does your eval data live, and who can see it? If the answers change after an acquisition, what's your exit?

None of this is unique to us, and that's rather the point. Braintrust and Langfuse are independent too, as of this writing. We're not claiming a moat. We're suggesting a checklist item.

For the record, here's where we sit. CompletionKit is independent. It's source-available under BSL 1.1, so you can read the scoring code you're trusting. It's self-hostable, so your eval data never has to leave your infrastructure. And the judge is yours to choose: OpenAI, Anthropic, Azure AI Foundry, a local model through Ollama, or a hundred others through OpenRouter, with the standing advice to grade with a different family than you generate with. A measuring instrument should be the least conflicted thing in your stack.

If you're weighing this today, the side-by-sides are here: CompletionKit vs Promptfoo and CompletionKit vs Braintrust.

Run your evals somewhere independent
