May 31, 2026
MCP server for prompt evals: setting up Claude Code and Cursor
To let a coding agent drive evals against your prompts, point its MCP client at your CompletionKit organization's MCP endpoint. The endpoint lives at https://completionkit.com/orgs/your-org-slug/mcp, authenticates with a Bearer token you create under Settings then API, and exposes a toolbox the agent can call from inside Claude Code or Cursor: list and revise prompts, kick off runs, read per-row scores and rationales, replay the judge against existing outputs. The rest of this piece is the full setup: the Claude Code config, the Cursor config, the tool reference, a first task that exercises three tools end-to-end, and the four mistakes that bite on the first connection.
The conceptual companion to this piece, "Let your coding agent improve your prompts: the automated eval loop", covers why you would want this and what the agent does once connected; read it first if you have not already. This post is the wiring, not the why.
Where MCP came from
The Model Context Protocol was announced by Anthropic on November 25, 2024, built by Anthropic engineers David Soria Parra and Justin Spahr-Summers, and open-sourced alongside SDKs and a handful of reference servers (Google Drive, Slack, GitHub, Postgres, Puppeteer). The point of the protocol is that an MCP-speaking client (Claude Code, Cursor, others) and an MCP-speaking server can find each other at runtime without either side knowing about the other in advance. Before MCP, hooking an agent into an eval tool was a per-agent, per-tool custom integration. After MCP, the agent fetches a tool list from the server on connect and uses it. CompletionKit's MCP server is one of those servers. That is the entire framing; everything below is configuration.
Connecting Claude Code
Claude Code talks to MCP servers over streamable HTTP. One command registers the connection on the machine you run the agent from.
- Create a token. In CompletionKit, go to Settings then API tokens, click New token, name it after the machine you're connecting (so it is revokable per device), and copy the token before navigating away. You see it once.
- Register the server. Run, in your terminal:

      claude mcp add completion-kit \
        --transport http \
        --url https://completionkit.com/orgs/your-org-slug/mcp \
        --header "Authorization: Bearer YOUR_TOKEN"

  Swap your-org-slug for the slug in your dashboard URL (the part after /orgs/) and YOUR_TOKEN for the one you just copied.
- Confirm the connection. In a Claude Code session, ask the agent to list its MCP tools, or run /mcp. You should see completion-kit listed and a tool count in the dozens. If the count is zero, the connection is wrong; go to the troubleshooting block below.
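To make the "confirm the connection" step less of a black box, here is a rough Python sketch of the two requests an MCP client sends when it connects: initialize, then tools/list. It only builds the payloads and headers; it does not send anything. The helper names, the clientInfo values, and the protocolVersion string are assumptions for illustration, and the version the CompletionKit server actually negotiates may differ.

```python
import json


def mcp_endpoint(org_slug):
    """Build the organization-scoped MCP endpoint URL used in this post."""
    return f"https://completionkit.com/orgs/{org_slug}/mcp"


def auth_headers(token):
    """Headers for a streamable HTTP MCP request with Bearer auth."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        # Streamable HTTP clients advertise both response content types.
        "Accept": "application/json, text/event-stream",
    }


def initialize_request(request_id=1):
    """JSON-RPC initialize message (first request of the MCP handshake)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumed protocol revision; the server may negotiate another.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client name, for illustration only.
            "clientInfo": {"name": "manual-check", "version": "0.1"},
        },
    }


def tools_list_request(request_id=2):
    """JSON-RPC tools/list message; the reply is what fills the tool count."""
    return {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}


if __name__ == "__main__":
    print(mcp_endpoint("your-org-slug"))
    print(json.dumps(initialize_request(), indent=2))
    print(json.dumps(tools_list_request(), indent=2))
```

A tool count of zero in /mcp usually means the second of these requests never got a useful answer, which is why a fresh session (and a fresh handshake) is the first thing to try.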
Connecting Cursor
Cursor reads MCP servers from a JSON config file. The config can live globally at ~/.cursor/mcp.json or per project at .cursor/mcp.json in the repo root. Per project is usually what you want, because the token in it is scoped to that organization.
- Create a token. Same as for Claude Code: Settings then API tokens in CompletionKit, copy the value.
- Write the config. Create .cursor/mcp.json at the root of the project (or edit the global file) and add:

      {
        "mcpServers": {
          "completion-kit": {
            "url": "https://completionkit.com/orgs/your-org-slug/mcp",
            "headers": {
              "Authorization": "Bearer YOUR_TOKEN"
            }
          }
        }
      }

  Swap the slug and the token as above. If you already have other MCP servers in the file, add completion-kit as another entry inside mcpServers.
- Reload Cursor. Open Settings then MCP; you should see completion-kit in the list with a green status and a tool count. Click it to expand the tools and confirm the names match the reference table below.
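If the file already lists other servers, the merge can be scripted instead of hand-edited. The following is a minimal Python sketch, assuming the config has the mcpServers shape shown above; the function names are made up for this example, and the slug and token are placeholders.

```python
import json
from pathlib import Path


def merge_completion_kit(config, org_slug, token):
    """Return a copy of an mcp.json config with a completion-kit entry added.

    Existing entries under mcpServers are kept; the input dict is not mutated.
    """
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["completion-kit"] = {
        "url": f"https://completionkit.com/orgs/{org_slug}/mcp",
        "headers": {"Authorization": f"Bearer {token}"},
    }
    merged["mcpServers"] = servers
    return merged


def update_config_file(path, org_slug, token):
    """Read the config at path if it exists, merge the entry, write it back."""
    path = Path(path)
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    merged = merge_completion_kit(existing, org_slug, token)
    path.write_text(json.dumps(merged, indent=2) + "\n")


if __name__ == "__main__":
    # Demonstrates the merge in memory only; nothing is written to disk here.
    existing = {"mcpServers": {"other-server": {"url": "https://example.com/mcp"}}}
    print(json.dumps(merge_completion_kit(existing, "your-org-slug", "YOUR_TOKEN"), indent=2))
```

Calling update_config_file(".cursor/mcp.json", slug, token) from the repo root would apply the same merge to the per-project file. Because that file then contains a live token, it is worth checking that it is not committed to version control.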
For the deep protocol details (headers, the initialize handshake, error codes), the API reference renders a copy of these snippets pre-filled with your slug and a real token, plus the full REST equivalents for the same endpoints.
The tool reference
Every tool the CompletionKit MCP server exposes today, grouped by resource. Names match the wire protocol exactly; arguments shown are the required ones plus the most useful optional ones. The full input schema (including every optional argument) lives in the API reference and is also returned by the server's tools/list call, which is what the agent reads on connect.
| Tool | What it does | Key arguments |
|---|---|---|
| prompts_list | List all prompts in the organization | none |
| prompts_get | Get one prompt by id | id |
| prompts_create | Create a new prompt | name, template, llm_model |
| prompts_update | Update a prompt, or, if it has runs against it, create a new version descended from the current one | id, fields to change |
| prompts_delete | Delete a prompt | id |
| prompts_publish | Mark a prompt version as the current one | id |
| runs_list | List all runs | none |
| runs_get | Get one run by id, including aggregate scores | id |
| runs_create | Create a run record. Omit prompt_id with output_column for a judge-only run | name, prompt_id, dataset_id, metric_ids |
| runs_update | Update a run's dataset, judge model, or metrics before generation | id |
| runs_delete | Delete a run | id |
| runs_generate | Generate responses for a run using its prompt and dataset, then score them | id |
| responses_list | List per-row responses for a run, including each row's score and rationale | run_id |
| responses_get | Get one row's response, score, and rationale | run_id, id |
| datasets_list | List all datasets | none |
| datasets_get | Get one dataset by id | id |
| datasets_create | Create a dataset from raw CSV data | name, csv_data |
| datasets_update | Replace a dataset's CSV or rename it | id |
| datasets_delete | Delete a dataset | id |
| metrics_list | List all metrics | none |
| metrics_get | Get one metric by id | id |
| metrics_create | Create a metric with a rubric (1–5 bands) | name, instruction, rubric_bands |
| metrics_update | Update a metric's instruction or rubric (creates a new MetricVersion) | id |
| metrics_delete | Delete a metric | id |
| metric_groups_list | List all metric groups | none |
| metric_groups_get | Get a metric group by id | id |
| metric_groups_create | Create a group of metrics that can be applied to a run together | name, metric_ids |
| metric_groups_update | Update a metric group | id |
| metric_groups_delete | Delete a metric group | id |
| metric_versions_list | List every version of a metric, newest first | metric_id |
| metric_versions_publish | Make a specific metric version the current one (also used to revert to an older one) | metric_version_id |
| metric_versions_dismiss | Destroy a draft metric version (published versions are refused) | metric_version_id |
| provider_credentials_list | List stored provider credentials (API keys are not exposed) | none |
| provider_credentials_get | Get a provider credential record | id |
| provider_credentials_create | Store a provider API key for OpenAI, Anthropic, Ollama, or OpenRouter | provider, api_key |
| provider_credentials_update | Update a stored provider credential | id |
| provider_credentials_delete | Delete a provider credential | id |
| tags_list | List all tags | none |
| tags_get | Get a tag by id | id |
| tags_create | Create a tag (color is auto-assigned) | name |
| tags_update | Rename a tag | id, name |
| tags_delete | Delete a tag and remove it from every linked resource | id |
| agreements_list | List human verdicts on judge scores, optionally filtered by run, response, metric, or author | none required |
| agreements_create | Record a human verdict (agree, disagree, borderline) on the judge's score for one row | run_id, response_id, metric_id, verdict |
| judges_suggest | Ask the model to rewrite the metric's instruction as N draft variants targeted at recent disagreements | metric_id |
| judges_replay | Re-judge an existing dataset column with the current judge (a judge-only run) | name, metric_id, dataset_id, judge_model |
| judges_compare | Compare two metric versions' calibration agreement side by side, with a recommendation | metric_id, metric_version_a_id, metric_version_b_id |
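On the wire, each row in the table becomes a JSON-RPC tools/call request carrying the tool name and an arguments object. The sketch below builds such a request in Python for a handful of the read-only tools; the REQUIRED_ARGS dictionary is copied by hand from the table above (it is not fetched from the server), tool_call is a made-up helper name, and the run id 42 is an invented value.

```python
import json

# Key arguments for a few read-only tools, transcribed from the table above.
REQUIRED_ARGS = {
    "runs_list": [],
    "runs_get": ["id"],
    "responses_list": ["run_id"],
    "responses_get": ["run_id", "id"],
}


def tool_call(name, arguments, request_id=1):
    """Build an MCP tools/call request, checking the key arguments first."""
    if name not in REQUIRED_ARGS:
        raise KeyError(f"tool not covered by this sketch: {name}")
    missing = [arg for arg in REQUIRED_ARGS[name] if arg not in arguments]
    if missing:
        raise ValueError(f"{name} is missing arguments: {missing}")
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


if __name__ == "__main__":
    # 42 is a made-up run id for illustration.
    print(json.dumps(tool_call("responses_list", {"run_id": 42}), indent=2))
```

In normal use the agent builds these requests itself from the schema returned by tools/list; writing one out by hand is mainly useful when debugging a call that keeps failing.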
A first task to try
Paste the following brief into Claude Code or Cursor once the connection is up. It exercises runs_list, runs_get, and responses_list end to end, so by the time the agent reports back, you'll know three things at once: the auth works, the tools are visible, and the data is coming through with per-row detail. It does not write anything, so it is safe to run before you trust the agent with revisions.
Use the completion-kit MCP server to find my most recent run. Show me its name, its average score per metric, and the three lowest-scoring rows with the judge's rationale for each. Do not create or modify anything.
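For a sense of what the agent does with the data once responses_list returns, here is a small Python sketch of the summarising step the brief asks for. The field names (row, metric, score, rationale) and the sample values are hypothetical; the real response shape is whatever the server's schema reports.

```python
from collections import defaultdict
from statistics import mean


def average_by_metric(responses):
    """Average score per metric across all rows of a run."""
    scores = defaultdict(list)
    for response in responses:
        scores[response["metric"]].append(response["score"])
    return {metric: mean(values) for metric, values in scores.items()}


def lowest_scoring(responses, n=3):
    """The n rows with the lowest scores, worst first."""
    return sorted(responses, key=lambda response: response["score"])[:n]


if __name__ == "__main__":
    # Invented sample rows, standing in for a responses_list result.
    sample = [
        {"row": 1, "metric": "accuracy", "score": 5, "rationale": "Matches the reference."},
        {"row": 2, "metric": "accuracy", "score": 2, "rationale": "Misses the second question."},
        {"row": 3, "metric": "accuracy", "score": 1, "rationale": "Answers a different question."},
        {"row": 4, "metric": "accuracy", "score": 4, "rationale": "Correct but verbose."},
    ]
    print(average_by_metric(sample))
    for response in lowest_scoring(sample):
        print(response["row"], response["score"], response["rationale"])
```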
When this works, swap the brief for one that actually iterates. The conceptual companion piece, the automated eval loop, has an example two-sentence brief that hands the loop to the agent and tells it when to stop. For background on what the scores mean and why they are trustworthy, see LLM-as-a-judge; for what a run and a metric actually are, see what is prompt evaluation.
Troubleshooting
Four mistakes account for almost every first-connection failure. In order of how often we see them:
- The Bearer token is wrong or revoked. Symptom: the agent reports a 401, or Claude Code shows the server as failed in /mcp output, or Cursor's MCP panel shows it as red. Fix: regenerate the token under Settings then API tokens, paste the new value into the config, restart the agent. Tokens never appear in URLs and are never echoed back on subsequent views, so a copy-paste mistake here is the single most common source of trouble.
- The endpoint URL is wrong. Symptom: the connection succeeds but the server returns a JSON-RPC error or a 404. Almost always the slug between /orgs/ and /mcp is wrong, or someone wrote https://completionkit.com/mcp with no organization in it (there isn't a global endpoint; every endpoint is scoped to an organization). Fix: visit your dashboard, copy the slug straight out of the URL, paste it into the config.
- The tool list is empty. Symptom: the server is listed as connected but the agent says it has no tools, or /mcp shows a tool count of zero. Almost always the client hasn't completed the MCP initialize handshake yet (older clients did it lazily). Fix: restart the agent, or in Claude Code start a new session; the handshake fires on session start and the tool list populates immediately afterwards.
- No organization context. Symptom: the agent can list its tools but every call returns "session not initialized" or 400. The server scopes all data by organization, and an endpoint without a valid org slug has no organization to scope to. Fix: same as #2 above, get the slug right; this and "endpoint URL is wrong" are siblings.
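The four cases above can be turned into a quick pre-flight check. This is a rough Python sketch, not part of CompletionKit: the regular expression only checks the /orgs/slug/mcp shape described in this post (the set of characters a real slug may contain is an assumption), and the status-to-cause mapping simply restates the symptoms listed above.

```python
import re
from typing import Optional

# Shape of an organization-scoped endpoint; slug character rules are assumed.
ENDPOINT_PATTERN = re.compile(r"^https://completionkit\.com/orgs/([^/]+)/mcp/?$")


def check_endpoint(url: str) -> Optional[str]:
    """Return the org slug if the URL looks right, else None."""
    match = ENDPOINT_PATTERN.match(url)
    if match is None or match.group(1) == "your-org-slug":
        # Either no organization in the URL or the placeholder was left in.
        return None
    return match.group(1)


def likely_cause(http_status: Optional[int], tool_count: Optional[int]) -> str:
    """Map an observed symptom to the most likely of the four mistakes."""
    if http_status == 401:
        return "token is wrong or revoked: regenerate it and update the config"
    if http_status in (400, 404):
        return "endpoint URL is wrong: check the org slug between /orgs/ and /mcp"
    if tool_count == 0:
        return "handshake not completed: restart the agent or start a new session"
    return "not one of the four common mistakes: try the request by hand"


if __name__ == "__main__":
    print(check_endpoint("https://completionkit.com/mcp"))
    print(likely_cause(401, None))
```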
If something fails outside of these four, the API reference page in your organization renders the same curl examples the MCP server uses internally; running one of those by hand confirms whether the auth and the URL are good independent of the agent.
FAQ
Where do I get the MCP endpoint URL?
Every CompletionKit organization has its own MCP endpoint at https://completionkit.com/orgs/your-org-slug/mcp. The slug is the one in your dashboard URL after /orgs/. The API reference page in your organization renders the full URL pre-filled with your slug, alongside a Bearer token you can create under Settings then API.
Do I need a paid plan to use the MCP server?
No. The MCP server is available on the free tier. The free plan has a monthly run quota, and an agent driving the loop spends from the same quota a human would, so a paid plan matters once the agent is iterating frequently. The connection itself, the tool list, and read-only calls cost nothing.
Does the agent see my prompt history?
Only the parts you let it list. The tools the server exposes are scoped to your organization and require a Bearer token tied to it. An agent that has the token can call prompts_list, runs_list, datasets_list, and so on, and read whatever those return. It cannot see prompts in other organizations, and it cannot read past the Bearer token's permissions. Revoking the token under Settings then API cuts the agent off immediately.
What happens if I run the same create_revision call twice?
CompletionKit does not have a tool literally named create_revision. New revisions land through prompts_update on a prompt that already has runs against it: the update is captured as a new immutable version rather than an in-place edit. Calling prompts_update twice with the same arguments creates two distinct versions, each with its own id, each pointing back at the same parent. That is not a bug, it is the versioning model. If the agent re-issues the same call by accident, you have one extra unpublished version sitting in the prompt's history and nothing else changes.
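The behaviour described in this answer can be illustrated with a toy in-memory model. This Python sketch is not CompletionKit's implementation: the class and field names are invented, and it assumes the prompt already has runs against it, so every update creates a new version.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class PromptVersion:
    id: int
    template: str
    parent_id: Optional[int]
    published: bool = False


class PromptHistory:
    """Toy model of a prompt whose updates become new immutable versions."""

    def __init__(self, template):
        self._ids = count(1)
        first = PromptVersion(next(self._ids), template, None, published=True)
        self.versions = [first]

    @property
    def current(self):
        """The most recently published version."""
        return next(v for v in reversed(self.versions) if v.published)

    def update(self, template):
        """Mimic prompts_update: append an unpublished child of the current version."""
        version = PromptVersion(next(self._ids), template, self.current.id)
        self.versions.append(version)
        return version


if __name__ == "__main__":
    history = PromptHistory("Summarise: {{text}}")
    first = history.update("Summarise briefly: {{text}}")
    second = history.update("Summarise briefly: {{text}}")  # accidental repeat
    print(first.id, second.id, first.parent_id, second.parent_id)
    print(history.current.id)
```

In this model the repeated call leaves one extra unpublished version and the current version unchanged, which mirrors the outcome described above.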
Built by Homemade Software. Disagree with any of this? support@completionkit.com.