
MCP server for prompt evals: setting up Claude Code and Cursor

To let a coding agent drive evals against your prompts, point its MCP client at your CompletionKit organization's MCP endpoint. The endpoint lives at https://completionkit.com/orgs/your-org-slug/mcp, authenticates with a Bearer token you create under Settings then API tokens, and exposes a toolbox the agent can call from inside Claude Code or Cursor: list and revise prompts, kick off runs, read per-row scores and rationales, and replay the judge against existing outputs. The rest of this piece is the full setup: the Claude Code config, the Cursor config, the tool reference, a first task that exercises three tools end-to-end, and the four mistakes that bite on the first connection.

If you have not read the conceptual companion to this piece, "Let your coding agent improve your prompts: the automated eval loop", start there: it covers why you would want this and what the agent does once connected. This post is the wiring, not the why.

Where MCP came from

The Model Context Protocol was announced by Anthropic on November 25, 2024, built by Anthropic engineers David Soria Parra and Justin Spahr-Summers, and open-sourced alongside SDKs and a handful of reference servers (Google Drive, Slack, GitHub, Postgres, Puppeteer). The point of the protocol is that an MCP-speaking client (Claude Code, Cursor, others) and an MCP-speaking server can find each other at runtime without either side knowing about the other in advance. Before MCP, hooking an agent into an eval tool was a per-agent, per-tool custom integration. After MCP, the agent fetches a tool list from the server on connect and uses it. CompletionKit's MCP server is one of those servers. That is the entire framing; everything below is configuration.

Connecting Claude Code

Claude Code talks to MCP servers over streamable HTTP. One command registers the connection on the machine you run the agent from.

  1. Create a token. In CompletionKit, go to Settings then API tokens, click New token, name it after the machine you're connecting (so it is revocable per device), and copy the token before navigating away. You see it once.
  2. Register the server. Run, in your terminal:
    claude mcp add --transport http completion-kit \
      https://completionkit.com/orgs/your-org-slug/mcp \
      --header "Authorization: Bearer YOUR_TOKEN"
    
    Swap your-org-slug for the slug in your dashboard URL (the part after /orgs/) and YOUR_TOKEN for the one you just copied.
  3. Confirm the connection. In a Claude Code session, ask the agent to list its MCP tools, or run /mcp. You should see completion-kit listed and a tool count in the dozens. If the count is zero, the connection is wrong; go to the troubleshooting block below.
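
Most first-connection failures trace back to the two values swapped in during step 2, so it can help to build them programmatically before pasting. The following is a minimal sketch, not part of CompletionKit: the helper names are hypothetical, and the slug pattern (lowercase letters, digits, hyphens) is an assumption, since the article does not say which characters a slug may contain.

```python
import re

BASE_URL = "https://completionkit.com"  # host taken from the article

# Assumption: slugs are lowercase letters, digits, and single hyphens.
# Loosen this pattern if your dashboard URL shows something else.
SLUG_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def mcp_endpoint(org_slug: str) -> str:
    """Return the organization-scoped MCP endpoint for a slug."""
    if not SLUG_PATTERN.fullmatch(org_slug):
        raise ValueError(f"unexpected org slug: {org_slug!r}")
    return f"{BASE_URL}/orgs/{org_slug}/mcp"


def auth_header(token: str) -> str:
    """Return the Authorization header value for a token."""
    token = token.strip()  # drop stray whitespace picked up while copying
    if not token or token == "YOUR_TOKEN":
        raise ValueError("replace the placeholder with a real token")
    return f"Bearer {token}"
```

With a slug of acme, mcp_endpoint("acme") yields the URL to register, and auth_header(...) yields the value that goes after "Authorization: " in the header.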

Connecting Cursor

Cursor reads MCP servers from a JSON config file. The config can live globally at ~/.cursor/mcp.json or per project at .cursor/mcp.json in the repo root. Per project is usually what you want, because the token in it is scoped to that organization.

  1. Create a token. Same as for Claude Code: Settings then API tokens in CompletionKit, copy the value.
  2. Write the config. Create .cursor/mcp.json at the root of the project (or edit the global file) and add:
    {
      "mcpServers": {
        "completion-kit": {
          "url": "https://completionkit.com/orgs/your-org-slug/mcp",
          "headers": {
            "Authorization": "Bearer YOUR_TOKEN"
          }
        }
      }
    }
    
    Swap the slug and the token as above. If you already have other MCP servers in the file, add completion-kit as another entry inside mcpServers.
  3. Reload Cursor. Open Settings then MCP; you should see completion-kit in the list with a green status and a tool count. Click it to expand the tools and confirm the names match the reference table below.
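
If the config file already lists other MCP servers, hand-editing JSON is an easy place to drop a brace. One way to add the entry without touching the others is sketched below. The function name is hypothetical; the file layout (an mcpServers object with url and headers per entry) is the one shown in step 2.

```python
import json
from pathlib import Path


def add_completion_kit(config_path: Path, org_slug: str, token: str) -> dict:
    """Add or replace the completion-kit entry, keeping other servers intact."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    # setdefault keeps whatever servers are already configured.
    servers = config.setdefault("mcpServers", {})
    servers["completion-kit"] = {
        "url": f"https://completionkit.com/orgs/{org_slug}/mcp",
        "headers": {"Authorization": f"Bearer {token}"},
    }

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_completion_kit(Path(".cursor/mcp.json"), "your-org-slug", token) from the repo root writes the per-project file; pass Path.home() / ".cursor" / "mcp.json" for the global one.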

For the deep protocol details (headers, the initialize handshake, error codes), the API reference renders a copy of these snippets pre-filled with your slug and a real token, plus the full REST equivalents for the same endpoints.

The tool reference

Every tool the CompletionKit MCP server exposes today, grouped by resource. Names match the wire protocol exactly; arguments shown are the required ones plus the most useful optional ones. The full input schema (including every optional argument) lives in the API reference and is also returned by the server's tools/list call, which is what the agent reads on connect.

| Tool | What it does | Key arguments |
| --- | --- | --- |
| prompts_list | List all prompts in the organization | none |
| prompts_get | Get one prompt by id | id |
| prompts_create | Create a new prompt | name, template, llm_model |
| prompts_update | Update a prompt, or, if it has runs against it, create a new version descended from the current one | id, fields to change |
| prompts_delete | Delete a prompt | id |
| prompts_publish | Mark a prompt version as the current one | id |
| runs_list | List all runs | none |
| runs_get | Get one run by id, including aggregate scores | id |
| runs_create | Create a run record. Omit prompt_id with output_column for a judge-only run | name, prompt_id, dataset_id, metric_ids |
| runs_update | Update a run's dataset, judge model, or metrics before generation | id |
| runs_delete | Delete a run | id |
| runs_generate | Generate responses for a run using its prompt and dataset, then score them | id |
| responses_list | List per-row responses for a run, including each row's score and rationale | run_id |
| responses_get | Get one row's response, score, and rationale | run_id, id |
| datasets_list | List all datasets | none |
| datasets_get | Get one dataset by id | id |
| datasets_create | Create a dataset from raw CSV data | name, csv_data |
| datasets_update | Replace a dataset's CSV or rename it | id |
| datasets_delete | Delete a dataset | id |
| metrics_list | List all metrics | none |
| metrics_get | Get one metric by id | id |
| metrics_create | Create a metric with a rubric (1–5 bands) | name, instruction, rubric_bands |
| metrics_update | Update a metric's instruction or rubric (creates a new MetricVersion) | id |
| metrics_delete | Delete a metric | id |
| metric_groups_list | List all metric groups | none |
| metric_groups_get | Get a metric group by id | id |
| metric_groups_create | Create a group of metrics that can be applied to a run together | name, metric_ids |
| metric_groups_update | Update a metric group | id |
| metric_groups_delete | Delete a metric group | id |
| metric_versions_list | List every version of a metric, newest first | metric_id |
| metric_versions_publish | Make a specific metric version the current one (also used to revert to an older one) | metric_version_id |
| metric_versions_dismiss | Destroy a draft metric version (published versions are refused) | metric_version_id |
| provider_credentials_list | List stored provider credentials (API keys are not exposed) | none |
| provider_credentials_get | Get a provider credential record | id |
| provider_credentials_create | Store a provider API key for OpenAI, Anthropic, Ollama, or OpenRouter | provider, api_key |
| provider_credentials_update | Update a stored provider credential | id |
| provider_credentials_delete | Delete a provider credential | id |
| tags_list | List all tags | none |
| tags_get | Get a tag by id | id |
| tags_create | Create a tag (color is auto-assigned) | name |
| tags_update | Rename a tag | id, name |
| tags_delete | Delete a tag and remove it from every linked resource | id |
| agreements_list | List human verdicts on judge scores, optionally filtered by run, response, metric, or author | none required |
| agreements_create | Record a human verdict (agree, disagree, borderline) on the judge's score for one row | run_id, response_id, metric_id, verdict |
| judges_suggest | Ask the model to rewrite the metric's instruction as N draft variants targeted at recent disagreements | metric_id |
| judges_replay | Re-judge an existing dataset column with the current judge (a judge-only run) | name, metric_id, dataset_id, judge_model |
| judges_compare | Compare two metric versions' calibration agreement side by side, with a recommendation | metric_id, metric_version_a_id, metric_version_b_id |
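
Because the tool names follow a resource_action pattern, a tools/list result can be regrouped into the same by-resource view as the reference above. The sketch below assumes the standard MCP result shape (a tools array whose entries carry name and inputSchema); the sample payload is illustrative and is not real CompletionKit output, and the argument types in it are guesses.

```python
from collections import defaultdict


def group_tools_by_resource(tools_list_result: dict) -> dict:
    """Map each resource prefix to the actions available on it."""
    groups = defaultdict(list)
    for tool in tools_list_result.get("tools", []):
        # Split on the last underscore so metric_groups_get -> metric_groups.
        resource, _, action = tool["name"].rpartition("_")
        groups[resource].append(action)
    return dict(groups)


def required_arguments(tool: dict) -> list:
    """Read the required argument names from a tool's JSON Schema."""
    return list(tool.get("inputSchema", {}).get("required", []))


# Illustrative sample only; the real schemas come from the server.
sample = {
    "tools": [
        {"name": "prompts_list", "inputSchema": {"type": "object", "properties": {}}},
        {
            "name": "metric_groups_get",
            "inputSchema": {
                "type": "object",
                "properties": {"id": {"type": "string"}},
                "required": ["id"],
            },
        },
        {
            "name": "prompts_get",
            "inputSchema": {
                "type": "object",
                "properties": {"id": {"type": "string"}},
                "required": ["id"],
            },
        },
    ]
}
```

Splitting on the last underscore rather than the first matters for multi-word resources such as metric_groups and provider_credentials.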

A first task to try

Paste the following brief into Claude Code or Cursor once the connection is up. It exercises runs_list, runs_get, and responses_list end to end, so by the time the agent reports back, you'll know three things at once: the auth works, the tools are visible, and the data is coming through with per-row detail. It does not write anything, so it is safe to run before you trust the agent with revisions.

Use the completion-kit MCP server to find my most recent run. Show me its name, its average score per metric, and the three lowest-scoring rows with the judge's rationale for each. Do not create or modify anything.

When this works, swap the brief for one that actually iterates. The conceptual companion piece, the automated eval loop, has an example two-sentence brief that hands the loop to the agent and tells it when to stop. For background on what the scores mean and why they are trustworthy, see LLM-as-a-judge; for what a run and a metric actually are, see what is prompt evaluation.
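
For a sense of what the agent sends while carrying out that read-only brief, the three calls can be sketched as MCP tools/call requests. The envelope (jsonrpc, id, method, params.name, params.arguments) follows the MCP specification and the argument names come from the tool reference; the score field on each row is an assumption about the response shape, which the article does not spell out.

```python
import itertools
from typing import Optional

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: Optional[dict] = None) -> dict:
    """Build one JSON-RPC request that invokes an MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def read_only_plan(run_id) -> list:
    """Calls that follow runs_list once the most recent run id is known."""
    return [
        tool_call("runs_get", {"id": run_id}),
        tool_call("responses_list", {"run_id": run_id}),
    ]


def lowest_scoring(rows: list, n: int = 3) -> list:
    """Pick the n lowest-scoring rows; assumes each row has a 'score' key."""
    return sorted(rows, key=lambda row: row["score"])[:n]
```

None of the three tools writes anything, which is what makes the brief safe to run first.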

Troubleshooting

Four mistakes account for almost every first-connection failure. In order of how often we see them:

  1. The Bearer token is wrong or revoked. Symptom: the agent reports a 401, or Claude Code shows the server as failed in /mcp output, or Cursor's MCP panel shows it as red. Fix: regenerate the token under Settings then API tokens, paste the new value into the config, restart the agent. Tokens never appear in URLs and are never echoed back on subsequent views, so a copy-paste mistake here is the single most common source of trouble.
  2. The endpoint URL is wrong. Symptom: the connection succeeds but the server returns a JSON-RPC error or a 404. Almost always the slug between /orgs/ and /mcp is wrong, or someone wrote https://completionkit.com/mcp with no organization in it (there isn't a global endpoint; every endpoint is scoped to an organization). Fix: visit your dashboard, copy the slug straight out of the URL, paste it into the config.
  3. The tool list is empty. Symptom: the server is listed as connected but the agent says it has no tools, or /mcp shows a tool count of zero. Almost always the client hasn't completed the MCP initialize handshake yet (older clients did it lazily). Fix: restart the agent, or in Claude Code start a new session; the handshake fires on session start and the tool list populates immediately afterwards.
  4. No organization context. Symptom: the agent can list its tools but every call returns "session not initialized" or 400. The server scopes all data by organization, and an endpoint without a valid org slug has no organization to scope to. Fix: same as #2 above, get the slug right; this and "endpoint URL is wrong" are siblings.
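
The four cases above can be condensed into a small lookup, which is handy when triaging a failed connection from a log line. This is a sketch of the article's own symptom list, not a CompletionKit utility; the function name and the short labels are hypothetical.

```python
from typing import Optional

FIXES = {
    "token": "Regenerate the token under Settings then API tokens and restart the agent.",
    "url": "Copy the org slug from the dashboard URL into the endpoint.",
    "handshake": "Restart the agent or start a new session so initialize runs.",
    "org-context": "Fix the org slug; the endpoint has no organization to scope to.",
    "other": "Run an API reference curl example by hand to isolate auth and URL.",
}


def diagnose(
    http_status: Optional[int] = None,
    tool_count: Optional[int] = None,
    error_text: str = "",
) -> str:
    """Map an observed symptom to one of the four first-connection mistakes."""
    if http_status == 401:
        return "token"
    if http_status == 404:
        return "url"
    if http_status == 400 or "session not initialized" in error_text.lower():
        return "org-context"
    if tool_count == 0:
        return "handshake"
    return "other"
```

For example, FIXES[diagnose(http_status=401)] returns the token advice, and FIXES[diagnose(tool_count=0)] returns the restart advice.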

If something fails outside of these four, the API reference page in your organization renders the same curl examples the MCP server uses internally; running one of those by hand confirms whether the auth and the URL are good independent of the agent.
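
If curl is not to hand, the same by-hand check can be sketched with the Python standard library. The request below is an MCP initialize call over streamable HTTP; the protocolVersion value and the clientInfo fields are assumptions, so prefer whatever the API reference page shows if it differs.

```python
import json
import urllib.error
import urllib.request


def build_initialize_request(endpoint: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an MCP initialize request."""
    body = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumption: adjust to the server's version
            "capabilities": {},
            "clientInfo": {"name": "manual-check", "version": "0.0.1"},  # hypothetical
        },
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            # Streamable HTTP clients accept both plain JSON and an event stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


def check_connection(endpoint: str, token: str) -> int:
    """Send the request and return the HTTP status, including 4xx/5xx codes."""
    request = build_initialize_request(endpoint, token)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

A 401 from check_connection points at the token and a 404 at the slug, independent of anything the agent is doing.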

FAQ

Where do I get the MCP endpoint URL?

Every CompletionKit organization has its own MCP endpoint at https://completionkit.com/orgs/your-org-slug/mcp. The slug is the one in your dashboard URL after /orgs/. The API reference page in your organization renders the full URL pre-filled with your slug, alongside a Bearer token you can create under Settings then API tokens.

Do I need a paid plan to use the MCP server?

No. The MCP server is available on the free tier. The free plan has a monthly run quota, and an agent driving the loop spends from the same quota a human would, so a paid plan matters once the agent is iterating frequently. The connection itself, the tool list, and read-only calls cost nothing.

Does the agent see my prompt history?

Only the parts you let it list. The tools the server exposes are scoped to your organization and require a Bearer token tied to it. An agent that has the token can call prompts_list, runs_list, datasets_list, and so on, and read whatever those return. It cannot see prompts in other organizations, and it cannot read past the Bearer token's permissions. Revoking the token under Settings then API tokens cuts the agent off immediately.

What happens if I run the same create_revision call twice?

CompletionKit does not have a tool literally named create_revision. New revisions land through prompts_update on a prompt that already has runs against it: the update is captured as a new immutable version rather than an in-place edit. Calling prompts_update twice with the same arguments creates two distinct versions, each with its own id, each pointing back at the same parent. That is not a bug, it is the versioning model. If the agent re-issues the same call by accident, you have one extra unpublished version sitting in the prompt's history and nothing else changes.
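
The behavior described above can be pictured with a toy in-memory model. This is an illustration of the versioning idea only, not CompletionKit's implementation; the class and field names are hypothetical.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    """An immutable snapshot of a prompt template."""
    id: int
    parent_id: Optional[int]
    template: str


@dataclass
class PromptHistory:
    """Append-only list of versions plus a pointer to the published one."""
    versions: list = field(default_factory=list)
    published_id: Optional[int] = None
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1), repr=False)

    def update(self, parent_id: Optional[int], template: str) -> PromptVersion:
        # Every update appends a new version; nothing is edited in place
        # and identical arguments are not deduplicated.
        version = PromptVersion(next(self._ids), parent_id, template)
        self.versions.append(version)
        return version


history = PromptHistory()
root = history.update(None, "Summarize: {{text}}")
history.published_id = root.id

# The same update issued twice yields two sibling versions.
first = history.update(root.id, "Summarize briefly: {{text}}")
second = history.update(root.id, "Summarize briefly: {{text}}")
```

After the duplicate call, first and second have different ids and the same parent, and published_id still points at root, which mirrors the "one extra unpublished version" outcome.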


Built by Homemade Software. Disagree with any of this? support@completionkit.com.
