
MCP server for prompt evals: setting up Claude Code and Cursor

To let a coding agent drive evals against your prompts, point its MCP client at your CompletionKit organization's MCP endpoint. The endpoint lives at https://completionkit.com/orgs/your-org-slug/mcp, authenticates with a Bearer token you create under Settings then API tokens, and exposes a toolbox the agent can call from inside Claude Code or Cursor: list and revise prompts, kick off runs, read per-row scores and rationales, and replay the judge against existing outputs.

Let your coding agent improve your prompts: the automated eval loop covers why you would want this and what the agent does once connected. This post is the wiring.

Where MCP came from

The Model Context Protocol was announced by Anthropic on November 25, 2024, built by Anthropic engineers David Soria Parra and Justin Spahr-Summers, and open-sourced alongside SDKs and a handful of reference servers (Google Drive, Slack, GitHub, Postgres, Puppeteer). The point of the protocol is that an MCP-speaking client (Claude Code, Cursor, others) and an MCP-speaking server can work together without either side having been built with the other in mind. Before MCP, hooking an agent into an eval tool was a per-agent, per-tool custom integration. After MCP, the agent fetches a tool list from the server on connect and uses it. CompletionKit's MCP server is one of those servers.

Connecting Claude Code

Claude Code can talk to remote MCP servers over streamable HTTP. One command registers the connection.

  1. Create a token. In CompletionKit, go to Settings then API tokens, click New token, name it after the machine you're connecting (so it is revokable per device), and copy the token now. You see it once.
  2. Register the server. Run, in your terminal:
    claude mcp add completion-kit \
      --transport http \
      https://completionkit.com/orgs/your-org-slug/mcp \
      --header "Authorization: Bearer YOUR_TOKEN"
    
    Swap your-org-slug for the slug in your dashboard URL (the part after /orgs/) and YOUR_TOKEN for the one you just copied.
  3. Confirm the connection. In a Claude Code session, ask the agent to list its MCP tools, or run /mcp. You should see completion-kit listed and a tool count in the dozens. If the count is zero, the connection is wrong: see troubleshooting below.

Connecting Cursor

Cursor reads MCP servers from a JSON config file. The config can live globally at ~/.cursor/mcp.json or per project at .cursor/mcp.json in the repo root. Prefer per project, since the token in it is scoped to that organization.

  1. Create a token. Same as for Claude Code: Settings then API tokens in CompletionKit, copy the value.
  2. Write the config. Create .cursor/mcp.json at the root of the project (or edit the global file) and add:
    {
      "mcpServers": {
        "completion-kit": {
          "url": "https://completionkit.com/orgs/your-org-slug/mcp",
          "headers": {
            "Authorization": "Bearer YOUR_TOKEN"
          }
        }
      }
    }
    
    Swap the slug and the token as above. If you already have other MCP servers in the file, add completion-kit as another entry inside mcpServers.
  3. Reload Cursor. Open Settings then MCP; you should see completion-kit in the list with a green status and a tool count. Click it to expand the tools and confirm its tools are listed.
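If you would rather script the edit than hand-merge JSON, the step above can be sketched in Python. This is a minimal illustration, not a CompletionKit tool: the function name add_completion_kit and the other-server entry are made up for the example, and it only builds the merged structure; writing it back to .cursor/mcp.json is left to you.

```python
import json
from copy import deepcopy


def add_completion_kit(config: dict, org_slug: str, token: str) -> dict:
    """Return a copy of an mcp.json config with a completion-kit entry added."""
    merged = deepcopy(config)
    # setdefault keeps any servers that are already configured.
    servers = merged.setdefault("mcpServers", {})
    servers["completion-kit"] = {
        "url": f"https://completionkit.com/orgs/{org_slug}/mcp",
        "headers": {"Authorization": f"Bearer {token}"},
    }
    return merged


# Example: an existing config with one hypothetical server already present.
existing = {"mcpServers": {"other-server": {"url": "https://example.com/mcp"}}}
updated = add_completion_kit(existing, "your-org-slug", "YOUR_TOKEN")
print(json.dumps(updated, indent=2))
```

The function copies the input rather than mutating it, so a failed or abandoned edit never leaves a half-written config in memory.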

For the deeper protocol details (headers, the initialize handshake, error codes), see the API reference. It also renders these snippets pre-filled with your slug and a real token, plus the full REST equivalents.

What the toolbox covers

You never type these names yourself. The agent fetches the full, current list over tools/list the moment it connects, and the API reference documents every tool with its arguments. What is worth having up front is the shape of it, the eleven resource families the agent can read and write.

Resource (tool prefix): what the agent can do
Prompts (prompts_*): List, read, create, and revise prompts. A change to a prompt that already has runs creates a new draft version, promoted to current with prompts_publish; a prompt with no runs is updated in place. prompts_suggest_improvement drafts a better template from a run's scores and judge feedback.
Runs (runs_*): List, read, create, and delete runs, then generate responses and score them. A run can be judge-only, scoring outputs you already have.
Responses (responses_*): Read each row's output with its score and the judge's rationale.
Datasets (datasets_*): List, read, update, and delete datasets, and create them from CSV, inline or fetched from a URL with datasets_create_from_url.
Metrics (metrics_*): List, read, create, update, and delete metrics, each a 1–5 rubric. metrics_suggest_variants drafts rubric rewrites from recent disagreements.
Metric versions (metric_versions_*): List a metric's versions, publish or revert to one, and dismiss a draft.
Metric groups (metric_groups_*): Bundle metrics so they apply to a run together.
Provider credentials (provider_credentials_*): Store and manage API keys for OpenAI, Anthropic, Ollama, and OpenRouter.
Tags (tags_*): Create, rename, and apply tags across resources.
Agreements (agreements_*): List and record human verdicts (agree, disagree, borderline) on the judge's scores.
Judges (judges_*): Replay the judge on existing outputs and compare two metric versions' agreement.
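To make the "shape" concrete, here is a rough Python sketch of how a client might bucket the names in a tools/list result into the families listed above. The result layout (a "tools" array of objects with a "name") follows the MCP specification; the sample data is hand-written and trimmed to tool names this post mentions, and group_tools is a hypothetical helper, not part of any SDK.

```python
from collections import defaultdict

# The eleven family prefixes described in this section.
FAMILY_PREFIXES = [
    "prompts_", "runs_", "responses_", "datasets_", "metrics_",
    "metric_versions_", "metric_groups_", "provider_credentials_",
    "tags_", "agreements_", "judges_",
]


def group_tools(tools_list_result: dict) -> dict:
    """Group tool names from a tools/list result by resource family prefix."""
    families = defaultdict(list)
    for tool in tools_list_result.get("tools", []):
        name = tool["name"]
        # Fall back to "other" for any name outside the known families.
        prefix = next((p for p in FAMILY_PREFIXES if name.startswith(p)), "other")
        families[prefix].append(name)
    return dict(families)


# A trimmed, hand-written stand-in for a tools/list result; a real server
# returns more tools, each with a description and an inputSchema.
sample_result = {
    "tools": [
        {"name": "prompts_list"},
        {"name": "prompts_publish"},
        {"name": "runs_list"},
        {"name": "runs_get"},
        {"name": "responses_list"},
        {"name": "datasets_create_from_url"},
        {"name": "metrics_suggest_variants"},
    ]
}
grouped = group_tools(sample_result)
for prefix, names in grouped.items():
    print(prefix, names)
```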

A first task to try

Paste the following brief into Claude Code or Cursor once the connection is up. It exercises runs_list, runs_get, and responses_list end to end: when the agent reports back you'll know the auth works, the tools are visible, and the data comes through with per-row detail. It writes nothing, so it is safe to run before you trust the agent with revisions.

Use the completion-kit MCP server to find my most recent run. Show me its name, its average score per metric (computed from the per-row responses), and the three lowest-scoring rows with the judge's rationale for each. Do not create or modify anything.
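For a sense of what the brief asks the agent to compute, here is a small Python sketch of the aggregation. The row layout is an assumption made for illustration: the field names row, metric, score, and rationale and the metric names are hypothetical, and the real payload returned by responses_list may look different.

```python
from collections import defaultdict
from statistics import mean


def summarize(rows: list[dict]) -> tuple[dict, list[dict]]:
    """Return (average score per metric, three lowest-scoring rows)."""
    by_metric = defaultdict(list)
    for row in rows:
        by_metric[row["metric"]].append(row["score"])
    averages = {metric: mean(scores) for metric, scores in by_metric.items()}
    # sorted() is stable, so ties keep their original order.
    lowest = sorted(rows, key=lambda r: r["score"])[:3]
    return averages, lowest


# Hypothetical per-row responses, scored on the 1-5 rubric scale.
rows = [
    {"row": 1, "metric": "accuracy", "score": 4, "rationale": "Mostly correct."},
    {"row": 2, "metric": "accuracy", "score": 2, "rationale": "Wrong date."},
    {"row": 3, "metric": "tone", "score": 5, "rationale": "Friendly and clear."},
    {"row": 4, "metric": "tone", "score": 1, "rationale": "Dismissive."},
    {"row": 5, "metric": "accuracy", "score": 3, "rationale": "Partly right."},
]
averages, lowest = summarize(rows)
print(averages)
for r in lowest:
    print(r["row"], r["score"], r["rationale"])
```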

When this works, swap the brief for one that iterates. The automated eval loop has an example two-sentence brief that hands the loop to the agent and tells it when to stop. For background on what the scores mean and why they are trustworthy, see LLM-as-a-judge; for what a run and a metric actually are, see what is prompt evaluation.

Troubleshooting

Three mistakes account for almost every first-connection failure, most common first:

  1. The Bearer token is wrong or revoked. Symptom: the agent reports a 401, or Claude Code shows the server as failed in /mcp output, or Cursor's MCP panel shows it as red. Fix: regenerate the token under Settings then API tokens, paste the new value into the config, restart the agent. Tokens never appear in URLs and are shown only once, so a copy-paste slip here is the most common failure.
  2. The endpoint URL is wrong. Symptom: the connection succeeds but the server returns a JSON-RPC error or a 404. Almost always the slug between /orgs/ and /mcp is wrong, or someone wrote https://completionkit.com/mcp with no organization in it. There is no global endpoint; every endpoint is scoped to an organization. Fix: visit your dashboard, copy the slug straight out of the URL, paste it into the config.
  3. The tool list is empty. Symptom: the server is listed as connected but the agent says it has no tools, or /mcp shows a tool count of zero. Almost always the client hasn't completed the MCP initialize handshake yet (older clients did it lazily). Fix: restart the agent, or in Claude Code start a new session; the handshake fires on session start and the tool list populates immediately afterwards.

If something fails outside these three, the API reference page renders curl examples for the same endpoints; running one by hand confirms whether the auth and URL are good, independent of the agent.
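The same by-hand check can be sketched in Python using only the standard library. This is an illustration under assumptions: the JSON-RPC initialize body and the Accept header follow the MCP streamable HTTP specification, the protocolVersion value is a guess you may need to change, and both function names are hypothetical. The block only builds the request; nothing is sent.

```python
import json
import urllib.request


def build_initialize_request(org_slug: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an MCP initialize request for the org endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed; match what your client speaks
            "capabilities": {},
            "clientInfo": {"name": "manual-check", "version": "0.1"},
        },
    }
    return urllib.request.Request(
        f"https://completionkit.com/orgs/{org_slug}/mcp",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


def diagnose(status: int) -> str:
    """Map an HTTP status to the likely cause from the list above."""
    if status == 401:
        return "token wrong or revoked"
    if status == 404:
        return "endpoint URL wrong (check the org slug)"
    if 200 <= status < 300:
        return "auth and URL look good"
    return "something else; compare against the API reference"


request = build_initialize_request("your-org-slug", "YOUR_TOKEN")
print(request.get_method(), request.full_url)
```

To actually run the check, pass the request to urllib.request.urlopen and catch urllib.error.HTTPError; its code attribute is the status to feed into diagnose.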

FAQ

Where do I get the MCP endpoint URL?

Every CompletionKit organization has its own MCP endpoint at https://completionkit.com/orgs/your-org-slug/mcp. The slug is the one in your dashboard URL after /orgs/. The API reference page in your organization renders the full URL pre-filled with your slug, alongside a Bearer token you can create under Settings then API tokens.

Do I need a paid plan to use the MCP server?

No. The MCP server is available on the free tier. The free plan has a monthly run quota, and an agent driving the loop spends from it like a human would, so a paid plan matters once the agent iterates often. The connection, the tool list, and read-only calls cost nothing.

Does the agent see my prompt history?

Only the parts you let it list. The tools the server exposes are scoped to your organization and require a Bearer token tied to it. An agent with the token can call prompts_list, runs_list, datasets_list, and so on, and read what they return. It cannot see anything outside the organization the token belongs to. Revoking the token under Settings then API tokens cuts the agent off immediately.

Does prompts_update overwrite my prompt or create a new version?

On a prompt that already has runs, prompts_update creates a new immutable version rather than editing in place, and that version starts as a draft. It does not become current until you promote it with prompts_publish, so an agent's edits get a review gate instead of shipping straight to production. The full history is kept, and a prompt with no runs yet is updated in place.
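The rule is easy to model. The toy Python class below mirrors the behaviour described above; it is not CompletionKit's implementation, and the names Prompt, update, publish, and has_runs are invented for the sketch. It also assumes "has runs" is tracked per prompt, which the text does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Prompt:
    """Toy in-memory model of the versioning rules; not CompletionKit's code."""
    versions: list = field(default_factory=list)  # templates, oldest first
    current: int = 0                              # index of the published version
    has_runs: bool = False

    def update(self, template: str) -> int:
        """Apply an edit and return the index of the version that holds it."""
        if not self.has_runs:
            # No runs yet: edit the current version in place.
            self.versions[self.current] = template
            return self.current
        # Runs exist: leave history untouched and append a draft.
        self.versions.append(template)
        return len(self.versions) - 1

    def publish(self, index: int) -> None:
        """Promote a draft to current, the role prompts_publish plays."""
        self.current = index


prompt = Prompt(versions=["Summarize: {{text}}"])
prompt.update("Summarize briefly: {{text}}")         # no runs: edited in place
prompt.has_runs = True
draft = prompt.update("Summarize in one sentence: {{text}}")  # new draft
still_current = prompt.versions[prompt.current]      # old version stays live
prompt.publish(draft)                                # the review gate
```

Until publish is called, the draft exists in the history but the earlier version stays current, which is the review gate the answer describes.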


Built by Homemade Software. Disagree with any of this? support@completionkit.com.
