May 25, 2026 · The CompletionKit team

Let your coding agent improve your prompts: the automated eval loop

Automated prompt optimization is the loop you used to run by hand, run by an agent instead: draft a prompt, run it on a dataset, read the scores and the failing rows, revise, re-run, on and on until the score clears a bar you set. You bring the bar (the metrics, the rubric, the score that counts as good enough); the agent brings the iteration. It is hill climbing on a prompt-scored landscape, and the agent is the one taking the steps.

It works because the loop has always been mechanical. You stare at low-scoring rows, you guess at the change, you re-run, you check whether the average moved. None of those steps require taste once the bar is written down. The taste lives in the rubric. Everything past the rubric is a search problem, and search problems are exactly what an agent in a tool-using loop can grind through faster than a person can.

What changed in the last year is the connector. A coding agent in Claude Code or Cursor can now call the same tools your humans do: create a prompt revision, kick off a run, read the scores and per-row rationales, propose a new revision, repeat. The agent does not need to understand your product; it needs to understand the score it is climbing and the moves available to it. CompletionKit exposes both as an MCP server. You point the agent at the server, name the bar, and walk away.

Where the idea came from

Two threads. The first is older than modern AI. The second is barely eighteen months old.

The older thread is hill climbing. It is among the oldest local-search ideas in artificial intelligence, and Stuart Russell and Peter Norvig give it a section of its own in "Artificial Intelligence: A Modern Approach" (chapter 4, "Search in Complex Environments"). The shape is the one your gut already runs when you tune a prompt: stand at the current solution, look at the neighbours, take a step toward the better one, repeat until no neighbour improves. The classic warning attached is just as old: hill climbing finds a local maximum, not necessarily the global one, which is why the standard textbook remedy is to restart from different starting points. A coding agent iterating on a prompt is hill climbing on a prompt-shaped landscape, with the eval score as the height. The same warnings apply, and the same fixes (start from more than one draft, accept temporary score drops to escape a plateau) port over directly.
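For readers who have not seen it written down, here is a minimal sketch of greedy hill climbing with restarts on a toy one-dimensional landscape. Everything in it (the HEIGHTS list, the function names, the starting points) is made up for illustration; a prompt landscape has no such tidy neighbourhood, so treat this as the textbook shape rather than anything an agent literally runs.

```python
from typing import Callable, Iterable, List, Tuple


def hill_climb(start: int,
               score: Callable[[int], float],
               neighbours: Callable[[int], List[int]]) -> Tuple[int, float]:
    """Greedy ascent: step to the best neighbour until no neighbour improves."""
    current, current_score = start, score(start)
    while True:
        best = max(neighbours(current), key=score)
        if score(best) <= current_score:
            return current, current_score  # stuck on a (possibly local) maximum
        current, current_score = best, score(best)


def with_restarts(starts: Iterable[int],
                  score: Callable[[int], float],
                  neighbours: Callable[[int], List[int]]) -> Tuple[int, float]:
    """Climb from several starting points and keep the highest summit found."""
    return max((hill_climb(s, score, neighbours) for s in starts),
               key=lambda result: result[1])


# Hypothetical landscape: a small peak at index 2, a taller one at index 8.
HEIGHTS = [0, 1, 3, 1, 0, 2, 4, 6, 9, 5]


def toy_score(x: int) -> float:
    return HEIGHTS[x]


def toy_neighbours(x: int) -> List[int]:
    return [n for n in (x - 1, x + 1) if 0 <= n < len(HEIGHTS)]


single = hill_climb(1, toy_score, toy_neighbours)              # stops on the small peak
restarted = with_restarts([1, 6], toy_score, toy_neighbours)   # finds the taller one
print(single, restarted)
```

A single climb from index 1 stalls on the small peak; adding a second start reaches the taller one. In prompt terms, the second start is a second draft.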

The specifically prompt-shaped version of this idea has a name and a date. DSPy, the "programming, not prompting" framework out of Stanford NLP, started in February 2022, shipped as DSP in December 2022, and was renamed DSPy in October 2023. Its central move was to take prompt optimization out of the human's hands: you describe what the program should do declaratively, hand DSPy a metric and a small dataset, and an optimizer searches the space of prompts that score well against that metric. The lineage runs straight from there to the agent loop. What DSPy did with a Python compiler and a fixed set of optimizer algorithms, a coding agent in 2026 does with a chat loop and a tool-using context window: same search, different driver.

The newer thread is MCP. The Model Context Protocol was announced by Anthropic on November 25, 2024, built by Anthropic engineers David Soria Parra and Justin Spahr-Summers, and open-sourced alongside SDKs and a handful of reference servers (Google Drive, Slack, GitHub, Postgres, Puppeteer). What MCP did was give agents a standard way to call into systems they did not previously know about. Before MCP, "let the agent drive my eval tool" required a custom integration per agent and per tool. After MCP, an MCP-speaking agent (Claude Code, Cursor, others) can drive any MCP-speaking server without either side knowing about the other in advance. The eval loop, the kind of mechanical search problem an agent should be doing, finally had a connector that fit.

The shape of the loop

Strip the loop down and four moves do the work. The agent runs them in a circle until one of two stop conditions fires.

Revise the prompt. The agent reads the latest version, looks at the rows that scored lowest in the most recent run, and proposes a new version of the prompt. The proposal is grounded in the per-row rationales the judge wrote, not in generic "be more specific" advice. The new version becomes a new revision in the prompt store, with a parent pointer back to the one it descended from, so the history is traceable later.
Run it on your data. The agent kicks off a run of the new revision over the dataset. The dataset is held fixed (it is the measuring stick), so every revision is scored on the same inputs.
Read the scores. The judge scores each row against the rubric. The agent receives both the average and the per-row breakdown: which rows dropped, which rows climbed, where the rationales point.
Inspect the misses. The agent reads the failing rows in detail (the input, the output, and the rationale), groups them where it can, and uses that grouping as the input to the next revision.

The loop stops when one of two things happens. Either the score clears the bar you set, in which case the agent stops and hands you the revision to review. Or the score plateaus across several iterations, in which case the agent stops and reports what it tried, so you can decide whether to restart from a different draft or to change the rubric.
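The four moves and the two stop conditions fit in a few lines of control flow. The sketch below is an assumption-heavy illustration, not CompletionKit's implementation: run and revise are hypothetical callables standing in for whatever the agent does through its tools, and patience, min_gain and max_passes are names invented here. It is also the strict greedy variant, which always revises from the best prompt seen so far.

```python
from typing import Callable, List, NamedTuple, Optional


class RunResult(NamedTuple):
    average: float             # mean judge score across the fixed dataset
    failing_rows: List[dict]   # low-scoring rows with their rationales


class LoopOutcome(NamedTuple):
    reason: str                # "cleared_bar", "plateau" or "budget_exhausted"
    prompt: str
    score: float
    history: List[float]       # best score seen after each pass


def stop_reason(history: List[float], target: float,
                patience: int, min_gain: float) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if history[-1] >= target:
        return "cleared_bar"
    if len(history) > patience and history[-1] - history[-1 - patience] < min_gain:
        return "plateau"       # best score barely moved over the last `patience` passes
    return None


def optimize(prompt: str,
             run: Callable[[str], RunResult],
             revise: Callable[[str, List[dict]], str],
             target: float, patience: int = 3,
             min_gain: float = 0.1, max_passes: int = 20) -> LoopOutcome:
    best_prompt, best = prompt, run(prompt)
    history = [best.average]
    passes = 0
    while True:
        reason = stop_reason(history, target, patience, min_gain)
        if reason is None and passes >= max_passes:
            reason = "budget_exhausted"                    # safety net only
        if reason is not None:
            return LoopOutcome(reason, best_prompt, best.average, history)
        candidate = revise(best_prompt, best.failing_rows)  # revise from the misses
        result = run(candidate)                             # run it and read the scores
        if result.average > best.average:
            best_prompt, best = candidate, result
        history.append(best.average)
        passes += 1


# Toy stand-ins so the sketch runs end to end (both are hypothetical).
RULES = ["Mention the order number.", "Give a delivery date."]


def toy_run(prompt: str) -> RunResult:
    score = 3.0
    if "order number" in prompt:
        score += 1.0
    if "delivery date" in prompt:
        score += 0.5
    misses = [] if score >= 4.5 else [{"rationale": "missing detail"}]
    return RunResult(score, misses)


def toy_revise(prompt: str, failing_rows: List[dict]) -> str:
    for rule in RULES:
        if rule not in prompt:
            return prompt + " " + rule
    return prompt              # nothing left to try


cleared = optimize("Write a friendly reply.", toy_run, toy_revise, target=4.5)
stuck = optimize("Write a friendly reply.", toy_run, toy_revise, target=5.0)
print(cleared.reason, cleared.history)
print(stuck.reason, stuck.history)
```

With a target of 4.5 the toy loop clears the bar in two passes. With a target of 5.0 it runs out of useful revisions and stops on the plateau, which is the point where a person decides between a fresh draft and a different rubric.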

It is worth being precise about what "you set the bar" means here. The bar is the part of the loop the agent cannot do for you. Metrics (what to score on), a rubric (what each point on the 1–5 scale means), and a target score (the number a revision has to clear) are all your judgment calls. They encode what good means for your product, and the agent cannot deduce that from a number alone. Once those three are written down, the rest is search, and search is what the agent is for. The earlier post on LLM-as-a-judge covers how to write a rubric the loop can actually climb, and the post on how to build a prompt eval dataset covers how to assemble the inputs the loop runs on.

Figure: four moves in a circle. The agent rides the loop; you wrote the rubric and the target score that decide when it stops.

What the agent is actually doing at each step

The temptation is to treat the agent as a black box that "improves the prompt". It isn't a black box. It is a coding agent with a tool-using context window, and it can only see what you give it through the MCP server. The quality of the loop comes from giving it the right surface area.

Read the per-row rationales, not just the average. The single most useful thing the agent can pull through MCP is the judge's sentence-of-why for every low-scoring row. The average tells the agent that something is wrong; the rationales tell it what. A loop that only sees the average gropes; a loop that sees the rationales reasons.
Cluster failures before revising. If five rows dropped and the rationales all say "the model refused to commit to a delivery date", that is one revision, not five. A good agent groups before it proposes. A bad agent writes a one-off patch for every failing row and ends up with a prompt that is a pile of special cases rather than a coherent instruction.
Make small steps. A single revision should change one thing. Big revisions are hard to interpret when the score moves: the agent (and you) lose the ability to say which edit caused the move. This is the same discipline that makes prompt regression testing useful, applied at the iteration scale rather than the release scale.
Stop on plateau, not on time. A fixed iteration budget ("run ten passes and pick the best") wastes work on a prompt that converged at pass three. A plateau condition ("stop if the score hasn't improved by 0.1 in the last three passes") spends the budget where it matters.
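The clustering habit is easy to picture in code. The sketch below groups failing rows by a normalized copy of the judge's rationale; it is a deliberately crude stand-in, since a real agent groups by meaning rather than by string equality. The row shape (id, score, rationale), the threshold and the sample rows are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_failures(rows: List[dict], threshold: int = 3) -> List[Tuple[str, List[int]]]:
    """Group rows scoring at or below `threshold` by normalized rationale.

    Returns (rationale, row ids) pairs, largest cluster first.
    """
    clusters: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        if row["score"] <= threshold:
            key = row["rationale"].strip().lower().rstrip(".")
            clusters[key].append(row["id"])
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical rows: five misses with one shared cause, one unrelated miss, one pass.
rows = [
    {"id": i, "score": 2,
     "rationale": "The model refused to commit to a delivery date."}
    for i in range(1, 6)
] + [
    {"id": 6, "score": 1, "rationale": "Reply never referenced the order."},
    {"id": 7, "score": 5, "rationale": "Clear and specific."},
]

clusters = cluster_failures(rows)
for rationale, ids in clusters:
    print(f"{len(ids)} rows: {rationale} -> {ids}")
```

Six failing rows collapse into two clusters, so the next revision has two candidate edits to weigh instead of six patches to write.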

The earlier post on prompt versioning covers why each of these revisions has to become an immutable, addressable artifact rather than an in-place edit; the short version is that the loop is only auditable if every step in it left a named revision behind. Without that, "the agent improved the prompt" is a sentence with nothing to point at.
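As a rough picture of what "immutable, addressable, with a parent pointer" can look like, here is a small in-memory sketch. It says nothing about how CompletionKit actually stores revisions: the Revision and RevisionStore names, the r1/r2 id scheme and the lineage helper are all invented for illustration.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Revision:
    """One immutable prompt version; edits create a new Revision instead."""
    id: str
    text: str
    parent_id: Optional[str] = None


class RevisionStore:
    def __init__(self) -> None:
        self._revisions: Dict[str, Revision] = {}

    def add(self, text: str, parent_id: Optional[str] = None) -> Revision:
        if parent_id is not None and parent_id not in self._revisions:
            raise KeyError(f"unknown parent revision: {parent_id}")
        revision = Revision(f"r{len(self._revisions) + 1}", text, parent_id)
        self._revisions[revision.id] = revision
        return revision

    def lineage(self, revision_id: str) -> List[str]:
        """Walk parent pointers from a revision back to the original draft."""
        chain: List[str] = []
        current: Optional[str] = revision_id
        while current is not None:
            chain.append(current)
            current = self._revisions[current].parent_id
        return chain


store = RevisionStore()
draft = store.add("Write a short, friendly reply to the customer's message.")
revised = store.add(draft.text + " Acknowledge their specific order.",
                    parent_id=draft.id)
history = store.lineage(revised.id)

try:
    draft.text = "edited in place"   # frozen dataclass: this raises
    mutated = True
except FrozenInstanceError:
    mutated = False

print(history, mutated)
```

The lineage walk is what makes "the agent improved the prompt" a sentence you can point at: every step is a named revision with a known parent.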

Setting it up with Claude Code or Cursor

Concretely, in CompletionKit:

Create the prompt and the dataset. A starting draft of the prompt, a CSV of real inputs (production logs, support tickets, hand-curated edge cases), and the metrics and rubric. This is the bar. For the wider context on assembling the dataset, see the post on what prompt evaluation actually is.
Connect the MCP server. Each organization in CompletionKit gets an MCP endpoint. Drop the endpoint into Claude Code's or Cursor's MCP configuration and the agent now has a set of tools: create a prompt revision, start a run, read a run's scores and per-row rationales, list datasets and metrics, publish a revision. The setup guide walks through both clients, token and all.
Tell the agent what to do. A two-sentence brief is usually enough: "Improve the support-reply prompt so that the acknowledges-order-number metric scores at least 4.5 on dataset v3. Stop when you clear the bar or when the score plateaus across three iterations." The agent picks up the loop from there.
Review the result. When the agent stops, it hands back the latest revision, the score it cleared, and the diff against the draft you started with. You review the diff, decide whether to publish, and roll it out. The agent did the work; the judgment is still yours.

For the protocol details (server URL, tool list, auth), the docs walk through it step by step.

One pass of the loop, with real numbers

Here's what a single pass looks like in practice, from a run in our own workspace. The starting prompt was the generic kind everyone writes first: "You are a support agent for an online store. Write a short, friendly reply to the customer's message." Against an eight-row dataset of customer messages, each containing an order number, it scored 3.88 on an acknowledges-order-number metric, and the judge's rationale on the worst row said exactly why: the reply never referenced the order at all.

The revision changed one line. Adding "Acknowledge their specific order." took the next run to 5.0 on the same dataset with the same judge, and the reply on the formerly 1-star row now opens by quoting order #58219. One sentence, derived from reading the rationale on a failing row rather than from squinting at fifty outputs. That is the entire mechanic this post is about, at its smallest possible size.

(Real runs, June 12, 2026. Generation by gpt-4.1-mini, judged by claude-sonnet-4-6.)

Where it goes wrong, and how to keep it useful

Three failure modes are worth naming because they show up in every agent-driven optimization loop, prompts or otherwise.

Reward hacking the judge. The agent is climbing a score, and the score is the judge's. If the rubric is sloppy (vague language, no anchored points), the agent will find revisions that score well by gaming the rubric rather than improving the output. The fix is the same as for any LLM-as-a-judge setup: anchor the scale, calibrate against humans once, and treat the rubric as a versioned artifact you keep stable across runs. A loose rubric is not just noisy; it is a thing the agent will optimize.
Overfitting to the dataset. The dataset is also a measuring stick the agent can game, by writing rules that handle exactly the rows in front of it and nothing else. The portfolio approach from the eval-dataset post helps here: a dataset that overrepresents adversarial and out-of-distribution rows makes overfitting expensive, because a revision that handles only the easy rows scores badly on the hard ones.
Compounding instructions. Run a loop long enough and the prompt grows: each iteration bolts on a rule for a row that failed last pass, until the prompt is a list of corner cases and the model is spending its attention on bookkeeping. A useful counter is to ask the agent, every few iterations, to simplify rather than extend. Sometimes the right next revision is shorter than the current one, not longer.

None of these are reasons to avoid the loop. They are reasons to set it up the way you would set up any optimization process: with a measuring stick you trust, restarts when you hit a plateau, and an occasional pruning pass.

You don't always need to generate

A worthwhile shortcut: if you already have outputs (production logs, hand-curated examples, results from a previous model), the agent does not have to generate anything to start scoring. A judge-only run, covered in its own post, scores any column in the dataset against your rubric with no generation step. That cuts the loop in half for the first pass: the agent sees scored rows immediately, proposes a revision, and only then does the loop start generating. It is the fastest way to put a number on a prompt you inherited and didn't write, and the cheapest way to give the agent a real starting point instead of a cold one.
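In code terms, a judge-only first pass is just the scoring half of a run applied to a column that already exists. The sketch below assumes a hypothetical judge callable and made-up rows and column names; the toy judge is a keyword check so the example runs, where a real judge would be a model scoring against your rubric.

```python
from statistics import mean
from typing import Callable, List, Tuple


def judge_only_run(rows: List[dict], output_column: str,
                   judge: Callable[[str, str], int]) -> Tuple[float, List[dict]]:
    """Score an existing output column; nothing is generated."""
    scored = [{**row, "score": judge(row["input"], row[output_column])}
              for row in rows]
    return mean(row["score"] for row in scored), scored


def toy_judge(customer_message: str, reply: str) -> int:
    # Hypothetical stand-in for an LLM judge: 5 if the reply mentions the order, else 1.
    return 5 if "order" in reply.lower() else 1


# Hypothetical logged outputs, e.g. replies pulled from production.
logged = [
    {"input": "Where is order 1001?", "logged_reply": "Your order 1001 ships today."},
    {"input": "Order 1002 arrived damaged.", "logged_reply": "Sorry to hear that!"},
]

average, scored_rows = judge_only_run(logged, "logged_reply", toy_judge)
print(average, [row["score"] for row in scored_rows])
```

The scored rows come back in the same shape a generated run would produce, so the agent can go straight to reading the misses.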

The point

The loop is not new. People have been hill-climbing prompts for as long as there have been prompts. What is new is that the loop now has a connector (MCP), the agent doing the climbing now has the tools to run it (Claude Code, Cursor), and the artifact at the end of the loop is a named, versioned revision you can publish without a redeploy. Spend your time on the rubric. Let the agent spend its time on the search.

FAQ

What is automated prompt optimization?

The practice of letting a tool, usually a coding agent in a loop, iterate on a prompt by running it against a fixed dataset, reading the judge's scores and rationales, revising, and re-running, until the score clears a bar you set. You define the metrics and the target score; the tool does the iteration. The shape is hill climbing on a prompt-scored landscape.

How is this different from DSPy?

DSPy is the same idea with a Python compiler and a fixed set of optimizer algorithms: you describe the program declaratively, hand DSPy a metric and a dataset, and it searches over prompts. A coding agent in Claude Code or Cursor does the same search through a chat loop and a tool-using context window. The driver and the surface area are different; the underlying activity (search a prompt landscape against a metric) is the same.

Why MCP instead of a Python SDK?

Because the agent already speaks MCP. The Model Context Protocol, open-sourced by Anthropic in November 2024, lets an MCP-speaking agent drive any MCP-speaking server without either side knowing about the other in advance. Before MCP, hooking an agent into an eval tool meant a per-agent, per-tool custom integration. After MCP, you point the agent at the server, and it has the tool list it needs.

Does the agent ever stop on its own?

Yes. The two natural stop conditions are score-cleared-the-bar (the agent hands you the winning revision) and score-plateaued-across-the-last-several-iterations (the agent reports what it tried, and you decide whether to restart from a different draft or to adjust the rubric). A fixed iteration budget can wrap both as a safety net, but the plateau condition is what makes the spend efficient.


Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
