Agent prompt testing: test the prompts inside your agents

The tooling for testing AI agents is arriving fast. Inngest's AgentKit plugs into a dev server that replays an agent run step by step, so you can watch every tool call and state change. Anthropic released Bloom and Petri, tools that audit a model for concerning behaviors across hundreds of generated scenarios. All of these are real, all are useful, and all point at the agent from the outside.

But when an agent misfires in production, the cause is almost never the orchestration, and it is rarely some latent flaw in the base model. It is a prompt. The system prompt drifted when someone added an instruction. The tool-selection prompt is ambiguous about which tool to reach for. A sub-agent's grading rubric is looser than anyone realised. The loop did its job; the prompt it was running was wrong.

Agent prompt testing is scoring the prompts inside an agent against a fixed dataset of example inputs, the same discipline you would bring to any prompt. You lift one prompt out of the agent, gather inputs that represent what it actually sees, spell out what good looks like as a rubric or a handful of deterministic checks, and let the run score every row. Out comes a number you can set beside the last version's, so a one-line tweak to the system prompt stops being a hunch and starts being a measurement.
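As a rough illustration of that before-and-after comparison, here is a minimal sketch. Everything in it is hypothetical: the dataset rows, the `prompt_v1` and `prompt_v2` stubs (which stand in for real model calls with two versions of a prompt), and the exact-match check. It shows the shape of the measurement, not any particular tool's API.

```python
# Hypothetical dataset: each row pairs an input message with the timezone
# a human labelled as correct.
DATASET = [
    {"input": "Can we meet at 3pm London time?", "expected": "Europe/London"},
    {"input": "Let's do 10am Pacific.", "expected": "America/Los_Angeles"},
    {"input": "9am works, I'm in Berlin.", "expected": "Europe/Berlin"},
]


def prompt_v1(message: str) -> str:
    """Stub standing in for a model call using version 1 of the prompt."""
    keywords = [
        ("London", "Europe/London"),
        ("Pacific", "America/Los_Angeles"),
        ("Berlin", "Europe/Berlin"),
    ]
    for keyword, tz in keywords:
        if keyword in message:
            return tz
    return "UTC"


def prompt_v2(message: str) -> str:
    """Stub for version 2: an edit that silently broke the London case."""
    if "Berlin" in message:
        return "Europe/Berlin"
    return "America/Los_Angeles"


def score_version(prompt_fn, dataset) -> dict:
    """Run one prompt version over every row and record pass/fail per row."""
    rows = [prompt_fn(row["input"]) == row["expected"] for row in dataset]
    return {"score": sum(rows) / len(rows), "rows": rows}


def regressions(before: dict, after: dict) -> list[int]:
    """Indices of rows that passed before the edit and fail after it."""
    pairs = zip(before["rows"], after["rows"])
    return [i for i, (b, a) in enumerate(pairs) if b and not a]


before = score_version(prompt_v1, DATASET)
after = score_version(prompt_v2, DATASET)
print(f"v1: {before['score']:.2f}  v2: {after['score']:.2f}")
print("rows that regressed:", regressions(before, after))
```

Keeping the per-row results alongside the aggregate is the useful part: the score says a change made things worse, and the row list says where.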

An agent is a stack of prompts

Strip the framework away and an agent is a handful of prompts run in a loop. The reason-and-act pattern that ReAct described in 2022 is still the shape of it: the model reads context, decides on an action, sees the result, and goes again. Around that loop sit prompts doing specific jobs. A system prompt sets the agent's role and its rules. A routing prompt picks the next tool. An extraction prompt turns a messy email into structured fields. A summarization prompt compresses a long thread before the next step. Sub-agents carry their own prompts and their own rubrics.

Every one of those is a prompt with an input and an output, which means every one of them is testable the ordinary way. You do not need the whole agent spun up to check whether the extraction prompt handles a forwarded email with three timezones in it. You need that prompt, a dataset of awkward inputs, and a metric. The framework is orchestration; the prompt is the part that actually has to be correct.

What testing one of those prompts looks like

Take an agent that books meetings. Somewhere inside it is a prompt whose only job is to read an inbound message and pull out the details: who is invited, how long, and which timezone the sender means. That prompt is where the agent quietly gets things wrong, and it is a prompt you can test on its own.

Collect a few dozen real messages, the messier the better: the ones with two timezones, the ones that say "sometime next week", the reply that only makes sense with the quoted thread underneath. Point a run at that dataset and grade the output with a mix of checks and a judge. A deterministic check confirms the output is valid JSON with the required fields. A second check confirms the timezone is one of the allowed values. A rubric metric, scored by an LLM judge, asks the harder question: did it pick the timezone the sender actually meant, not just a plausible one? Now a change to that extraction prompt has a before and after, and the row where it started guessing California time for a sender in London is right there in the diff.
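The two deterministic checks described above can be sketched in a few lines of standard-library Python. The field names in `REQUIRED_FIELDS` and the values in `ALLOWED_TIMEZONES` are assumptions made for the example; substitute whatever schema your extraction prompt is supposed to return.

```python
import json

# Hypothetical schema for the extraction prompt's output.
REQUIRED_FIELDS = {"invitees", "duration_minutes", "timezone"}
# Hypothetical allow-list of timezone values the agent may emit.
ALLOWED_TIMEZONES = {"Europe/London", "America/Los_Angeles", "America/New_York"}


def parse_object(output: str):
    """Return the parsed JSON object, or None if output is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def check_valid_json(output: str) -> bool:
    """Pass when output is a JSON object containing every required field."""
    data = parse_object(output)
    return data is not None and REQUIRED_FIELDS <= data.keys()


def check_timezone_allowed(output: str) -> bool:
    """Pass when the timezone field is a string from the allow-list."""
    data = parse_object(output)
    if data is None:
        return False
    tz = data.get("timezone")
    return isinstance(tz, str) and tz in ALLOWED_TIMEZONES


sample = '{"invitees": ["a@example.com"], "duration_minutes": 30, "timezone": "Europe/London"}'
print(check_valid_json(sample), check_timezone_allowed(sample))
```

Checks like these are cheap and never disagree with themselves, which is why they run first. The judge is reserved for the question they cannot answer: whether the timezone is the right one, not merely a permitted one.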

You can also test the prompt on outputs the agent has already produced. If you have logs of what the extraction step returned in production, drop them into a column and score them with a judge-only run: no need to re-run the agent, just grade what it already did.
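A judge-only run over logged outputs might look roughly like this. The `stub_judge` function is a hard-coded placeholder so the sketch runs offline; in practice it would be a call to an LLM with the rubric, the input, and the logged output. The rubric wording, the row layout, and the function names are all assumptions for illustration.

```python
from typing import Callable

# Hypothetical rubric handed to the judge for every row.
RUBRIC = "Score 1 if the extracted timezone is the one the sender meant, otherwise 0."

# A judge takes (rubric, input message, logged output) and returns a score.
Judge = Callable[[str, str, str], int]


def stub_judge(rubric: str, message: str, output: str) -> int:
    """Placeholder for an LLM judge; hard-coded so the example is deterministic."""
    meant = "Europe/London" if "London" in message else "America/Los_Angeles"
    return 1 if meant in output else 0


def judge_only_run(logged_rows: list[dict], judge: Judge) -> tuple[float, list[dict]]:
    """Score outputs the agent already produced, without re-running the prompt."""
    scored = [
        {**row, "score": judge(RUBRIC, row["input"], row["output"])}
        for row in logged_rows
    ]
    mean = sum(r["score"] for r in scored) / len(scored) if scored else 0.0
    return mean, scored


# Hypothetical production logs: the input the step saw and what it returned.
LOGGED = [
    {"input": "3pm London time works", "output": '{"timezone": "Europe/London"}'},
    {"input": "I'm in London, 9am?", "output": '{"timezone": "America/Los_Angeles"}'},
]

mean, scored = judge_only_run(LOGGED, stub_judge)
print(f"mean score: {mean:.2f}")
for row in scored:
    print(row["score"], row["input"])
```

Passing the judge in as a plain callable keeps the run independent of any one model or vendor, so the same logged rows can be re-scored later with a different judge.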

This is not agent behavior evaluation, and it is not observability

Three different things get filed under "testing agents", and it pays to keep them apart, because they answer different questions and you probably want all three.

Agent behavior evaluation asks whether the agent took a good path: did it pick the right tools, in the right order, recover when one failed, and stay on task across turns? The unit here is a whole trajectory, not a single prompt and its output. This is the layer Inngest's AgentKit dev server lets you replay, tool call by tool call. It is real work, and it is hard, and it is not what agent prompt testing does.

Observability watches the live agent in production and reports what already happened: latency, cost, which runs errored, a trace of each one to debug. It is the rear-view mirror of the three, and you want one. What it cannot do is tell you, before anything ships, whether the prompt edit you made this morning is safe to release.

Agent prompt testing sits at the layer underneath the behavior and before production: the individual prompts and the outputs they produce, scored against a fixed dataset you control. It is the test drive to observability's rear-view mirror, and the component test to behavior evaluation's integration test. A well-behaved trajectory is built out of prompts that each do their job, and agent prompt testing is how you keep each one doing it.

The prompt is the portable part

There is a practical reason to start at the prompt layer: it is the piece that survives everything else. You will swap models when a better or a cheaper one lands. You will move off one agent framework and onto another, or drop the framework and orchestrate the loop yourself. Through all of it, the extraction prompt and its rubric are the same artifact, and the dataset you test them on is the same bar. A prompt test does not care which framework runs the agent, so it keeps paying off across the churn that agent tooling is going through right now.

It is also the layer you can start on today, without waiting for a full agent-testing platform to settle. If your prompts are already versioned and you already regression-test them, testing the ones inside an agent is not a new discipline. It is the same discipline, pointed at where the agent actually breaks. The docs walk through setting up a dataset, a run, and the checks.

FAQ

What is agent prompt testing?

Scoring the prompts inside an AI agent against a fixed dataset, the same way you would test any prompt. You pull a prompt out of the agent (the system prompt, a tool-selection prompt, an extraction prompt), run it over representative inputs, and grade the outputs with deterministic checks or an LLM judge, producing a score you can compare across versions.

How is agent prompt testing different from agent evaluation?

Agent evaluation, in the sense of behavior evaluation, judges the whole trajectory: did the agent pick the right tools, in the right order, and recover when one failed? Agent prompt testing judges one layer down, the individual prompts and the outputs they produce, against a fixed dataset you control. Behavior evaluation is the integration test; prompt testing is the component test. Most teams want both.

Can I test my agent's prompts without a dedicated agent-testing platform?

Yes. A prompt inside an agent is still a prompt with an input and an output, so you can test it in isolation: a dataset of real inputs, deterministic checks, and an LLM judge for the open-ended parts. You can also score the outputs the agent already produced in production with a judge-only run, without re-running the agent.

Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.