May 22, 2026 · The CompletionKit team

Prompt versioning: treating prompts like code

Prompt versioning is treating a prompt the way you treat source code: every change creates a new immutable revision with a stable identifier, a parent revision it descended from, and a readable diff. A revision is locked once published, so the next edit becomes a brand new revision rather than overwriting the previous one. The point is that every score, every rollout, every "the model started doing X on Tuesday" can be traced back to exactly which prompt was live, because the prompt that was live is still there, named, untouched, and addressable.

It exists because the alternative quietly loses every property you depend on for code. A prompt edited in a Notion page or hard-coded into a release branch has no identity of its own: there is no fixed name for "the version that was running on Tuesday at 4pm", no way to compare it line by line with what's running now, and no way to switch back without a new deploy. You don't really have a prompt; you have whatever the file currently says. Versioning closes the gap by giving each revision the three properties that make a software artifact worth the name: stable identity, diffability, and rollback.

Borrowed from version control

Source-code versioning predates the prompt by half a century, and the lineage is worth tracing, because the design moves that work for prompts are the same moves that worked for code.

The first modern revision control system was SCCS (Source Code Control System), built in 1972 by Marc Rochkind at Bell Labs. The original was written in SNOBOL4 on an IBM System/370; Rochkind rewrote it in C for Unix on the PDP-11 in 1973, and it shipped publicly as part of the Programmer's Workbench in February 1977. SCCS introduced the move every version control system has used since: store revisions as deltas (compact records of what changed), not as full copies of the file.

A decade later RCS (Revision Control System), released by Walter Tichy at Purdue in March 1982, refined the idea with reverse deltas: store the newest revision in full and the older ones as instructions for stepping backward. That is the right shape for the common case, since the version you read most often is the latest one. RCS was free and shipped on Unix where SCCS had not been freely available, and it made revision control something every developer (not just every Bell Labs employee) actually had on their machine.
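To make the reverse-delta idea concrete, here is a minimal Python sketch. It illustrates the idea only and is not RCS's actual storage format; the names reverse_delta and step_back are made up for this post, and the line matching comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def reverse_delta(newer: list, older: list) -> list:
    """Record how to rebuild `older` from `newer` (illustrative format)."""
    ops = []
    matcher = SequenceMatcher(a=newer, b=older, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared lines are referenced by position, not stored again.
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            # Lines that exist only in the older text must be stored.
            ops.append(("insert", older[j1:j2]))
        # Lines that exist only in the newer text need no record at all.
    return ops


def step_back(newer: list, delta: list) -> list:
    """Apply a reverse delta to the newer text to recover the older one."""
    older = []
    for op in delta:
        if op[0] == "copy":
            older.extend(newer[op[1]:op[2]])
        else:
            older.extend(op[1])
    return older


v1 = ["You are a support assistant.", "Answer briefly."]
v2 = ["You are a support assistant.", "Answer briefly.", "Cite sources."]

delta = reverse_delta(v2, v1)    # keep v2 in full, keep only the delta for v1
recovered = step_back(v2, delta)
```

The newest text is stored whole and read directly; recovering an older one costs one pass over the delta.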

Then in April 2005, Git arrived, written by Linus Torvalds in about ten days, after the Linux kernel project lost its free license to BitKeeper. The first commit landed on April 7, 2005, and Junio Hamano took over as maintainer that July. Git's contribution was content-addressable storage: every revision is named by a hash of its content, so the identifier is the artifact, and a name can never silently point at different bytes than it did before. That last property is what makes "the version that was running on Tuesday" a thing you can actually fetch.

A prompt version reuses every one of these moves: immutable revisions, a parent pointer to the revision it came from, a stable identifier, a diff against any ancestor. What changes is what the bytes are for. With source code, the bytes are run by a compiler. With a prompt, the bytes are read by a model, and "what changed" is a much trickier question than a line diff alone can answer. That is where evaluation enters: a score attached to a version is the part of "what this revision does" that a text diff cannot show.
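A minimal Python sketch of those moves follows. The Revision class and its id scheme (a truncated SHA-256 over the parent id plus the text, loosely modelled on how a git commit hash covers its parent) are assumptions made for illustration, not a description of how CompletionKit names revisions.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Revision:
    text: str
    parent_id: Optional[str]  # None for the first revision of a prompt

    @property
    def id(self) -> str:
        # Hypothetical scheme: the id is derived from the parent id and the
        # text, so the same text under the same parent always gets the same id.
        payload = f"{self.parent_id or ''}\n{self.text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


root = Revision("Summarize the ticket in two sentences.", None)
child = Revision("Summarize the ticket in two sentences. Stay neutral.", root.id)
```

Because the id is computed from the content, a name cannot quietly come to mean different bytes: changing the text or the parent produces a different id.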

The three properties

Strip versioning down and you find the same three properties for any artifact worth versioning, prompts included.

Stable identity per revision. Each change produces a new revision with its own permanent name. Once published, the bytes behind that name never change. If you want to edit it, you create a new revision descended from it. That immutability is what lets every other system (eval results, rollout logs, support tickets, incident reports) refer to a prompt without ambiguity. "Prompt v17" must mean exactly one thing forever.
Diff. Given any two revisions, you can see what changed line by line. Diffs are cheap because revisions are immutable: nothing is being compared against a moving target. For a prompt, a diff is the entry point to every review and every post-mortem ("what did we change between the version that worked and the version that doesn't?"), and it is what makes a prompt edit reviewable at all.
Rollback. Reverting is pointing the runtime at an earlier revision that already exists. The bytes don't move; only the choice of which revision is current does. Rolling back is a switch, not a recreation. This is the property people undervalue until the night they need it, and it's the one a "save the new text" workflow throws away most completely.
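The three properties fit in a few lines of Python. This is a toy in-memory sketch under stated assumptions: PromptStore and its methods are hypothetical names, integer ids stand in for whatever identifier a real store uses, and the diff comes from the standard difflib module.

```python
import difflib
from typing import Dict, Optional


class PromptStore:
    """Toy store: write-once revisions plus a movable 'published' pointer."""

    def __init__(self) -> None:
        self._texts: Dict[int, str] = {}              # id -> text, never rewritten
        self._parents: Dict[int, Optional[int]] = {}  # id -> parent id
        self.published: Optional[int] = None          # which revision is current

    def publish(self, text: str) -> int:
        """Stable identity: every publish creates a new id."""
        rev_id = len(self._texts) + 1
        self._texts[rev_id] = text
        self._parents[rev_id] = self.published
        self.published = rev_id
        return rev_id

    def get(self, rev_id: int) -> str:
        return self._texts[rev_id]

    def diff(self, old_id: int, new_id: int) -> str:
        """Diff: line-by-line comparison of any two revisions."""
        lines = difflib.unified_diff(
            self.get(old_id).splitlines(),
            self.get(new_id).splitlines(),
            fromfile=f"v{old_id}", tofile=f"v{new_id}", lineterm="",
        )
        return "\n".join(lines)

    def rollback(self, rev_id: int) -> None:
        """Rollback: move the pointer; no text is copied or rewritten."""
        if rev_id not in self._texts:
            raise KeyError(rev_id)
        self.published = rev_id


store = PromptStore()
first = store.publish("Be concise.")
second = store.publish("Be concise.\nCite sources.")
print(store.diff(first, second))
store.rollback(first)
```

Note that rollback touches only the pointer: the second revision is still stored and still addressable after the switch.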

If a system gives you all three, you have versioning. If it gives you only some, you have a save button with extra steps. Prompt regression testing depends on having all three: a regression is a behaviour that a previous revision had and a new revision lacks, and you can only demonstrate one when both revisions still exist by name.
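As a rough Python sketch of that dependency: a regression check is a comparison between the results recorded for two named revisions. The regressions function, the case names, and the pass/fail data below are all made up for illustration.

```python
from typing import Dict, List


def regressions(old_results: Dict[str, bool], new_results: Dict[str, bool]) -> List[str]:
    """Cases the old revision passed that the new revision does not pass.

    A case missing from the new results is treated as not passing.
    """
    return sorted(
        case
        for case, passed in old_results.items()
        if passed and not new_results.get(case, False)
    )


# Hypothetical eval results, keyed by revision name.
results_by_revision = {
    "v16": {"refund-policy": True, "greeting": True, "escalation": False},
    "v17": {"refund-policy": False, "greeting": True, "escalation": True},
}

lost = regressions(results_by_revision["v16"], results_by_revision["v17"])
```

The lookup by revision name is the whole point: if v16 had been overwritten by v17, there would be nothing on the left-hand side to compare against.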

Every published revision is immutable. Editing the prompt creates a new revision; rolling back just moves the PUBLISHED pointer to an earlier one. The bytes never change.

What the prompt-shaped version adds

Prompts are not source code, and a few details about how they're used end up shaping the version.

First, a prompt's behaviour is not a property of its text alone. The same string can score 4.2 against one dataset and 2.8 against another, or 4.2 yesterday and 3.9 today against the same dataset because the model behind it shifted. So a useful version of a prompt carries more than the string: the model id it was tested against, the eval runs that scored it, the rubric those runs used. The diff between two prompts is text; the difference between two prompt versions is text plus the evidence each one accumulated.
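One way to picture "text plus evidence" is a record like the following Python sketch. The field names, model id, dataset names, and rubric name are hypothetical; the two scores are the ones used as examples above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class EvalRun:
    model_id: str   # which model produced the outputs that were scored
    dataset: str    # which inputs the prompt was run against
    rubric: str     # how the outputs were judged
    score: float


@dataclass
class PromptVersion:
    revision_id: str
    text: str
    runs: List[EvalRun] = field(default_factory=list)  # evidence accumulates

    def scores_for(self, dataset: str) -> List[float]:
        return [run.score for run in self.runs if run.dataset == dataset]


version = PromptVersion("rev-17", "Summarize the ticket in two sentences.")
version.runs.append(EvalRun("model-a", "billing-tickets", "helpfulness-1to5", 4.2))
version.runs.append(EvalRun("model-a", "legal-tickets", "helpfulness-1to5", 2.8))
```

The text never changes after publishing, but the list of runs keeps growing, which is why the same string can carry different scores without being a different revision.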

Second, the "compile then run" model from code doesn't quite fit. Code is compiled once and copied to every machine that runs it; a prompt is fetched at request time by whichever client needs it. That means every published revision has to be reachable forever (or at least as long as anything might want to read it), not just the latest one. Old revisions are not historical curiosities; they are still load-bearing for any request whose configuration pinned to them.

Third, draft state matters more than it does for code. With code, the version that hasn't shipped yet lives on a branch, untouched by production. With a prompt, drafts are the working state you score eval runs against, dozens of times in an afternoon, while looking for the change that improves the metric. The lifecycle is draft (mutable, scored repeatedly), then publish (snapshot into an immutable revision), then, later, possibly a new draft under the same prompt id that is published as the next revision. Drafts are for iteration. Published revisions are for the runtime.
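The lifecycle can be sketched in Python as one mutable field and one append-only list. The Prompt class and its method names are assumptions for illustration, not CompletionKit's data model.

```python
from typing import List


class Prompt:
    """One prompt id, one mutable draft, many immutable published snapshots."""

    def __init__(self, prompt_id: str, draft: str = "") -> None:
        self.prompt_id = prompt_id
        self.draft = draft                  # edited freely while iterating
        self._published: List[str] = []     # append-only; entries never change

    def edit(self, text: str) -> None:
        self.draft = text                   # overwriting a draft is fine

    def publish(self) -> int:
        # Snapshot the current draft; returns a 1-based revision number.
        self._published.append(self.draft)
        return len(self._published)

    def revision(self, number: int) -> str:
        if not 1 <= number <= len(self._published):
            raise KeyError(number)
        return self._published[number - 1]


prompt = Prompt("support-reply")
prompt.edit("Reply politely.")
prompt.edit("Reply politely and briefly.")   # drafts can change many times
first = prompt.publish()
prompt.edit("Reply politely, briefly, and cite the policy.")
second = prompt.publish()
```

Editing the draft after publishing leaves the earlier snapshot as it was; only the next publish produces a new revision.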

The published-URL pattern

In CompletionKit, every published prompt revision has a URL. Your application fetches it, by id, at the moment it needs to render a request. That short sentence has three consequences.

The tested artifact and the served artifact are the same object. The string that scored 4.3 yesterday is the same string your app receives this afternoon, because both are loaded from the same revision. There is no "we exported the prompt into a release branch and then someone edited it" gap.
Rollback is a switch, not a deploy. Pointing your app at a different revision id is a config change; it does not require shipping new code. If the wrong version ended up live, you can put the right one back in seconds, from the same place you published it.
Old revisions stay addressable. A request that pinned to revision 12 last quarter can still resolve to revision 12 today, which means a one-off escalation does not require anyone to remember which prompt was running back then.

The docs walk through publishing a prompt revision, pointing your app at it, and the rollout pattern (pin to an explicit revision in production, follow the latest published revision in staging) that most teams settle on.
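The pin-in-production, follow-latest-in-staging pattern might look like this on the application side. This Python sketch is not CompletionKit's client API: the resolve function, the PINS mapping, and the in-memory dictionary standing in for the published revisions are all hypothetical, and "latest" is assumed to mean the highest revision number.

```python
from typing import Dict, Optional, Tuple

# Stand-in for the published revisions of one prompt (hypothetical data).
PUBLISHED: Dict[int, str] = {
    12: "Summarize the ticket.",
    13: "Summarize the ticket in two sentences.",
}

# Per-environment config: an explicit revision, or None to follow the latest.
PINS: Dict[str, Optional[int]] = {"production": 12, "staging": None}


def resolve(published: Dict[int, str], pinned: Optional[int]) -> Tuple[int, str]:
    """Return (revision id, text) for a pin, or for the latest if unpinned."""
    rev_id = pinned if pinned is not None else max(published)
    return rev_id, published[rev_id]   # KeyError if the pin does not exist


prod_id, prod_text = resolve(PUBLISHED, PINS["production"])
staging_id, staging_text = resolve(PUBLISHED, PINS["staging"])
```

Under this arrangement, rolling production back or forward is an edit to PINS, not a code change.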

Why a git history alone isn't enough

The objection comes up immediately: "my prompts are already in git". Putting a prompt in git is good, and it gives you part one (immutable history) and part two (diffs against any past commit). It does not, on its own, give you the third part: the runtime resolution of which revision is live right now and who flipped that switch.

A git history records that the file changed at commit abc123 and changed again at def456. It does not record that the deployed app fetched the abc123 string between Monday and Thursday and the def456 string from Thursday onward, that someone rolled back to abc123 on Friday morning at 09:42, or that the eval run scoring 4.3 was run against the exact bytes of abc123 (not "the version of the prompt file at the time"). All of that is operational state that lives next to the artifact, not inside its source file.

The fix isn't to abandon git; it's to put the prompt somewhere whose job is to be the source of truth for which revision is live, and to make publishing an explicit step. Then git records the source history, the prompt store records the deployment history, and the two are connected by the immutable revision id.
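The deployment history can be as simple as an append-only log of pointer changes. The Python sketch below is one possible shape, with hypothetical class names, actors, and dates; the revision ids reuse abc123 and def456 from the example above.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class PointerChange:
    at: datetime
    revision_id: str
    actor: str        # who flipped the switch


class DeploymentLog:
    """Append-only record of which revision was made live, when, and by whom."""

    def __init__(self) -> None:
        self._events: List[PointerChange] = []

    def record(self, at: datetime, revision_id: str, actor: str) -> None:
        if self._events and at < self._events[-1].at:
            raise ValueError("events must be recorded in time order")
        self._events.append(PointerChange(at, revision_id, actor))

    def live_at(self, when: datetime) -> Optional[PointerChange]:
        """The most recent pointer change at or before `when`, if any."""
        times = [event.at for event in self._events]
        index = bisect_right(times, when)
        return self._events[index - 1] if index else None


log = DeploymentLog()
log.record(datetime(2026, 5, 18, 10, 0), "abc123", "alice")   # first publish
log.record(datetime(2026, 5, 21, 15, 30), "def456", "bob")    # second publish
log.record(datetime(2026, 5, 22, 9, 42), "abc123", "alice")   # rollback

answer = log.live_at(datetime(2026, 5, 20, 16, 0))
```

A question like "which prompt was live at 4pm that day, and who put it there" becomes a lookup, and the revision id it returns is the same id that names the text and its eval runs.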

FAQ

What is prompt versioning?

Treating each change to a prompt as a new immutable revision with a stable identifier, a parent revision it descended from, and a readable diff. Once published, a revision is locked: edits create a new revision instead of overwriting the previous one, so every score, run, and rollout can be traced back to a specific revision.

Why not just put my prompts in git?

A git history is one half of versioning, the file half. The other half is what your production code resolves at runtime: which revision is currently live, who flipped the switch, when, and whether the same revision was the one you tested. Putting a prompt in git records the text changes. It does not record which version is published, how to roll back without redeploying, or which scored eval run corresponds to which deployed string.

How is rollback different from editing the prompt back to its old text?

Rollback points the runtime at an earlier revision that already exists, by id; the bytes don't change, only which revision is current does. Editing the text back recreates a similar string as a brand new revision, with a new id and no history of being the one that worked before. The first is a switch you can flip in seconds; the second is a new artifact you have to test from scratch.
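The difference shows up as soon as scores are attached to revision ids. In this Python sketch the Store class, the prompt texts, and the scores are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Store:
    texts: Dict[int, str] = field(default_factory=dict)
    scores: Dict[int, List[float]] = field(default_factory=dict)
    live: int = 0   # 0 means nothing has been published yet

    def publish(self, text: str) -> int:
        rev_id = len(self.texts) + 1
        self.texts[rev_id] = text
        self.scores[rev_id] = []      # a new revision starts with no evidence
        self.live = rev_id
        return rev_id

    def rollback(self, rev_id: int) -> None:
        if rev_id not in self.texts:
            raise KeyError(rev_id)
        self.live = rev_id            # only the pointer moves


store = Store()
good = store.publish("Answer in one paragraph.")
store.scores[good].append(4.3)        # made-up eval score
bad = store.publish("Answer in bullet points.")

# Option 1: rollback. The live revision is the one that was tested.
store.rollback(good)
live_after_rollback = store.live

# Option 2: retype the old text. Same string, new id, no scores yet.
retyped = store.publish(store.texts[good])
```

After the rollback the live revision still carries its 4.3; the retyped copy has identical text and an empty score list.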

What is a draft version versus a published version?

A draft is a mutable working revision you can edit and re-test as many times as you want without affecting production. A published revision is immutable: it has a permanent identifier, it cannot be edited in place, and editing creates a new revision descended from it. Drafts are for iteration; published revisions are for the runtime to fetch.

Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
