
CI/CD for LLM applications

CI/CD for an LLM application means gating every change that touches the model the same way you gate a code change: re-run a scored evaluation on the proposed version, compare per-row scores to the previous baseline, and fail the build on any drop you didn't sign off on. The artifact under test is a prompt revision (or a model bump, or a rubric edit), the test suite is your eval dataset, and the green signal is the per-row diff coming back without a row you didn't expect to move. Continuous delivery is the second half: a passing gate publishes the new prompt revision, and your runtime fetches it by id, the same way it already fetches a feature flag.

This piece is the wiring. For the concept underneath (why a fixed dataset, what a per-row diff catches that an average doesn't) start with prompt regression testing; for the publish half, see prompt versioning.

From code CI to prompt CI

Continuous integration as a named practice goes back to Grady Booch's 1991 book "Object Oriented Design with Applications", which first put the term in print. Kent Beck folded the practice into Extreme Programming at the end of the decade, and Martin Fowler's writing in the early 2000s made it the default way teams thought about merging. The "CD" half got its own argument in Jez Humble and David Farley's 2010 book "Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation", which made the case that a release should be a button you press, not a weekend you take. Keeping the pipeline definition as code next to the application code is younger. GitHub Actions became generally available on November 13, 2019, and within a couple of years it was where CI lived for many of the Ruby, Python, and TypeScript codebases LLM apps are built on. Nothing in this post is specific to Actions: GitLab CI, Buildkite, CircleCI, Jenkins, or anything else that can run a JSON-speaking step and return an exit code runs the same loop.

What changes when the artifact under test is a prompt is not the shape of CI. It is what counts as the test passing. A unit test is a boolean. A prompt run is a scored distribution across a fixed dataset, and the gate has to decide pass or fail from the diff between this run's scores and the last run's. The mechanics carry over; the predicate at the end is what's new.
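
To make that predicate concrete, here is a small sketch in Python. The scores are made-up illustration values on an assumed 1-to-5 scale, not output from any real run; the point is only that the mean can stay flat while one row drops by two points, which is why the gate reads the per-row diff and not just the average.

```python
from statistics import mean

# Illustrative scores only: one metric, five dataset rows, assumed 1-5 scale.
baseline = [4.0, 5.0, 4.0, 3.0, 4.0]
proposed = [5.0, 5.0, 5.0, 1.0, 4.0]

# Per-row delta is proposed minus baseline, so a regression is negative.
deltas = [new - old for old, new in zip(baseline, proposed)]

# The averages are identical, so an average-only check sees nothing.
mean_delta = mean(proposed) - mean(baseline)

# The per-row view finds the row that broke.
worst_row = min(range(len(deltas)), key=lambda i: deltas[i])

print(f"mean delta: {mean_delta:+.2f}")
print(f"worst row: index {worst_row}, delta {deltas[worst_row]:+.1f}")
# mean delta: +0.00
# worst row: index 3, delta -2.0
```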

What the gate should cover

A useful CI gate covers more than the prompt template, because anything that meets the model on the way in can change the behaviour on the way out. The list to keep behind a gate is the same list a regression suite watches:

  1. The prompt template. Any edit to the user-facing instructions or to the variable interpolation.
  2. The model snapshot. Bumping from one model id to another, even within the same family. "Drop-in" upgrades almost never are.
  3. The system prompt and any preamble. Often lives outside the prompt-of-record file and still shapes the answer.
  4. Tool definitions and schemas. A new parameter, a renamed field, a tightened enum.
  5. Sampling parameters. Temperature, top-p, max tokens. Invisible in a code diff, visible in the scores.
  6. The judge rubric. The measuring stick is part of the test; a rubric edit invalidates a like-for-like comparison until you rebaseline.

If your CI fires on a change to any of these, the regression suite is doing its job. If it fires only on prompt-template edits, the model bump that quietly shifts tone next week will reach production without a single failing check.
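
One way to make CI notice all six inputs, sketched in Python under stated assumptions: collect everything that meets the model into one config object and hash it. The config layout, field names, and model id below are hypothetical, chosen only to mirror the list above; they are not a CompletionKit format.

```python
import hashlib
import json


def gate_fingerprint(config: dict) -> str:
    """Hash everything that meets the model, so any change yields a new value."""
    # sort_keys makes the hash independent of key order in the dict.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical layout: one key per item in the list above.
baseline_config = {
    "prompt_template": "Summarise the ticket: {{ticket}}",
    "model": "example-model-2024-01",
    "system_prompt": "You are a support assistant.",
    "tools": [{"name": "lookup_order", "parameters": {"order_id": "string"}}],
    "sampling": {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512},
    "judge_rubric": "Score 1-5 for accuracy.",
}

# A temperature change never shows up in a prompt-file diff, but it does here.
proposed_config = {
    **baseline_config,
    "sampling": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 512},
}

needs_eval = gate_fingerprint(proposed_config) != gate_fingerprint(baseline_config)
# A rubric edit additionally means the old baseline is no longer comparable.
needs_rebaseline = proposed_config["judge_rubric"] != baseline_config["judge_rubric"]

print("run the eval gate" if needs_eval else "nothing behind the gate changed")
print("rebaseline first" if needs_rebaseline else "baseline still comparable")
```

Storing the fingerprint next to the baseline run id is one simple way to tell, later, whether a comparison is still like-for-like.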

The gate in five steps

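The pipeline, regardless of CI vendor, is the same five-step loop. The eval surface in CompletionKit is the REST API: each of the first four steps below is a single HTTP call (repeated, in the case of polling), and the fifth is a local decision on the response.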

  1. Create the proposed run. POST to /api/v1/runs with prompt_id, dataset_id, and the metric_ids you want scored. For a prompt edit, point at the new draft revision's id; for a model-only change, point at the same prompt id and override judge_model or temperature on the run.
  2. Kick it off. POST to /api/v1/runs/:id/generate. The response is 202 Accepted and the work happens in the background.
  3. Poll until done. GET /api/v1/runs/:id on a slow loop and watch status. It moves from pending to running and ends at completed or failed. The same payload carries avg_score and a progress block, so a build log can tail the percentage as it runs.
  4. Read the diff against the baseline. GET /api/v1/runs/BASELINE_RUN_ID/compare?with=NEW_RUN_ID. The response is {rows: [...], metric_ids: [...]} with one entry per input case, the score on each side, and a delta per metric, computed as the new run's score minus the baseline's, so a regression reads as a negative delta. Cases the baseline has that the new run does not come back with the new side nulled out.
  5. Decide pass or fail. Exit non-zero if any row's per-metric delta crosses your tolerance, or if a metric's average dropped by more than a smoothing band. The exit code is what your CI provider reads to mark the check green or red.
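
The five steps above can be sketched in Python with only the standard library. This is a minimal sketch, not a client library: the endpoint paths come from the steps above, while the response fields it reads (id, status, and rows[].per_metric[] entries carrying metric_id and delta) are assumed to match what the jq filter in the workflow later in this post reads. The function names, the injected http callable, and the default tolerances are illustrative.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, List, Optional

# (method, path, body) -> parsed JSON response, or None for an empty body
Http = Callable[[str, str, Optional[dict]], Any]


def make_http(base_url: str, token: str) -> Http:
    """Build a minimal JSON-over-HTTP caller on top of urllib."""
    def call(method: str, path: str, body: Optional[dict] = None) -> Any:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        req = urllib.request.Request(
            base_url + path,
            data=data,
            method=method,
            headers={"Authorization": f"Bearer {token}",
                     "Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:  # raises HTTPError on 4xx/5xx
            raw = resp.read()
        return json.loads(raw) if raw else None
    return call


def find_regressions(diff: dict, row_tol: float, avg_tol: float) -> dict:
    """Step 5: entries past row_tol, and metrics whose mean delta is past avg_tol."""
    # Entries with a null delta (case missing on one side) are skipped.
    scored = [m for row in diff["rows"] for m in row["per_metric"]
              if m["delta"] is not None]
    by_metric: Dict[Any, List[float]] = {}
    for m in scored:
        by_metric.setdefault(m["metric_id"], []).append(m["delta"])
    means = {mid: sum(ds) / len(ds) for mid, ds in by_metric.items()}
    return {
        "rows": [m for m in scored if m["delta"] < row_tol],
        "avgs": [{"metric_id": mid, "mean_delta": md}
                 for mid, md in means.items() if md < avg_tol],
    }


def run_gate(http: Http, prompt_id: int, dataset_id: int, metric_ids: List[int],
             baseline_run_id: int, row_tol: float = -1.0, avg_tol: float = -0.2,
             poll_seconds: float = 10.0, max_polls: int = 60) -> dict:
    # 1. Create the proposed run.
    run = http("POST", "/runs", {"prompt_id": prompt_id,
                                 "dataset_id": dataset_id,
                                 "metric_ids": metric_ids})
    run_id = run["id"]
    # 2. Kick it off; the work happens in the background.
    http("POST", f"/runs/{run_id}/generate", None)
    # 3. Poll until the run completes or fails.
    for _ in range(max_polls):
        status = http("GET", f"/runs/{run_id}", None)["status"]
        if status == "completed":
            break
        if status == "failed":
            raise RuntimeError(f"run {run_id} failed")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"run {run_id} did not finish in time")
    # 4. Read the diff against the baseline.
    diff = http("GET", f"/runs/{baseline_run_id}/compare?with={run_id}", None)
    # 5. Decide: both lists empty means the gate is green.
    return find_regressions(diff, row_tol, avg_tol)
```

In a CI step, a caller would build the transport with make_http(base_url, token), call run_gate, and exit non-zero if either returned list is non-empty. Passing the HTTP callable in keeps the loop testable without a network.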

The same surface works in reverse for the dataset half. If a new edge case showed up in production overnight, append it to the dataset, capture a fresh baseline of the live prompt under the new conditions, and let the next CI run gate every future change against that updated bar.

A worked GitHub Actions job

Here is a complete job that drives the gate end to end. It uses curl and jq so it doesn't pull in a client library; the org slug, token, and the prompt and dataset ids live in repository secrets and variables. The baseline run id is the id of the last green run for the live prompt; read it from /api/v1/runs?prompt_id=ID&status=completed&limit=1 or pin it to a known good run.

name: prompt-eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - ".github/workflows/prompt-eval.yml"

jobs:
  gate:
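    runs-on: ubuntu-latest
    # Explicit bash makes Actions run each step with -eo pipefail, so a failed
    # curl or jq inside a pipe fails the step instead of passing along an
    # empty result that the gate would read as "no regressions".
    defaults:
      run:
        shell: bash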
    env:
      BASE: https://completionkit.com/orgs/${{ vars.CK_ORG_SLUG }}/api/v1
      AUTH: "Authorization: Bearer ${{ secrets.CK_API_TOKEN }}"
      PROMPT_ID: ${{ vars.CK_PROMPT_ID }}
      DATASET_ID: ${{ vars.CK_DATASET_ID }}
      METRIC_IDS: "[1, 2, 3]"
      BASELINE_RUN_ID: ${{ vars.CK_BASELINE_RUN_ID }}
      ROW_TOLERANCE: "-1.0"
      AVG_TOLERANCE: "-0.2"
    steps:
      - name: Create the run
        id: create
        run: |
          run_id=$(curl -fsS -X POST "$BASE/runs" \
            -H "$AUTH" -H "Content-Type: application/json" \
            -d "{\"prompt_id\": $PROMPT_ID, \"dataset_id\": $DATASET_ID, \"metric_ids\": $METRIC_IDS}" \
            | jq -r '.id')
          echo "run_id=$run_id" >> "$GITHUB_OUTPUT"

      - name: Start generation
        run: curl -fsS -X POST "$BASE/runs/${{ steps.create.outputs.run_id }}/generate" -H "$AUTH"

      - name: Wait for the run to finish
        run: |
          for i in $(seq 1 60); do
            status=$(curl -fsS "$BASE/runs/${{ steps.create.outputs.run_id }}" -H "$AUTH" | jq -r '.status')
            echo "status=$status"
            case "$status" in
              completed) exit 0 ;;
              failed)    echo "Run failed"; exit 1 ;;
            esac
            sleep 10
          done
          echo "Timed out waiting for the run"; exit 1

      - name: Gate on the per-row diff
        run: |
          curl -fsS "$BASE/runs/$BASELINE_RUN_ID/compare?with=${{ steps.create.outputs.run_id }}" \
            -H "$AUTH" > diff.json

          jq --argjson rt "$ROW_TOLERANCE" --argjson at "$AVG_TOLERANCE" '
            (.rows | map(.per_metric[] | select(.delta != null and .delta < $rt))) as $rows
            | (.rows
                | map(.per_metric[] | select(.delta != null))
                | group_by(.metric_id)
                | map({metric_id: .[0].metric_id, mean_delta: (map(.delta) | add / length)})
                | map(select(.mean_delta < $at))) as $avgs
            | { rows: $rows, avgs: $avgs }
          ' diff.json | tee regressions.json

          if [ "$(jq '(.rows | length) + (.avgs | length)' regressions.json)" -gt 0 ]; then
            echo "Prompt regression detected"; exit 1
          fi

Two tolerances, two failure modes. ROW_TOLERANCE at -1.0 fails the build as soon as any single row drops by more than a full point on any metric (the filter is a strict delta < -1.0, so a drop of exactly one point still passes); that is what catches the fix-that-breaks-the-neighbour case where the average barely moves and a customer-facing row quietly broke. AVG_TOLERANCE at -0.2 fails it when a metric's mean drops by more than two-tenths of a point across the dataset; that is what catches the slow drift you wouldn't see one row at a time. Both numbers are starting points, not laws. Tune them to your dataset's noise floor by running the same prompt twice and reading the natural delta. The half of the story about why a judge varies, and how to set it up so the number is trustworthy, is in LLM-as-a-judge.
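
Reading the natural delta can be sketched as follows. This is a rough heuristic under stated assumptions, not a statistical method: it takes the per-row scores from two runs of the same prompt on the same rows, measures how far they disagree, and widens that by a safety margin. The function name, the margin of 2.0, and the sample scores are all illustrative.

```python
from statistics import mean
from typing import Dict, Sequence


def suggest_tolerances(run_a: Sequence[float], run_b: Sequence[float],
                       margin: float = 2.0) -> Dict[str, float]:
    """Estimate judge noise from two runs of the same prompt on the same rows."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must cover the same non-empty set of rows")
    deltas = [b - a for a, b in zip(run_a, run_b)]
    row_noise = max(abs(d) for d in deltas)  # largest single-row swing
    avg_noise = abs(mean(deltas))            # shift in the dataset mean
    return {
        "row_noise": row_noise,
        "avg_noise": avg_noise,
        # Tolerances are negative: the gate fails when a delta falls below them.
        "row_tolerance": -(row_noise * margin),
        "avg_tolerance": -(avg_noise * margin),
    }


# Made-up scores for one metric: same prompt, same eight rows, scored twice.
first = [4.0, 5.0, 4.0, 3.0, 5.0, 4.0, 4.0, 5.0]
second = [4.0, 4.0, 4.0, 3.0, 5.0, 4.0, 4.0, 5.0]

print(suggest_tolerances(first, second))
```

A single pair of runs is a thin sample, so it is worth repeating the comparison a few times and keeping the widest result before settling on numbers.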

Publishing the prompt: the delivery half

Once the gate is green, the deploy is a publish. CompletionKit's versioning is built around an immutable revision id and a runtime that fetches a prompt by that id at request time. The CD step is a single POST to /api/v1/prompts/:id/publish, fired after the merge, with the prompt id of the draft you scored.

Two patterns are worth telling apart. Latest-follower staging fetches whichever revision is currently published, so a publish in CI is immediately live there for QA. Pinned production fetches an explicit revision id from configuration, so a publish is staged and does not go live until the runtime pin is bumped. Most teams run both: the follower in staging, the pin in production, with the production pin moving on a separate, deliberate step. Rollback is the same move in reverse: flip the pin to an earlier published id, no redeploy required.
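
The difference between the two patterns comes down to one lookup at request time. The sketch below is a hypothetical illustration in Python: RuntimeConfig, resolve_revision, and the integer revision ids are assumptions made for the example, not part of CompletionKit's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical runtime settings; the field names are illustrative."""
    environment: str
    pinned_revision_id: Optional[int] = None  # None means "follow latest"


def resolve_revision(config: RuntimeConfig, latest_published_id: int) -> int:
    """Pick the revision to serve: the pin when one is set, else the latest."""
    if config.pinned_revision_id is not None:
        return config.pinned_revision_id
    return latest_published_id


staging = RuntimeConfig("staging")                          # latest-follower
production = RuntimeConfig("production", pinned_revision_id=17)

# CI publishes revision 18: staging follows, production stays on its pin.
latest = 18
print(resolve_revision(staging, latest))     # 18
print(resolve_revision(production, latest))  # 17

# Promotion and rollback are the same operation: change the pin.
promoted = RuntimeConfig("production", pinned_revision_id=18)
rolled_back = RuntimeConfig("production", pinned_revision_id=17)
print(resolve_revision(promoted, latest))     # 18
print(resolve_revision(rolled_back, latest))  # 17
```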

If you would rather hand iteration to a coding agent than write the YAML, the MCP server wraps the same surface as a tool list Claude Code or Cursor can drive. The agent route is best for the proposing half, sketching a candidate revision and scoring it; the CI gate is what keeps an agent's edits, or a human's edits, from reaching production unscored.

Where this breaks, and what to do

Three things bite the first time you run the gate against a real prompt at real scale.

  1. Token cost on every PR. A 50-row dataset scored by a frontier judge on every push is noticeable on the bill, especially if you also fire on draft commits. Two cheap mitigations: use path filters so the eval only fires when a prompt-bearing file changes, and run a smaller "smoke" dataset of the highest-yield 10 rows on push, saving the full dataset for the merge to main.
  2. Flakiness from judge variance. Two identical runs of the same prompt against the same dataset can land at 4.41 and 4.38 on average, because the judge is itself a model. Individual rows move more than the average does, so a row-level tolerance of -0.2 would fail half your green builds. Set the row gate at a full point or more, set the average gate above the dataset's natural noise floor, and accept the small loss of sensitivity in exchange for builds that mean something.
  3. An aging baseline. When the underlying model is upgraded, or the rubric is rewritten, the previous baseline is no longer a like-for-like comparison and a regression call against it is meaningless. Run the gate against a freshly captured baseline of the current prompt under the new conditions before pulling the trigger on the change you actually wanted to ship. POST /api/v1/runs/:id/rerun is the short path to that fresh baseline.

None of these are reasons to skip the gate. They are reasons to set it up deliberately, with a dataset that earns its keep, a tolerance that matches the judge's noise, and a baseline that gets re-captured every time the underlying conditions shift.

FAQ

What does CI/CD mean for an LLM application?

Treating every change that touches the model (a prompt edit, a model snapshot bump, a rubric rewrite, a tool-schema tweak) the way you treat a code change. Run a scored evaluation against a fixed dataset on the proposed version, compare per-row scores to the previous baseline, fail the build on any drop you didn't sign off on, and only publish the new prompt revision once that gate is green.

How do I fail a GitHub Actions build on a prompt regression?

Have the workflow POST to /api/v1/runs with the proposed prompt id, the dataset id, and the metric ids, kick it off with POST /api/v1/runs/:id/generate, poll /api/v1/runs/:id until the status is completed, then read /api/v1/runs/BASELINE_ID/compare?with=NEW_RUN_ID and exit non-zero if any per-metric delta falls below your tolerance (delta is the new run's score minus the baseline's, so a regression is negative). The compare endpoint returns one row per input case with a delta per metric, so a small jq script is enough to decide pass or fail.

What threshold should I use for the regression gate?

Two gates, not one. Set a small tolerance on the per-metric average (so a tiny noise blip doesn't fail a green build) and a separate, wider row-level gate that fails on any row whose score drops by more than a point. The average catches systematic drift; the row gate catches the fix-that-breaks-the-neighbour case the average hides.

Does this work for model upgrades, not just prompt edits?

Yes. The gate is on the meeting of the prompt and the model, not the prompt text. A model snapshot bump, a temperature change, a tool-schema edit, or a system-prompt revision all go through the same CI run against the same dataset. Anything that shifts model behaviour belongs in the gate, because it can regress the same way.

Can I do this without GitHub Actions?

Yes. The five-step loop (create the run, kick off generation, poll for completion, read the compare endpoint, exit non-zero on a regression) is a handful of HTTP calls in any language. GitLab CI, Buildkite, CircleCI, Jenkins, or a make target on a developer machine all work the same way; the YAML in this post is convenience, not a requirement.


Built by Homemade Software. Disagree with any of this? support@completionkit.com or r/completionkit.
