Changelog

Notable changes to CompletionKit Cloud. The self-hosted engine has its own changelog in the completion-kit repository.

2026-06

  • Engine 0.16.1: starting a run over MCP no longer reports a false failure. On the hosted deployment, an MCP agent calling runs_generate could get an error back for a run that had in fact started, inviting a retry and a double-generate. The underlying route-warmup error is fixed and run-start UI broadcasts are now best-effort, so a cosmetic broadcast problem can never fail a started run.
  • Engine 0.16.0: validated prompt suggestions, a properly scoped judge, and no more OpenAI-by-default. A prompt-improvement suggestion now re-runs the candidate prompt against a held-out slice of the run's reviewed responses and shows a before/after scoreboard (improved, held, regressed) before you publish; a rewrite that scored net-negative asks for confirmation instead of applying in one click. The judge no longer reads the full prompt when scoring metrics outside its scope, so a prompt's banned-word list can't move an unrelated correctness score (output-quality scores may shift as that bias disappears). And the engine stopped assuming OpenAI: each organization's default judge model is now resolved from its own discovered models, so an Anthropic-only org gets a Claude judge for metric suggestions instead of a failing call to a hardcoded gpt-4.1, and the onboarding sample prompt picks a model you actually have.
  • Engine 0.15.1: Anthropic model discovery no longer false-negatives on thinking-by-default models. The capability probe read the first content block of the Anthropic response, but recent Claude models think by default and return a thinking block first, so the probe saw empty text and wrongly flagged the model as unable to generate or judge. It now scans for the text block past any leading thinking block; a response with no text block at all is still correctly treated as a failed probe.
  • Engine 0.15.0: provider edit-page Refresh works again, and MCP prompts_update drafts instead of auto-publishing. Hitting Refresh on a provider's edit page now shows the spinner and live progress; the polling guard used to bail when the completed models card had no discovery-status element, which was exactly the state you click Refresh from. Separately, prompts_update over MCP, on a prompt that already has runs, now creates the new version as a draft and leaves promotion to prompts_publish, so an agent's edits get a review gate instead of going straight to current (a prompt with no runs is still updated in place). The MCP blog FAQ was corrected to match.
  • Engine 0.14.0: create datasets from a URL or an uploaded CSV file. The REST POST /api/v1/datasets endpoint now also accepts a multipart file upload, not just an inline csv_data string. A new MCP tool datasets_create_from_url downloads a CSV from a public http(s) URL on the server (SSRF-checked, capped at 10MB) so a large dataset never has to be inlined into the tool-call arguments. Both routes land on the same authenticated, tenant-scoped dataset create path.
  • Engine 0.13.0: MCP tool judges_suggest renamed to metrics_suggest_variants, plus a new prompts_suggest_improvement tool. The metric-variant suggestion tool moved into the metrics namespace to sit beside the REST metrics#suggest_variants. There is no alias kept, so any MCP client that calls it by a hardcoded name needs updating; clients that discover tools through tools/list pick up the new name automatically. The new prompts_suggest_improvement tool suggests an improved prompt template grounded in a run's test results and judge feedback (the same engine behind the web "Suggest improvements"), returning the reasoning and rewritten template and persisting a suggestion. It takes a run_id and rejects judge-only runs that have no prompt.
  • Engine 0.12.4: the MCP server runs statelessly, so coding agents stop getting disconnected. Previously every MCP call after the first had to carry a live session id, and once a session hit its idle timeout the next request was rejected with a "Session not initialized" error, forcing the client to reconnect mid-task. The server now serves tool calls without requiring a prior initialize or a valid session id, per the Streamable HTTP transport spec. initialize still issues a session id for clients that use one, and DELETE still ends a session.
  • Engine 0.12.0 to 0.12.3: provider model discovery updates live again. Adding a provider used to leave the Providers page stuck on "Looking up models…" with no progress, and the Refresh button on the prompt and run forms did nothing, because the engine's live-update calls were pinned to its standalone mount path and 404'd under Cloud's per-tenant /orgs/:slug mount. The engine now drives discovery progress from a polled status endpoint and renders the poll and refresh URLs from route helpers, so they follow whatever mount Cloud uses. The model capability probe also stopped over-asking for output tokens, which had wrongly flagged Anthropic models like Haiku 4.5 and the Opus line as unable to generate, and Refresh now re-checks models that previously failed the probe.
  • Email verification takes a click now. Following the link in the verification email lands on a confirmation page with a button instead of verifying the instant the link is opened. Corporate mail-security scanners (Microsoft Safe Links, Proofpoint, Mimecast and the like) auto-fetch every link in inbound mail, which was silently marking addresses verified with no human involved. The state change now happens on the button's POST, which scanners don't submit, so a verified address means a person actually clicked through.
  • More member seats; seats are no longer an upgrade lever. Free now includes 3 members (was 1), Starter 10 (was 3), and Team is unlimited (was 10). The tiers differ on runs, prompt-fetches, history retention, SSO and audit log, and support, not on seats.
  • Engine 0.9.0 to 0.12.0: wider MCP surface, judge-suggest over MCP, calibration table renamed. The engine's MCP server now exposes the full toolset (datasets, metrics, prompts, responses, provider credentials, metric groups, tags) plus judges_suggest, which rewrites the judge instruction into draft metric-version variants targeted at recent disagreements, so a coding agent can drive the whole evaluate-improve loop, including drafting the next revision, over MCP. Reviews now require a metric_version. The completion_kit_calibrations table was renamed to completion_kit_agreements; the rename carried cloud's organization_id, foreign key, and RLS tenant_isolation policy across intact, and the seed sets each review's version via MetricVersion.ensure_current_for.
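
The Engine 0.15.1 entry above describes a capability probe that now scans past a leading thinking block to find the text block. A minimal Python sketch of that idea follows. It is an illustration only: the engine itself is a Rails codebase, the function names `first_text_block` and `probe_succeeded` are hypothetical, and the block shapes assume the `type: "thinking"` / `type: "text"` content blocks of an Anthropic Messages API response.

```python
from typing import Optional


def first_text_block(content: list) -> Optional[str]:
    """Return the text of the first block of type "text", or None if there is none.

    Hypothetical helper: it skips any leading non-text blocks (for example a
    "thinking" block) instead of assuming the text is in content[0].
    """
    for block in content:
        if block.get("type") == "text":
            return block.get("text", "")
    return None


def probe_succeeded(content: list) -> bool:
    """Treat the probe as passed only if a non-empty text block exists."""
    text = first_text_block(content)
    return text is not None and text.strip() != ""


# A thinking-by-default model returns a thinking block first.
thinking_first = [
    {"type": "thinking", "thinking": "Working through the request."},
    {"type": "text", "text": "pong"},
]
# A response with no text block at all should still fail the probe.
no_text = [{"type": "thinking", "thinking": "Only thoughts, no answer."}]

print(probe_succeeded(thinking_first))  # True
print(probe_succeeded(no_text))         # False
```

Reading only `content[0]` would have returned the thinking block in the first example, which is the false negative the entry describes.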

2026-05

  • Engine 0.9.0: calibration is measure + rewrite only; few-shot pinning removed. The engine drops the few_shot_examples column from completion_kit_metrics and removes the add_few_shot / remove_few_shot routes from both web and REST API namespaces. The calibration loop is now: review verdicts, watch the trust panel (Wilson interval + kappa), use the auto-suggest improvement to draft a rubric rewrite. Cases-to-learn-from feeds the rewrite, not a pinned-examples list. Plus a broad copy / UX polish pass on the metric show page (Versions sit above Calibration, suggestion banner moves to the edit form with icon buttons, publish modal clarifies draft-until-publish). Cloud-side scrubbed all marketing references to few-shot pinning from /docs Trust + Concepts, /glossary, and the why-we-built FAQ.
  • Engine 0.8.0: canonical REST API error shape + starter-pack page fix. Every engine REST API error response now uses one shape: a top-level error string, plus a details field map on 422 validation failures. Replaces three prior shapes ({error: "..."} for auth/business rules, {errors: ActiveModel::Errors} for validation, {errors: ["..."]} for run start/rerun failures). Cloud's existing Billing::QuotaResponse and EngineApiAuthentication already used the canonical shape, so no cloud-side changes were needed. Also fixes the /orgs/:slug/metrics zero-state UI (the engine added a real .ck-title--sm rule and switched .ck-starter-grid to five-across so the empty-state heading stops outshouting the page H1 and the fifth starter card no longer orphans).
  • Mailer: discard permanent SMTP failures. A bot signup with an SMTP-rejecting recipient hit Net::SMTPSyntaxError inside ActionMailer::MailDeliveryJob; default Solid Queue retries kept firing Honeybadger. Subclassed MailDeliveryJob to discard on Net::SMTPSyntaxError and Net::SMTPFatalError (any 5xx permanent failure), log the bouncing recipient address, and send exactly one Honeybadger event with the recipient in context.
  • Engine 0.5.44 → 0.7.0: metric versioning, regrade-only runs, REST/MCP parity. The internal "judge version" concept is now consistently named "metric version" everywhere (model class, table completion_kit_judge_versions → completion_kit_metric_versions, FKs, indexes), and 0.6.0 dropped the backward-compat aliases. Reviews now carry the metric_version_id they were judged under, so the response detail page and run show page surface which review rows are stale (judged by an older version), and the run gets a "Re-run with current judge" button. The improvement loop and retry_failures flow refuse to act on stale reviews. A new regrade-only run re-judges a run's existing responses against the current metric version without re-generating any model output. Reverting to an older metric version now writes a revert audit row. Cross-run comparison view shows side-by-side per-case score deltas with version chips. The REST API gains pagination, tag filter, and resource-specific filters across every index endpoint, a flat /api/v1/calibrations with filters + destroy, and MCP tools for metric-version management. Turbo broadcasts moved into Response and Review after_save_commit callbacks; cloud added ResponseBroadcastSafety / ReviewBroadcastSafety mixins (matching the existing ProviderCredentialBroadcastSafety pattern) so broadcasts swallow URL-helper errors in non-request contexts like seeds and background jobs.
  • Signup bot defences. The signup form now ships with an invisible honeypot (random hidden field + form-load timestamp threshold) via the invisible_captcha gem. The Slack signup notification fires on email verification rather than on user create, so bots that haven't validated a real inbox no longer ping the team channel. A daily UnverifiedUserPruningJob deletes accounts that never verified within 7 days, taking their sessions with them via the existing dependent: :destroy chain. Honeypot tripping silently redirects to the same "check your email" success page, so spam-runners can't tell the form rejected them.
  • Engine 0.5.43: judge versioning surface + calibration scoped to the live version. Metrics now carry a proper versions table mirroring prompt versioning (per-metric version_number, published_at, source chips for "AI suggestion" vs hand edits). The trust panel, the disagreements list, and the auto-suggest improvement loop all scope to the metric's current judge version by default, so reverting or republishing the rubric no longer mixes signal across versions. The few-shot examples surface gets a real "forget this" action and the cards are now actually fed back into the judge prompt. Rubric edits show a per-band side-by-side diff (unchanged bands dimmed, changed bands word-diffed) on both the AI-suggested and hand-edited drafts. Response detail page surfaces what other operators on the org said about the row. Schema-only follow-up: the new version_number / published_at columns sit on the already-tenanted completion_kit_judge_versions table, no new tables to wire.
  • Engine 0.5.42: starter metrics pack. The Metrics index now leads with five preconfigured one-click cards (Correctness, Instruction following, Format compliance, Tone, Conciseness) when an org has no metrics yet, and shows whichever haven't been adopted as an "Add a starter metric" row when the index is non-empty. Each card opens a preview modal with the full 1-to-5 rubric the judge will use before anything gets created. Per-org "Don't show this one" dismissals persist. Cloud added cascade-deleting organization_id to the new completion_kit_starter_metric_dismissals table, reworked its uniqueness to be per-org, applied RLS, and shipped isolation specs. The day-seven activation drip email now points at the pack rather than abstract rubric advice.
  • Engine 0.5.40 – 0.5.41: calibration UX cleanup. The "Improve the metric" flow is now a single inline suggestion card on the metric page (no more /improvements page, no banner, no queue), gated on having at least one Disagree verdict so the model has signal to work from. Save feedback is the Save button itself flashing green for 1.4s. The disagreements section becomes "Cases to learn from" with one card per case (no overflowing table at medium widths) and the verdict button reads "Remember this" instead of "Teach the judge". Onboarding panel's redundant divider hairline is gone. Response JSON output gets a light syntax highlight and scrolls horizontally instead of wrapping. No new tables: purely UI and copy.
  • Topbar polish at medium widths: the authenticated topbar now collapses to the hamburger up to 1024px (was 640px) so Dashboard / Prompts / Datasets / Metrics / Runs / Settings / Upgrade / Avatar stop wrapping onto two or three rows on small laptops and split-screen windows. The brand wordmark hides at the same breakpoint so the puzzle logo alone marks the spot. Settings-menu dividers also softened to a near-invisible line so the dropdown reads as one continuous list rather than three boxed groups.
  • Engine 0.5.33 – 0.5.39: privacy/a11y audit follow-through and judge calibration. 0.5.33 self-hosts JetBrains Mono (no more Google Fonts call), tightens functional cookie flags, raises the --ck-dim contrast token, ships PII and data-flow privacy docs, and lands a broad accessibility pass on the engine's authenticated pages. 0.5.34 hardens provider-endpoint SSRF re-validation and adds keyboard-accessible index tables. 0.5.35 – 0.5.38 add judge calibration: per-row agree/disagree/borderline verdicts, a trust panel with Wilson interval and quadratic-weighted κ, a disagreements list that promotes rows into the judge's few-shot examples, and an auto-suggest button that runs the judge model against its own recent disagreements to draft rubric rewrites. 0.5.39 ships the engine stylesheet as plain CSS (no ERB) for Propshaft hosts, and tightens the disagree-pill flow so clicking it without a score reveals the slider instead of producing a validation error. Cloud added cascade-deleting organization_id FKs and RLS to the new completion_kit_judge_versions and completion_kit_calibrations tables.
  • Settings menu: cog icon and "Providers". The Settings trigger in the authenticated topbar is now the same cog-6-tooth heroicon the engine's standalone app uses, instead of the word "Settings". The Provider-credentials menu item and page title shorten to "Providers" to match. Less text in the header, same destinations.
  • 30-day organization purge. A daily job hard-deletes soft-deleted organizations after 30 days, matching the DPA's commitment. Cloud-owned tables cascade via the existing dependent: :destroy chain; engine-owned tables cascade through the on_delete: :cascade FK constraints the cloud-side migrations set when wiring each engine table for tenancy.
  • Self-service account deletion. The account page now has a Delete account section that destroys the user, their sessions, email preferences, and any invitations they've sent. Solo-owned organizations with no other members are deleted along with them; organizations where the user is the sole owner with other members still attached are listed and the deletion is blocked until ownership is transferred. The privacy policy points at this control instead of asking users to email.
  • Cookie consent banner with Google Consent Mode v2: visitors now see an opt-in banner before any GA4 cookies are set. We default consent to denied on every page load. The GA4 script still loads (so it can pick up the granted signal on the same pageview), but analytics_storage only flips to granted after the visitor clicks Accept. The choice is persisted in a first-party ck_cookie_consent cookie. Do Not Track and Global Privacy Control still suppress GA4 entirely, banner included.
  • Privacy: cookie disclosure tightened. Privacy policy now lists every functional cookie we set (name, purpose, type, expiry) instead of claiming a single session cookie. The ck_run_history_notice_dismissed cookie's expiry drops from 5 years to 1 year. Session and functional cookie writes set secure: Rails.env.production?, same_site: :lax explicitly on every path.
  • Privacy: data export broadened. The GDPR export now includes email-subscription state and invitations sent/received, on top of the user, organizations, and sessions it already covered.
  • Privacy: idle sessions purged. A daily job deletes Session rows past the 30-day idle timeout so IP and User-Agent history doesn't accumulate forever.
  • Accessibility statement at /legal/accessibility, linked from the marketing footer alongside the other legal pages. States our WCAG 2.1 AA target, known limitations, and a contact route at accessibility@completionkit.com.
  • Accessibility pass: every layout now declares lang="en" and exposes a Skip-to-content link plus a <main> landmark. Flash messages live in a persistent ARIA live region so screen readers reliably hear them after Turbo navigations. Dropdown menus and the feedback widget drop half-applied role="menu" / role="dialog" and rely on native <details> semantics. The pricing period toggle uses aria-current instead of fake tab roles. Form inputs have a visible focus ring. The docs and pricing pages lead with a proper <h1>. Logo images carry explicit width and height to prevent layout shift. The invitation form requires the email field, and the accept form drops placeholders that duplicated the label.
  • Editable organization slug: owners can now change an organization's slug from settings, behind a confirmation modal that spells out the consequences. The old URL keeps redirecting for 45 days; if it is still getting traffic after 15 days we email the owner a reminder before it stops working.
  • Workspace dashboard. Finishing or skipping onboarding now lands you on a real dashboard instead of the bare prompts list: workspace totals, a 14-day run-activity sparkline, your worst-scoring metric, failed reviews, and version-over-version prompt score changes. It's linked from the topbar. Onboarding steps also now carry a one-line definition of each concept.
  • Landing page for signed-in users: visiting the home page while signed in now shows the landing page with a Dashboard button instead of bouncing straight to your prompts.
  • Free-plan prompt-fetch limit: serving prompts through the API is now metered. Free includes 1,000 fetches a month, Starter 10,000, Team 100,000, Business unlimited. Building and evaluating a prompt stays free; running real production traffic through it is a paid use case.
  • Usage metering. Per-organization run counters with billing-anchored periods, soft warnings at 80%, hard blocks at 100%, and automatic reset each period.
  • Quota gating across all surfaces: the web UI, the REST API (402 Payment Required with a structured body), and the MCP server (isError tool result) all enforce run limits.
  • Usage notification emails: owners get a heads-up at 80%, a "runs blocked" email at 100%, and an "unblocked" note when the new period starts.
  • In-app feedback widget. A small panel on every page that emails support directly, with replies threaded back to you.
  • GA4 + Honeybadger Insights: the activation funnel is tracked in both, plus engine-side usage events in Insights.
  • Legal pages: Terms, Privacy Policy, DPA, responsible-disclosure policy. GDPR data export on the account page. /.well-known/security.txt.
  • Onboarding checklist: new organizations land on a getting-started dashboard with a four-step checklist, plus an optional sample dataset & prompt to poke around.
  • Disaster-recovery runbook: documented restore procedures and a quarterly drill cadence.
  • Drip email scaffold: a four-step activation series for newly-verified users, with one-click unsubscribe.
  • Robot avatars: profile avatars are now a set of distinctive robots; pick one in account settings or upload your own.
  • Branded error pages: 404 / 500 / 4xx pages now match the app's dark, terminal-flavoured look instead of Rails' defaults.
  • No trial period: upgrading to a paid plan now starts billing right away. The free tier is there for anyone who wants to try first.
  • Plans tidied up: API and MCP access on every tier (free included). Unlimited prompts and datasets on every tier. The per-plan member limit is now enforced. SSO and audit logs are labelled "coming soon" on Team rather than implied as shipped.
  • Run-history retention: Free keeps the last 30 days of run history, Starter the last 365; Team and Business are unlimited. Older runs are hidden, never deleted: upgrade to an unlimited-history plan and they reappear. A dismissable banner appears when runs are being hidden.
  • Engine 0.5.7: prompt version history with per-version diff modals; the prompts list shows each prompt's API endpoint with a copy button; new runs inherit the previous run's tags; model pickers and the provider-models table are grouped by OpenAI family / OpenRouter vendor; fixed a spurious "no worker running" banner; cleaner flash messages.
  • API page: everything now lives in one place under Settings → API. Your bearer keys sit in a table (create with a named-and-validated form, copy, revoke), followed by the full REST + MCP reference with copy-paste examples wired to your live key. Replaces the separate "API tokens" page.
  • Engine 0.5.8 – 0.5.9: your prompts' endpoints now sit inside the Prompts section of the API reference; one font size for every code example there (curl + MCP install snippets); fixed the Metric Groups / Tags / Providers reference tabs rendering blank; renaming or retagging a run with results no longer forks a new version; fixed "now ago" relative timestamps.
  • Public API docs: the full REST + MCP reference is now at /docs/api, crawlable and linked from /docs, with generic YOUR_ORG/YOUR_TOKEN placeholders. The in-app version under Settings → API is the same reference, wired to your own keys.
  • Landing redesigned: new hero with an animated "anatomy of a run" visual, refreshed sections (Problem · How a run works · The unlock · Compared · Three ways), dual hero CTAs (free Cloud signup + GitHub with a live star count), footer follow-through.
  • Three ways to run it: Cloud (hosted), the bundled standalone Rails app (self-host), or the source-available Rails engine (embed in your app). Landing, docs, and pricing-page tagline all use the same framing now.
  • Pricing + Docs on-brand: both pages restyled to match the redesigned landing. Docs page rewritten with an Overview (what / problem / how / Next level: agents run the loop), a nine-term concepts glossary, and clearer "pick a place to start" footer.
  • Free tier: 20 runs / month (down from 50). Same product, tighter ceiling on the free side.
  • Judge-only runs: grade an existing column in your dataset without re-running a prompt. Drop in production logs, hand-curated examples, or outputs from another system; the LLM judge scores them against your metrics. (Engine 0.5.11.)
  • Engine 0.5.10 – 0.5.13: new puzzle-piece brand mark; MCP sessions survive Puma restarts and slide their expiry on activity (no more "session not initialized" mid-conversation); prose font sizes normalized across the API reference; prompts-show page tightened. The Response detail page now caps Input / Response / Expected at 28rem and scrolls inside the block (big LLM-as-judge inputs no longer dwarf the page), JSON payloads pretty-print automatically, run-table names truncate so the metadata columns stay clustered, and scrollbars finally pick up dark color-scheme on Safari.
  • Engine 0.5.14 – 0.5.15: judges on reasoning models no longer silently 1-star: reasoning models (GPT-5 Pro family, o-series) charge their reasoning tokens against the chat-completion max_tokens cap. With the previous 1000-token default, ~40% of large judge prompts truncated and an unlucky share returned empty content that the parser coerced to 1 (the worst score on the 1-5 scale). The default cap is now 8192, and both OpenAI and OpenRouter clients raise on truncated or empty completions instead of returning a blank. 0.5.15 closes the remaining gap: the parser now raises JudgeParseError instead of returning {score: 1, feedback: "Could not parse…"}, so parse failures surface as distinct failed reviews (excluded from averages) rather than silent 1-star floors. If you've been seeing inexplicably-low 1-star scores on reasoning-model judges, this is the fix: they should re-judge correctly on the next run.
  • Dashboard recent-runs table matches the rest of the app: same pips, name truncation, judge-only label, and metadata layout as runs/index, dataset/show, and prompt/show.
  • Favicon halo: a thin #64748b ring around the puzzle piece for better legibility against pale browser chrome.
  • Engine 0.5.16: every results table (runs, prompts, datasets, metrics, tags, responses, prompt versions, suggestions, provider models) now uses table-layout: fixed with explicit column widths so long names can't push the table past the section's right edge. Tag pills also got a brightness bump so multi-colored tag rows don't read as "some are washed out."
  • Engine 0.5.17: the run-detail dataset row count is correct again, and long metric names wrap instead of overflowing.
  • Engine 0.5.18 – 0.5.22: the dashboard now surfaces what's actually happening: recent activity, your worst-scoring metric, failed reviews, and recent prompt changes. Alert cards can be dismissed, and quietly reappear if the score moves again. And the whole app is now responsive: a hamburger nav, horizontally-scrollable tables, and stacked headers on small screens.
  • Engine 0.5.23 – 0.5.26: fixed a worker thread race that could disrupt live updates; the dashboard's ignored-metrics flyout is now sorted worst-score-first with a tidier scrollbar; and a second mobile pass tightened the engine's own pages at phone widths: no more horizontal overflow or clipped text.
  • Engine 0.5.27 – 0.5.29: security hardening: SSRF protection on provider API endpoints, and rate limiting added across login, the REST API, the MCP server, and the web UI.
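
The calibration entries above mention a trust panel built on a Wilson interval and a quadratic-weighted κ. The sketch below shows the standard textbook form of both statistics in Python. It is not the engine's implementation: the function names are hypothetical, and it assumes the interval is computed on the share of "agree" verdicts and that κ compares reviewer and judge scores on the 1-to-5 rubric scale.

```python
import math


def wilson_interval(agree: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for an agreement rate (z = 1.96 gives about 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no verdicts yet, so the rate is unconstrained
    p = agree / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def quadratic_weighted_kappa(human: list, judge: list, k: int = 5) -> float:
    """Cohen's kappa with quadratic weights for paired scores in 1..k."""
    n = len(human)
    if n == 0 or len(judge) != n:
        raise ValueError("need two non-empty score lists of equal length")
    observed = [[0] * k for _ in range(k)]
    for h, j in zip(human, judge):
        observed[h - 1][j - 1] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0:
        # Both raters used one identical score throughout; kappa is undefined,
        # and this sketch reports it as perfect agreement (an assumption).
        return 1.0
    return 1 - num / den


low, high = wilson_interval(8, 10)
print(round(low, 3), round(high, 3))
print(quadratic_weighted_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
```

The Wilson interval stays inside [0, 1] and behaves sensibly with only a handful of verdicts, which is presumably why a trust panel would favour it over a plain normal approximation.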

2026-04

  • Cloud launch: completionkit.com and www both serve the hosted app on Render, backed by Supabase Postgres.
  • Stripe billing: Free / Starter / Team plans, Stripe Checkout, Customer Portal, webhook-driven subscription sync, 14-day Team trial.
  • Multi-tenancy: per-organization data isolation via application-layer tenant scoping plus Postgres row-level security.
  • Authentication: sign-up, email verification, password reset, account management, revocable database-backed sessions.
  • Marketing site: landing and pricing pages moved into the cloud app from GitHub Pages.
  • Operations: Honeybadger error tracking, uptime monitoring with a public status page, log aggregation, Mission Control jobs dashboard.