Git for prompts.
With eval built in.

Version every prompt. Diff outputs across Claude, GPT, and Gemini. Fail your CI on regressions. One command, one dashboard, one source of truth.

$ promptfork test summarize_ticket

→ running v12 vs baseline v11 across 3 models, 30 test cases

✓ claude-sonnet-4-6   30/30 pass   $0.04   1.2s avg
✓ gpt-5               28/30 pass   $0.07   2.1s avg
✗ gemini-2.5-pro      26/30 pass   $0.03   1.8s avg

4 regressions: 2 cosmetic, 1 functional, 1 critical
→ report: https://promptfork.online/runs/r_8f2a91
→ exit 1
Try free — no signup → ⭐ Star on GitHub

Try it — no signup, 3 free runs

Type a prompt, hit Run, see all 3 models side-by-side with cost + latency.

For teams shipping AI features.

If your product calls an LLM in production, every prompt change is a deploy waiting to bite you. PromptFork Diff is the harness you should already have.

Model swaps silently break prompts

GPT-4o → 4.1, Claude 3.5 → 4.6 — output shape shifts, JSON fields drop, tone changes. You find out from a customer ticket.

"Improvements" that regress

A teammate tweaks a prompt to fix one case. Two weeks later, your extraction quality is down 8% — and nobody knows when.

"Which model is cheapest?"

Currently answered by a junior dev pasting 5 inputs into 3 playgrounds. Half a day of work, every model release.

No CI gate for prompts

Code has tests. Prompts have vibes. A bad prompt change passes review and breaks features in prod.

How it works.

Six primitives. No lock-in. BYOK supported.

01

Version every prompt

Like git for prompt strings — semver-style tags, a diff view, one-click rollback.
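A sketch of what that loop could look like. Only `promptfork test` appears in the demo above; the other subcommands here are illustrative guesses at the workflow, not documented commands.

$ promptfork tag summarize_ticket v12        # cut a new tagged version (hypothetical command)
$ promptfork diff summarize_ticket v11 v12   # view the prompt diff (hypothetical command)
$ promptfork rollback summarize_ticket v11   # one-click rollback (hypothetical command)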

02

Pin test cases

JSON or YAML inputs + expected outputs. Lives next to the prompt, version-controlled together.
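One way a pinned case might look on disk. The file path and field names below are assumptions for illustration; the only confirmed contract is YAML/JSON inputs plus expected outputs, versioned with the prompt.

# tests/summarize_ticket.yaml (hypothetical schema)
prompt: summarize_ticket
version: v12
cases:
  - input:
      ticket: "App crashes on login since yesterday's update. iOS only."
    expect:
      json_fields: [summary, severity, component]
      contains: ["login", "crash"]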

03

Run cross-model

Claude, GPT, Gemini, in parallel. Captures cost, latency, token usage per call.
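A minimal sketch of declaring the model matrix, reusing the three models from the demo run above; the config file name and keys are assumptions.

# promptfork.yaml (hypothetical config)
models:
  - claude-sonnet-4-6
  - gpt-5
  - gemini-2.5-pro
parallel: true   # fan calls out concurrently; cost, latency, tokens captured per call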

04

LLM-judge regression scoring

A Haiku ensemble votes on whether the candidate output is strictly worse than the baseline. Regressions are scored severity 1–3.
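A verdict record might serialize to something like this. The field names are illustrative; the mechanics (ensemble votes, worse-than-baseline check, severity 1–3) come from the description above.

{
  "case": 17,
  "votes": ["worse", "worse", "same"],
  "verdict": "regression",
  "severity": 3
}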

05

Block bad PRs

GitHub Action runs `promptfork test`, comments the diff matrix, exits non-zero on regression.
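A workflow sketch under stated assumptions: the install step and secret names are guesses, and `promptfork test` exiting non-zero on regression is the only behavior confirmed above.

# .github/workflows/prompts.yml (illustrative)
name: prompt-regression
on: pull_request
jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfork          # install method is an assumption
      - run: promptfork test summarize_ticket   # exits 1 on regression, failing the PR
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # BYOK; secret names assumed
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}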

06

Cost + latency telemetry

Every call captures input/output tokens, $ cost, and latency. Pick the cheapest model that still passes your suite.
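Per-call telemetry might look something like this record; the field names and values are assumptions, but each category (input/output tokens, dollar cost, latency) is named above.

{
  "model": "gemini-2.5-pro",
  "input_tokens": 812,
  "output_tokens": 164,
  "cost_usd": 0.0009,
  "latency_ms": 1780
}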

Pricing.

BYOK keeps inference costs on your own API bill. We charge for the harness.

Free

$0
  • 3 prompts
  • 50 test runs/mo
  • BYOK
  • OSS CLI
Start free

Team

$199/mo
  • 5 seats
  • 10,000 runs/mo
  • Shared workspace
  • PR comment integration

Startup

$499/mo
  • 20 seats
  • 50,000 runs/mo
  • Audit log
  • Priority support