Git for prompts.
With eval built in.

Version every prompt. Diff outputs across Claude, GPT, and Gemini. Fail your CI on regressions. One command, one dashboard, one source of truth.

$ promptfork test summarize_ticket

→ running v12 vs baseline v11 across 3 models, 30 test cases

✓ claude-sonnet-4-6   30/30 pass   $0.04   1.2s avg
✓ gpt-5               28/30 pass   $0.07   2.1s avg
✗ gemini-2.5-pro      26/30 pass   $0.03   1.8s avg

4 regressions: 2 cosmetic, 1 functional, 1 critical
→ report: https://promptfork.online/runs/r_8f2a91
→ exit 1
Try free — no signup → ⭐ Star on GitHub

Try it — no signup, 3 free runs

Type a prompt, hit Run, see all 3 models side-by-side with cost + latency.

For teams shipping AI features.

If your product calls an LLM in production, every prompt change is a deploy waiting to bite you. PromptFork Diff is the harness you should already have.

Model swaps silently break prompts

GPT-4o → 4.1, Claude 3.5 → 4.6 — output shape shifts, JSON fields drop, tone changes. You find out from a customer ticket.

"Improvements" that regress

A teammate tweaks a prompt to fix one case. Two weeks later, your extraction quality is down 8% — and nobody knows when.

"Which model is cheapest?"

Currently answered by a junior dev pasting 5 inputs into 3 playgrounds. Half a day of work, every model release.

No CI gate for prompts

Code has tests. Prompts have vibes. A bad prompt change passes review and breaks features in prod.

How it works.

Six primitives. No lock-in. BYOK supported.

01

Version every prompt

Like git for prompt strings — semver-style tags, a diff view, one-click rollback.
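A sketch of what that loop could look like. Only `promptfork test` appears in the demo above; the other subcommands here are illustrative guesses at the workflow, not documented commands.

$ promptfork tag summarize_ticket v12        # cut a new tagged version (hypothetical command)
$ promptfork diff summarize_ticket v11 v12   # view the prompt diff (hypothetical command)
$ promptfork rollback summarize_ticket v11   # one-click rollback (hypothetical command)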

02

Pin test cases

JSON or YAML inputs + expected outputs. Lives next to the prompt, version-controlled together.
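One way a pinned case might look on disk. The file path and field names below are assumptions for illustration; the only confirmed contract is YAML/JSON inputs plus expected outputs, versioned with the prompt.

# tests/summarize_ticket.yaml (hypothetical schema)
prompt: summarize_ticket
version: v12
cases:
  - input:
      ticket: "App crashes on login since yesterday's update. iOS only."
    expect:
      json_fields: [summary, severity, component]
      contains: ["login", "crash"]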

03

Run cross-model

Claude, GPT, Gemini, in parallel. Captures cost, latency, token usage per call.
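A minimal sketch of declaring the model matrix, reusing the three models from the demo run above; the config file name and keys are assumptions.

# promptfork.yaml (hypothetical config)
models:
  - claude-sonnet-4-6
  - gpt-5
  - gemini-2.5-pro
parallel: true   # fan calls out concurrently; cost, latency, tokens captured per call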

04

LLM-judge regression scoring

A Haiku ensemble votes on whether the candidate output is strictly worse than the baseline. Regressions are scored severity 1–3.
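A verdict record might serialize to something like this. The field names are illustrative; the mechanics (ensemble votes, worse-than-baseline check, severity 1–3) come from the description above.

{
  "case": 17,
  "votes": ["worse", "worse", "same"],
  "verdict": "regression",
  "severity": 3
}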

05

Block bad PRs

GitHub Action runs `promptfork test`, comments the diff matrix, exits non-zero on regression.
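A workflow sketch under stated assumptions: the install step and secret names are guesses, and `promptfork test` exiting non-zero on regression is the only behavior confirmed above.

# .github/workflows/prompts.yml (illustrative)
name: prompt-regression
on: pull_request
jobs:
  test-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptfork          # install method is an assumption
      - run: promptfork test summarize_ticket   # exits 1 on regression, failing the PR
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # BYOK; secret names assumed
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}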

06

Cost + latency telemetry

Every call captures input/output tokens, $ cost, and latency. Pick the cheapest model that still passes your suite.
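Per-call telemetry might look something like this record; the field names and values are assumptions, but each category (input/output tokens, dollar cost, latency) is named above.

{
  "model": "gemini-2.5-pro",
  "input_tokens": 812,
  "output_tokens": 164,
  "cost_usd": 0.0009,
  "latency_ms": 1780
}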

Pricing.

BYOK keeps inference costs on your own API bill. We charge for the harness.

Free

$0
  • 3 prompts
  • 50 test runs/mo
  • BYOK
  • OSS CLI
Start free

Team

$199/mo
  • 5 seats
  • 10,000 runs/mo
  • Shared workspace
  • PR comment integration

Startup

$499/mo
  • 20 seats
  • 50,000 runs/mo
  • Audit log
  • Priority support