Version every prompt. Diff outputs across Claude, GPT, and Gemini. Fail your CI on regressions. One command, one dashboard, one source of truth.
Type a prompt, hit Run, see all 3 models side-by-side with cost + latency.
If your product calls an LLM in production, every prompt change is a deploy waiting to bite you. PromptFork Diff is the harness you should already have.
GPT-4o → 4.1, Claude 3.5 → 4.6 — output shape shifts, JSON fields drop, tone changes. You find out from a customer ticket.
A teammate tweaks a prompt to fix one case. Two weeks later, your extraction quality is down 8%, and nobody knows which change did it.
Today the answer comes from a junior dev pasting 5 inputs into 3 playgrounds. Half a day of work, every model release.
Code has tests. Prompts have vibes. A bad prompt change passes review and breaks features in prod.
Five primitives. No lock-in. BYOK (bring your own keys) supported.
Like git for prompt strings: semver-style tags, a diff view, one-click rollback.
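A versioned prompt file might look like this. Filename, fields, and the semver tag are illustrative, not a documented schema:

```yaml
# prompts/support-triage.yaml (hypothetical layout)
id: support-triage
version: 1.4.0                  # semver-style tag, bumped on every edit
prompt: |
  You are a support triage assistant.
  Classify the ticket into one of: billing, bug, feature_request.
  Reply as JSON: {"category": "...", "confidence": 0.0-1.0}
```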
JSON or YAML inputs + expected outputs. Lives next to the prompt, version-controlled together.
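A case file next to that prompt could pair inputs with expected outputs. Field names here are assumptions, not the shipped format:

```yaml
# prompts/support-triage.cases.yaml (hypothetical format)
cases:
  - name: refund-request
    input:
      ticket: "I was charged twice for March. Please refund one charge."
    expect:
      category: billing
  - name: crash-report
    input:
      ticket: "The app crashes on launch since yesterday's update."
    expect:
      category: bug
```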
Claude, GPT, Gemini, in parallel. Captures cost, latency, token usage per call.
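One plausible way to declare that matrix; the config layout is assumed, and the model identifiers are placeholders:

```yaml
# promptfork.yaml (hypothetical config)
models:
  - claude-sonnet-4-5           # identifiers are placeholders
  - gpt-4.1
  - gemini-2.5-pro
parallel: true                  # fan out one call per model
record: [cost, latency, tokens] # captured for every call
```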
A Haiku ensemble votes on whether the candidate output is strictly worse than the baseline. Regressions get a severity of 1–3.
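A sketch of how the judge could be tuned, assuming the vote count and severity gate are configurable knobs (they may not be):

```yaml
# judge block of promptfork.yaml (hypothetical knobs)
judge:
  model: claude-haiku           # cheap ensemble judge; identifier is a placeholder
  votes: 3                      # majority decides "strictly worse than baseline"
  fail_on_severity: 2           # block severity 2-3, comment-only on severity 1
```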
A GitHub Action runs `promptfork test`, comments the diff matrix on the PR, and exits non-zero on regression.
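A minimal workflow wiring that in. The Actions syntax is standard; everything about the `promptfork` CLI beyond the `test` command is assumed:

```yaml
# .github/workflows/prompts.yml
name: prompt-regression
on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # assumes the promptfork CLI is already installed on the runner
      - run: promptfork test    # exits non-zero on regression, failing the check
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
```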
Every call captures input/output tokens, $ cost, and latency. Pick the cheapest model that still passes your suite.
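The record behind one captured call might look roughly like this; every value below is a made-up placeholder:

```yaml
# one captured call (illustrative values only)
model: gpt-4.1
tokens: {input: 812, output: 164}
cost_usd: 0.0030
latency_ms: 930
passed: true
```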
BYOK keeps inference costs your own. We charge for the harness.