2026-05-10 · PromptFork blog

Prompt regression testing in CI: a 5-minute setup

Your code has tests. Your code has a CI pipeline. A bad change can't merge without going green.

Your prompts? Vibes. A teammate edits the system prompt to fix one customer complaint, output quality drops 8% on the other 99% of cases, nobody notices for a month, and the regression eventually surfaces as a mysterious churn bump in the metrics deck.

This post is the 5-minute setup that closes that gap.

What "tests for prompts" actually means

There are two viable approaches, and you need to know which to use when.

Assertion-based. You write code that checks the LLM output against fixed rules: regex matches, JSON shape validation, field-presence checks, length bounds. Fast, cheap, deterministic.

Use it when: the output is structured and the contract is rigid. JSON extraction, classification, function-call payloads, schema-conformant generation.
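For a concrete sense of what that looks like, here's a minimal assertion-style check in Python. The function name, field names, and thresholds are invented for illustration; this is the kind of custom check you'd write yourself, not part of PromptFork.

import json
import re

def check_extract_email(output: str) -> list[str]:
    """Assertion-style checks for a hypothetical email-extraction prompt.
    Returns a list of failure messages; an empty list means the output passes."""
    failures = []

    # Length bound: catch runaway generations.
    if len(output) > 500:
        failures.append("output longer than 500 characters")

    # JSON shape: the output must parse at all.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
        return failures

    # Field presence: the contract requires these keys.
    for field in ("email", "confidence"):
        if field not in data:
            failures.append(f"missing field: {field}")

    # Regex match: a loose sanity check on the extracted value.
    email = data.get("email")
    if isinstance(email, str) and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email):
        failures.append(f"not a plausible email: {email!r}")

    return failures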

LLM-judge. Another LLM compares the candidate output to a baseline and returns "regressed: yes/no" with a severity score. Slower, costs a few cents per comparison, handles fuzzy outputs.

Use it when: the output is freeform — summaries, rewrites, creative generation, anything where two correct answers can look very different.

A mature setup uses both. PromptFork ships the LLM-judge built in (we chose Claude Haiku at temp 0 with a strict "only flag strictly worse" rubric); assertions are easy to add yourself in custom test cases.
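If you've never written a judge yourself, the comparison is roughly this shape. The sketch below uses the Anthropic Python SDK directly; the model ID, rubric wording, and response format are assumptions for illustration, not PromptFork's actual implementation.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_RUBRIC = """You are comparing two outputs for the same prompt and input.
Only flag a regression if the candidate is strictly worse than the baseline.
Reply with JSON: {"regressed": true or false, "severity": 0-3, "reason": "..."}"""

def judge(baseline: str, candidate: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed model ID; use whichever Haiku you run
        max_tokens=200,
        temperature=0,                    # keep the judge as deterministic as possible
        system=JUDGE_RUBRIC,
        messages=[{
            "role": "user",
            "content": f"BASELINE:\n{baseline}\n\nCANDIDATE:\n{candidate}",
        }],
    )
    return response.content[0].text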

The 5-minute setup

1. Pin your prompts in version control


prompts/
  summarize_ticket.txt
  extract_email.txt
  classify_intent.txt

Plain text files. Not constants in prompts.py. Not Notion docs. Files with a git history.
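In application code, that just means reading the file at runtime instead of importing a constant. A minimal sketch; the directory layout and function name are whatever your project uses:

from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    # e.g. load_prompt("summarize_ticket") reads prompts/summarize_ticket.txt
    return (PROMPT_DIR / f"{name}.txt").read_text()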

2. Push them to PromptFork


pip install promptfork
export PROMPTFORK_API_KEY=pf_xxxx

for f in prompts/*.txt; do
  name=$(basename "$f" .txt)
  promptfork push "$name" --file "$f" --message "initial commit"
done

This creates v1 of each prompt server-side and gives you a stable identifier.

3. Add test cases

For each prompt, pin 5-30 representative inputs. Real production inputs are worth 10x synthetic ones.


promptfork add-test summarize_ticket happy_path \
  --input ticket="Order arrived. Loved it." \
  --rubric "summary should be positive and under 20 words"

promptfork add-test summarize_ticket angry_refund \
  --input ticket="3 weeks late, want money back NOW" \
  --rubric "must mention refund and high urgency"

promptfork add-test summarize_ticket edge_garbled \
  --input ticket="hi pls help thx" \
  --rubric "summary should request more info, not invent details"

Three test cases is a starting point. Six is a good baseline. Thirty is production-grade.

4. Wire the GitHub Action


# .github/workflows/prompt-tests.yml
name: Prompt Regression Tests
on:
  pull_request:
    paths:
      - 'prompts/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Push current prompts
        env:
          PROMPTFORK_API_KEY: ${{ secrets.PROMPTFORK_API_KEY }}
        run: |
          pip install promptfork
          for f in prompts/*.txt; do
            name=$(basename "$f" .txt)
            promptfork push "$name" --file "$f" \
              --message "PR #${{ github.event.pull_request.number }}"
          done
      - uses: shaunvand/promptfork-cli@v0
        with:
          prompt: summarize_ticket
          baseline: 1
          api-key: ${{ secrets.PROMPTFORK_API_KEY }}

Add the secret at Settings → Secrets → PROMPTFORK_API_KEY. Done.

5. Open a PR that changes a prompt

The action runs, executes your prompt across Claude/GPT/Gemini, has the LLM-judge compare each output against your baseline version, and posts a PR comment with the regression report. If anything regresses, the action exits non-zero, branch protection blocks the merge, the change goes back for review.

You now have a CI gate for prompts. The same gate you have for code.

What goes in the test suite

After running this on a few projects, here's the pattern that works:

One happy-path case: a typical input and the output you expect most of the time.

One edge case: empty input, the wrong language, malformed structure.

One adversarial case: prompt injection, conflicting instructions, a customer trying to break the bot.

That's 3 per prompt. If a prompt is mission-critical, scale to 10-30.

What goes wrong if you don't do this

We've seen this play out enough times to predict it:

1. New model drops. Team migrates. "Looks fine in playground." Ships.
2. Quality degrades 5-15% on a subset of inputs. No alert fires.
3. Customer support volume creeps up. Nobody connects the dots.
4. Three weeks later, churn ticks up. "Why?"
5. Eventually somebody runs an A/B back-test and finds the regression.
6. Rollback. Apology emails. Deck slide titled "Lessons Learned."

Avoiding that whole loop costs six commands and an afternoon.

PromptFork has a free tier (3 prompts, 50 runs/mo, bring your own keys) that's enough for the setup above on a small project: https://promptfork.online/diff
