2026-05-10 · PromptFork blog

How a model upgrade silently broke our extraction prompt (and how we caught it)

A friend's product summarizes customer support tickets using a carefully tuned LLM prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated 4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said "looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on basically everything since last Wednesday."

The prompt asked for {"intent": "...", "urgency": "low|medium|high"}. On 4o, the model returned exactly that. On 4.1, it started returning {"intent": "...", "urgency_level": "..."} — semantically identical, but the downstream classifier was indexing on urgency and silently fell through to a default value of "low" on 100% of new tickets.
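
Here is roughly what the downstream code was doing. This is a reconstruction for illustration, not the team's actual code; the function and variable names are made up:

import json

def classify_urgency(raw_model_output: str) -> str:
    payload = json.loads(raw_model_output)  # still valid JSON on 4.1, so nothing throws
    # The renamed key ("urgency_level") is simply ignored and the default wins.
    return payload.get("urgency", "low")

classify_urgency('{"intent": "refund", "urgency": "high"}')        # 4o era -> "high"
classify_urgency('{"intent": "refund", "urgency_level": "high"}')  # 4.1 era -> "low", silently

The one-line hardening is to index the key directly (payload["urgency"]) so the first missing field raises a KeyError instead of quietly defaulting.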

Nobody saw it because nothing failed loudly: the JSON still parsed, no exception was thrown, "low" is a perfectly plausible urgency for any single ticket, and the playground smoke test had only ever shown one output at a time.

This is the silent regression problem. Code has tests; prompts have vibes.

Three categories of model-swap failure

Across a dozen of these incidents, the failures cluster into three groups. Knowing which kind you're looking at tells you what to test.

1. Format drift. The model decides to rename a field, drop a field, add a field you didn't ask for, or change list ordering. JSON still parses. Your downstream code breaks.

2. Reasoning regression. The model is "improved" but loses a hidden constraint your prompt depended on. Classic example: GPT-4 reliably extracted *all* requirements from a contract; GPT-4-Turbo extracted "the most important ones," dropping 15-20% of clauses. The format was fine. The data was wrong.

3. Tone shift. Less common but expensive. The new model's outputs are more verbose, less verbose, friendlier, blunter. If anything downstream (another model, a regex, a fuzzy matcher) was tuned to the old tone, it breaks. A minimal sketch of checks for all three categories follows.
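
To make the three categories testable, here is one cheap check per category. This is illustrative, not how PromptFork implements its checks; the field names and thresholds are assumptions you would tune to your own prompt:

from typing import Any

ALLOWED_URGENCY = {"low", "medium", "high"}

def check_format(output: dict[str, Any]) -> bool:
    # 1. Format drift: exact keys and a closed value vocabulary.
    return set(output) == {"intent", "urgency"} and output["urgency"] in ALLOWED_URGENCY

def check_completeness(candidate_items: list[str], baseline_items: list[str]) -> bool:
    # 2. Reasoning regression: the new model shouldn't drop items the baseline found.
    return len(candidate_items) >= 0.9 * len(baseline_items)  # 10% tolerance is an assumption

def check_tone(candidate_text: str, baseline_text: str) -> bool:
    # 3. Tone shift: crude proxy; flag outputs 2x longer or shorter than the baseline.
    ratio = len(candidate_text) / max(len(baseline_text), 1)
    return 0.5 <= ratio <= 2.0

None of these needs an LLM judge; they are cheap assertions you can run on every pinned case, escalating only the fuzzier comparisons to a judge.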

What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape. On model swap day:


$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)

Five lines of output. Seven seconds. Two-week customer-facing bug avoided.
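
Concretely, each of the 30 cases is just a real ticket plus the shape you expect back. One case might look like this, shown as Python literals rather than the JSON files the script below pins, with contents invented for this post:

# tickets/8841.json -- one pinned ticket (contents invented for illustration)
ticket = {
    "subject": "Checkout is down, we're losing orders",
    "body": "Every purchase has returned a 500 since 09:00 UTC.",
}

# What the suite asserts about the model's answer for this ticket
expected_keys = {"intent", "urgency"}   # exact field names, no renames tolerated
expected_urgency = "high"               # unambiguous here; most cases only pin the shape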

How to actually do this

The setup for the team that got bitten took four minutes:


pip install promptfork

# Save the current production prompt, version 1
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1

That's it. The --baseline flag is what catches drift — it pulls the baseline output, runs the candidate, and asks Claude Haiku to compare them under a strict "only flag strictly worse" rubric.
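
If you're curious what that judge step amounts to, here is a rough sketch using the Anthropic SDK directly. It is an approximation of the idea, not PromptFork's actual implementation; the model id, rubric wording, and function name are assumptions:

import anthropic

JUDGE_RUBRIC = (
    "You are comparing two outputs for the same support ticket. "
    "Reply REGRESSION only if the candidate is strictly worse than the baseline "
    "(missing fields, renamed fields, wrong values, dropped content). "
    "Otherwise reply OK."
)

def judge(baseline_output: str, candidate_output: str) -> bool:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=16,
        system=JUDGE_RUBRIC,
        messages=[{
            "role": "user",
            "content": f"BASELINE:\n{baseline_output}\n\nCANDIDATE:\n{candidate_output}",
        }],
    )
    return "REGRESSION" in message.content[0].text

The asymmetric rubric matters: a neutral "which output is better" prompt tends to flag harmless stylistic differences, while "only flag strictly worse" keeps the report focused on real regressions.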

The CI version

The same command in a GitHub Action means *no prompt change ever ships* without running against a known-good baseline:


- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}

The action exits non-zero on regression. Branch protection blocks the merge.

If you ship LLM features, you need this. The first time it catches a silent regression, it pays for itself a hundred times over. PromptFork has a free tier (3 prompts, 50 runs/mo) at https://promptfork.online/diff — set it up in five minutes, sleep better forever.

Try PromptFork Diff free tier: 3 prompts, 50 runs/mo, BYOK.