Blog

Notes on prompt engineering, regression testing, and shipping LLM features without breaking them.

2026-05-10

How a model upgrade silently broke our extraction prompt (and how we caught it)

A GPT-4o to GPT-4.1 migration silently broke our extraction prompt. Here is the 6-line setup that would have caught it.

2026-05-10

Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

Six frontier models benchmarked on JSON extraction. Haiku is the value pick; Gemini Flash is fast and wrong.

2026-05-10

Prompt regression testing in CI: a 5-minute setup

Pin prompts, write test cases, and gate PRs with a GitHub Action. Stop shipping vibes.