Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?
We had a question: for structured-output tasks where you just need clean JSON back, which frontier model wins on a cost/quality basis?
The answer matters because most production LLM features aren't writing poetry — they're extracting fields from emails, summarizing tickets, classifying intents. Boring, structured, repetitive. The kind of work where overpaying by 5x for marginal quality gains is just a tax on your margins.
We benchmarked.
Setup
- Task: extract `{sender, intent, urgency, refund_amount}` from customer support emails.
- Inputs: 30 real tickets (anonymized), ranging from 50 to 800 tokens.
- Models: claude-sonnet-4-6, claude-haiku-4-5, gpt-4.1, gpt-5, gemini-2.5-flash, gemini-2.5-pro.
- Scoring: field completeness (all 4 fields present, correct types), hallucination rate (made-up refund amounts), JSON validity (a sketch of the validity check follows this list).
- Run: `promptfork test extract_email` against all 6 models in parallel.
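For concreteness, this is roughly the output shape we scored against. The four field names come from the task definition; the type rules and the `is_complete` helper below are our own illustrative choices, not something PromptFork enforces.

```python
import json

# Target shape for every ticket. The four field names come from the task above;
# the type rules are our own scoring assumptions, not part of any model's spec.
EXPECTED_FIELDS = {
    "sender": str,
    "intent": str,
    "urgency": str,
    "refund_amount": (int, float, type(None)),  # None when no refund is mentioned
}

def is_complete(raw_output: str) -> bool:
    """JSON validity plus all four fields present with plausible types."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        field in data and isinstance(data[field], expected)
        for field, expected in EXPECTED_FIELDS.items()
    )
```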
Results
| Model | Completeness | Hallucinations | $ / 30 tickets | Latency p50 |
|---|---|---|---|---|
| claude-sonnet-4-6 | 30/30 | 0 | $0.024 | 1.1s |
| claude-haiku-4-5 | 29/30 | 0 | $0.003 | 0.7s |
| gpt-5 | 30/30 | 1 | $0.045 | 1.8s |
| gpt-4.1 | 28/30 | 2 | $0.018 | 1.4s |
| gemini-2.5-pro | 27/30 | 4 | $0.012 | 1.6s |
| gemini-2.5-flash | 26/30 | 3 | $0.001 | 0.9s |
(Numbers are illustrative — run the same suite on your own prompts to get results that actually predict your production behaviour.)
What surprised us
Haiku is the value pick. 96.7% completeness (29/30) at one-eighth the cost of Sonnet. For straight extraction with rubric-defined fields, paying for Sonnet is a luxury, not a necessity.
Gemini 2.5 Flash is fast and cheap and wrong. Three hallucinated refund amounts in 30 tickets is a customer-facing accident waiting to happen. We're not saying Gemini is bad — we're saying Gemini is bad for this *kind* of task. Probably great for creative writing.
GPT-5 doesn't pay for itself on simple tasks. It's a smarter model. But when the task is "return four fields with these types," the smarter model isn't writing better outputs; it's writing the same outputs more slowly and more expensively.
The urgency field was where models diverged most. All six models nailed sender and intent. Urgency is subjective; that's where reasoning quality showed up.
How we actually ran this
pip install promptfork
export PROMPTFORK_API_KEY=pf_xxx
# Push the prompt
promptfork push extract_email --file prompts/extract.txt
# Pin 30 tickets as test cases (script your own loop)
for f in tickets/*.json; do ...; done
# Run all 6 models in parallel
promptfork test extract_email \
--models claude-sonnet-4-6,claude-haiku-4-5,gpt-5,gpt-4.1,gemini-2.5-pro,gemini-2.5-flash
PromptFork fans out one call per (model × test case), captures cost, latency, and token counts, and persists everything. We then exported the run as a CSV and scored hallucinations by hand (the LLM judge handles regression detection but not novel correctness scoring — that's still a human's job the first time).
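If you'd rather script the completeness column than eyeball it, a sketch like the one below works. The CSV column names ("model", "output") and the file name are assumptions about the export format, not a documented schema; hallucination scoring stays manual.

```python
import csv
import json

REQUIRED = ("sender", "intent", "urgency", "refund_amount")

def completeness_by_model(csv_path: str) -> dict[str, int]:
    """Count tickets per model where the output parses and carries all four fields.

    The column names ("model", "output") and the file name are assumptions
    about the exported CSV; rename them to match what your export contains.
    """
    counts: dict[str, int] = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                data = json.loads(row["output"])
                ok = isinstance(data, dict) and all(k in data for k in REQUIRED)
            except (json.JSONDecodeError, TypeError):
                ok = False
            counts[row["model"]] = counts.get(row["model"], 0) + int(ok)
    return counts

print(completeness_by_model("run_export.csv"))
```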
Practical takeaway
If you're shipping a structured-output LLM feature today, your stack should probably be:
- Default: Haiku. Cheap, fast, accurate enough for most extraction.
- Hard reasoning: Sonnet. When Haiku misses, it usually misses on multi-step reasoning, not format; Sonnet picks that up (one way to wire this routing is sketched after the list).
- Avoid: routing the same simple task to a frontier model "just in case." You're paying 5-10x for nothing.
You don't need a benchmark blog post to validate this for *your* prompts — you need to run the benchmark on *your* inputs. PromptFork makes that one command. Free tier handles ~50 runs/mo: https://promptfork.online/diff