2026-05-10 · PromptFork blog

Claude Sonnet 4.6 vs GPT-4.1 vs Gemini 2.5 Flash: which wins JSON extraction?

We had a question: for structured-output tasks where you just need clean JSON back, which frontier model wins on a cost/quality basis?

The answer matters because most production LLM features aren't writing poetry — they're extracting fields from emails, summarizing tickets, classifying intents. Boring, structured, repetitive. The kind of work where overpaying by 5x for marginal quality gains is just a tax on your margins.

We benchmarked.

Setup

- Dataset: 30 customer support emails (tickets), each extracted to a single JSON object.
- Models: claude-sonnet-4-6, claude-haiku-4-5, gpt-5, gpt-4.1, gemini-2.5-pro, gemini-2.5-flash.
- Metrics: field completeness, hallucination rate (made-up refund amounts), JSON validity, cost, and p50 latency.
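Concretely, an output counted as valid when it parsed and carried the expected typed fields. A minimal sketch of that check — the field names here are our assumption, inferred from the fields discussed later in the post, not anything PromptFork enforces:

```python
import json

# Assumed schema: four fields per ticket (sender, intent, urgency, refund_amount).
# refund_amount may legitimately be null when the ticket asks for no money back.
REQUIRED_FIELDS = {
    "sender": str,
    "intent": str,
    "urgency": str,
    "refund_amount": (int, float, type(None)),
}

def is_valid_extraction(raw: str) -> bool:
    """True iff the model output parses as a JSON object with the typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        field in data and isinstance(data[field], expected)
        for field, expected in REQUIRED_FIELDS.items()
    )
```

A check like this catches both malformed JSON and schema drift, but not hallucinated values — that part was scored by hand.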

Results

| Model | Completeness | Hallucinations | $ / 30 tickets | Latency p50 |
|---|---|---|---|---|
| claude-sonnet-4-6 | 30/30 | 0 | $0.024 | 1.1s |
| claude-haiku-4-5 | 29/30 | 0 | $0.003 | 0.7s |
| gpt-5 | 30/30 | 1 | $0.045 | 1.8s |
| gpt-4.1 | 28/30 | 2 | $0.018 | 1.4s |
| gemini-2.5-pro | 27/30 | 4 | $0.012 | 1.6s |
| gemini-2.5-flash | 26/30 | 3 | $0.001 | 0.9s |

(Numbers are illustrative — run the same suite on your own prompts to get results that actually predict your production behaviour.)

What surprised us

Haiku is the value pick. 96.7% completeness for 8x less cost than Sonnet. For straight extraction with rubric-defined fields, paying for Sonnet is a luxury, not a need.
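The arithmetic behind the value-pick claim, using the illustrative numbers from the table above:

```python
# Illustrative figures from the results table (not real benchmark data).
sonnet = {"complete": 30, "total": 30, "cost": 0.024}
haiku = {"complete": 29, "total": 30, "cost": 0.003}

haiku_completeness = haiku["complete"] / haiku["total"]   # ~0.967, i.e. 96.7%
cost_ratio = sonnet["cost"] / haiku["cost"]               # ~8x cheaper

# Cost per correctly extracted ticket is the number that actually matters:
sonnet_per_correct = sonnet["cost"] / sonnet["complete"]  # ~$0.0008
haiku_per_correct = haiku["cost"] / haiku["complete"]     # ~$0.0001
```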

Gemini 2.5 Flash is fast and cheap and wrong. Three hallucinated refund amounts in 30 tickets is a customer-facing accident waiting to happen. We're not saying Gemini is bad — we're saying Gemini is bad for this *kind* of task. Probably great for creative writing.

GPT-5 doesn't pay for itself on simple tasks. It's a smarter model. But when the task is "return four fields with these types," the smarter model isn't writing better outputs, it's writing the same outputs more slowly and more expensively.

The urgency field was where models diverged most. All six nailed sender and intent; urgency is subjective, and that's where reasoning quality showed up.

How we actually ran this


```shell
pip install promptfork
export PROMPTFORK_API_KEY=pf_xxx

# Push the prompt
promptfork push extract_email --file prompts/extract.txt

# Pin 30 tickets as test cases (script your own loop)
for f in tickets/*.json; do ...; done

# Run all 6 models in parallel
promptfork test extract_email \
  --models claude-sonnet-4-6,claude-haiku-4-5,gpt-5,gpt-4.1,gemini-2.5-pro,gemini-2.5-flash
```

PromptFork fans out one call per (model × test case), captures cost, latency, and tokens, and persists everything. We then exported the run as a CSV and scored hallucinations manually (the LLM judge handles regression detection but not novel correctness scoring — that's still a human's job the first time).
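The aggregation half of that CSV pass is scriptable. A sketch of the kind of post-processing we did — the column names (`model`, `cost_usd`, `latency_ms`) are our assumption, so check them against your own export:

```python
import csv
import io
from collections import defaultdict

def summarize(csv_text: str) -> dict:
    """Aggregate per-model total cost and rough p50 latency from a run export."""
    stats = defaultdict(lambda: {"cost": 0.0, "latencies": []})
    for row in csv.DictReader(io.StringIO(csv_text)):
        s = stats[row["model"]]  # assumed column names, see lead-in
        s["cost"] += float(row["cost_usd"])
        s["latencies"].append(float(row["latency_ms"]))
    return {
        model: {
            "total_cost": round(s["cost"], 4),
            # Upper-median as a cheap p50 stand-in.
            "p50_ms": sorted(s["latencies"])[len(s["latencies"]) // 2],
        }
        for model, s in stats.items()
    }
```

Hallucination scoring stayed manual: no amount of aggregation tells you whether a refund amount was made up.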

Practical takeaway

If you're shipping a structured-output LLM feature today, your stack should probably be:

- A small model (claude-haiku-4-5 in our run) as the default for straight extraction with rubric-defined fields.
- Escalate only when the hard part is multi-step reasoning, not format. Sonnet picks that up.
- Don't default to the biggest model "just in case." You're paying 5-10x for nothing.
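That two-tier policy fits in a few lines of routing code. A hypothetical sketch — the task taxonomy and function name are ours, and the model IDs come from the benchmark above:

```python
# Hypothetical router: cheap model for format-bound work,
# a stronger model only when the task needs multi-step reasoning.
EXTRACTION_MODEL = "claude-haiku-4-5"
REASONING_MODEL = "claude-sonnet-4-6"

# Task kinds where output shape, not reasoning depth, is the hard part.
FORMAT_BOUND = {"extract", "classify", "summarize"}

def pick_model(task_kind: str) -> str:
    """Route rubric-defined tasks to the cheap model, everything else up a tier."""
    return EXTRACTION_MODEL if task_kind in FORMAT_BOUND else REASONING_MODEL
```

The point isn't this exact taxonomy — it's that the routing decision should be driven by your own benchmark numbers, not by defaulting everything to the flagship model.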

You don't need a benchmark blog post to validate this for *your* prompts — you need to run the benchmark on *your* inputs. PromptFork makes that one command. Free tier handles ~50 runs/mo: https://promptfork.online/diff

Try PromptFork Diff. Free tier: 3 prompts, 50 runs/mo, BYOK. Get started →