Haiku 4.5 Findings
Claude Haiku 4.5, when paired with Jimmy’s Workflow, delivered 1.8x faster execution, 67% lower cost, and 5% higher quality scores than Claude Sonnet 4.5 operating without a structured workflow. This was the opposite of the expected result.
Executive summary
| Metric | Larger model (Sonnet 4.5) | Smaller model + workflow (Haiku 4.5) | Difference |
|---|---|---|---|
| Average speed | 12.9s per query | 7.6s per query | 1.8x faster |
| Cost (input/output per 1M tokens) | $3 / $15 | $1 / $5 | 67% cheaper |
| Quality score | 4.4 / 5 | 4.6 / 5 | +5% (Haiku higher) |
| Success rate | 71% | 86% | +15 percentage points |
Testing was conducted on a content processing platform with a production database of 3,833 articles across 29 RSS sources.
The synergy discovery
The central finding was not that Haiku is “good enough” — it is that Jimmy’s Workflow eliminates Haiku’s primary weakness while amplifying its primary strength.
Haiku’s weakness: Less implicit inference. Without explicit structure, Haiku makes more assumptions and misses nuance that larger models handle intuitively.
Haiku’s strength: Fast, precise pattern execution. When told exactly what to do and how to validate it, Haiku follows instructions with higher fidelity than Sonnet.
Jimmy’s Workflow provides: Explicit thinking templates, mandatory validation gates, progressive disclosure patterns, and structured output requirements.
The result is a synergy — the workflow compensates for the model’s limitation and the model excels at the workflow’s demands. Haiku does not merely tolerate the structured workflow. It thrives on it.
Quantitative results
| Metric | Sonnet 4.5 | Haiku 4.5 |
|---|---|---|
| Average response time | 12.9s | 7.6s |
| Consistency | Variable (high variance) | Consistent (low variance) |
| Speed advantage | Baseline | 1.8x faster |
Haiku was not only faster on average but more predictable. The lower variance matters for production systems where consistent response times affect user experience.
| Metric | Sonnet 4.5 | Haiku 4.5 |
|---|---|---|
| Input cost (per 1M tokens) | $3.00 | $1.00 |
| Output cost (per 1M tokens) | $15.00 | $5.00 |
| Cost ratio | Baseline | 67% cheaper |
The 67% cost reduction reflects token pricing alone. Combined with the speed improvement and higher success rate, the effective savings are larger still once retries and timeouts are accounted for.
Quality
| Metric | Sonnet 4.5 | Haiku 4.5 + workflow |
|---|---|---|
| Quality score | 4.4 / 5 | 4.6 / 5 |
| Success rate | 71% | 86% |
This was the unexpected result. The smaller model with structured workflow scored higher on quality than the larger model. The explanation lies in the behavioural differences documented below.
What works excellently with smaller models + structured workflow
Explicit thinking templates
Haiku follows thinking templates more thoroughly than Sonnet. Where Sonnet sometimes abbreviates its reasoning (treating steps as obvious), Haiku executes every step of the template with full detail. The result is more transparent, more auditable reasoning.
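The actual Jimmy’s Workflow templates are not reproduced here, so the following is a minimal, hypothetical sketch of what an explicit thinking template can look like. The step wording, placeholders, and `build_prompt` helper are all illustrative, not taken from the tested workflow.

```python
# Hypothetical thinking template: every reasoning step is spelled out so the
# model cannot treat any of them as "obvious" and skip ahead.
THINKING_TEMPLATE = """\
Task: {task}

Work through every step below in order. Do not skip or merge steps.
1. Restate the task in your own words.
2. List the inputs you have and the assumptions you are making.
3. Outline your approach before executing it.
4. Execute the approach, showing each intermediate result.
5. Validate the result against the success criteria: {criteria}
6. Output the final answer in the required format: {output_format}
"""

def build_prompt(task: str, criteria: str, output_format: str) -> str:
    """Fill the template so every expectation is explicit."""
    return THINKING_TEMPLATE.format(
        task=task, criteria=criteria, output_format=output_format
    )
```

The point of the numbered steps is precisely the behaviour described above: a model that follows instructions with high fidelity will emit all six stages, making its reasoning auditable.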
Mandatory validation gates
Haiku never skipped a validation gate during testing. Not once. Sonnet occasionally treated validation as a formality, producing abbreviated checks. Haiku treated every gate as a genuine checkpoint, running full validation each time.
Progressive disclosure patterns
When given multi-step tasks with progressive complexity, Haiku executed the progression flawlessly. Each step built cleanly on the previous one, with no shortcuts or collapsed steps.
Safety behaviour
Haiku was more cautious and more detailed in its safety checks than Sonnet. It flagged more edge cases, asked more clarifying questions, and documented more assumptions. For production systems, this is a feature, not a limitation.
Self-correction via validation gates
When Haiku’s initial output contained an error, the validation gate triggered self-correction. The structured workflow creates a feedback loop that catches and fixes mistakes before they reach the output. This self-healing property is critical for reliability.
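The feedback loop can be sketched in a few lines. This is an illustrative pattern, not the workflow’s actual implementation: `call_model`, `validate`, and the retry limit are placeholders for your model client and task-specific checks.

```python
from typing import Callable

def run_with_validation(
    call_model: Callable[[str], str],
    validate: Callable[[str], list],  # returns a list of failure descriptions
    prompt: str,
    max_attempts: int = 3,  # retry budget is an assumption
) -> str:
    """Run a prompt through a validation gate, feeding failures back
    to the model so it can self-correct before anything is emitted."""
    output = call_model(prompt)
    for _ in range(max_attempts - 1):
        failures = validate(output)
        if not failures:
            return output  # gate passed
        # Gate failed: describe the failures and ask for a corrected answer.
        correction_prompt = (
            f"{prompt}\n\nYour previous answer failed validation:\n"
            + "\n".join(f"- {f}" for f in failures)
            + "\n\nFix these issues and answer again."
        )
        output = call_model(correction_prompt)
    if validate(output):
        raise RuntimeError("Output failed validation after all attempts")
    return output
```

The design choice worth noting is that failures are surfaced as explicit text in the retry prompt, which is what turns the gate from a filter into a correction mechanism.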
What struggles without workflow structure
Implicit expectations
Without explicit structure, Haiku’s quality drops measurably. Tasks that rely on the model “knowing what you mean” without being told are where larger models have a genuine advantage. Structured workflow eliminates this gap by making all expectations explicit.
Schema assumptions
Both Haiku and Sonnet make incorrect assumptions about data schemas when not given explicit schema definitions. This is not a Haiku-specific weakness — it is a universal AI limitation that structured workflow mitigates equally for both model tiers.
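One way to mitigate this is to embed the schema directly in the prompt rather than letting the model guess. The sketch below uses an invented schema for an article database; the table and column names are hypothetical, not taken from the tested platform.

```python
import json

# Hypothetical schema for illustration only.
ARTICLE_SCHEMA = {
    "table": "articles",
    "columns": {
        "id": "integer, primary key",
        "title": "text",
        "source": "text, one of the RSS source names",
        "published_at": "ISO 8601 timestamp",
    },
}

def prompt_with_schema(task: str) -> str:
    """Prefix a task with an explicit schema so the model cannot
    invent columns or types."""
    return (
        f"{task}\n\nUse exactly this schema; do not assume other "
        f"columns exist:\n{json.dumps(ARTICLE_SCHEMA, indent=2)}"
    )
```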
Behavioural differences
| Aspect | Larger model (Sonnet 4.5) | Smaller model + workflow (Haiku 4.5) |
|---|---|---|
| Thinking detail | Moderate — sometimes abbreviates | High — executes every template step |
| Validation steps per task | 3-4 | 4-5 (more thorough) |
| Workflow compliance | Good (occasional shortcuts) | Perfect (100% compliance) |
| Edge case flagging | Moderate | High (more cautious) |
| Self-correction rate | Lower (fewer errors caught) | Higher (validation gates trigger fixes) |
| Reasoning transparency | Implicit leaps | Explicit step-by-step |
The key insight
Structure matters more than raw model capability for well-defined tasks.
This is the central finding. When the task is well-defined and the workflow is explicit, the model’s “intelligence ceiling” is less important than its ability to follow structured instructions reliably. Haiku’s perfect workflow compliance outweighs Sonnet’s superior implicit reasoning for any task where the workflow can be made explicit.
This does not mean smaller models are universally better. It means that for structured, well-defined tasks — which is the majority of production workloads — investing in workflow design pays higher returns than investing in larger models.
Practical implications
- Use smaller models for structured, well-defined tasks. If you can define the task explicitly with clear inputs, expected outputs, and validation criteria, a smaller model with a structured workflow will likely match or beat a larger model.
- Always provide explicit workflow structure. Never rely on the model “figuring out” what you want. Make thinking steps, validation criteria, and output format explicit.
- Validation gates are critical. They are not overhead — they are the mechanism that makes smaller models viable. Removing validation gates to save tokens will cost more in retries and quality failures.
- Cost savings scale linearly with query volume: the 67% per-query saving applies to every query, so total savings grow in direct proportion to load.
Projected cost savings
Assuming average token usage of 2,000 input tokens and 1,000 output tokens per query:
| Daily queries | Sonnet 4.5 daily cost | Haiku 4.5 daily cost | Daily savings | Monthly savings |
|---|---|---|---|---|
| 100 | $2.10 | $0.70 | $1.40 | $42 |
| 1,000 | $21.00 | $7.00 | $14.00 | $420 |
| 10,000 | $210.00 | $70.00 | $140.00 | $4,200 |
These figures reflect token cost only. When factoring in the higher success rate (86% vs 71%), the effective savings increase further because fewer retries are needed.
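The table above can be reproduced from the published token prices. The 2,000/1,000 token counts per query are the stated assumption; the 30-day month is inferred from the $42 monthly figure. The success-rate adjustment at the end is a simple retry model (one full retry per failed query), which is an assumption, not a measured result.

```python
PRICES = {  # USD per 1M tokens: (input, output)
    "sonnet-4.5": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}

def cost_per_query(model: str, in_tokens: int = 2_000, out_tokens: int = 1_000) -> float:
    """Token cost of a single query at the published per-1M rates."""
    in_price, out_price = PRICES[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

def daily_cost(model: str, queries: int) -> float:
    return cost_per_query(model) * queries

def effective_daily_cost(model: str, queries: int, success_rate: float) -> float:
    """Adjust for retries: each failure is assumed to cost one extra query."""
    return daily_cost(model, queries) / success_rate
```

At 100 queries per day this gives $2.10 for Sonnet and $0.70 for Haiku, matching the first row of the table, and dividing by the 86% vs 71% success rates widens the gap further.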
Limitations
- Sample size: 7-query test suite. Directionally strong but not statistically robust.
- Single domain: Content processing tasks only. Code generation, creative writing, and other domains were not tested.
- Model versions: Claude Haiku 4.5 and Claude Sonnet 4.5 as available in October 2025. Model updates may change the dynamics.
- Quality assessment: Scores assigned by a single evaluator on a subjective 1-5 scale.
- Workflow dependency: Results are specific to Jimmy’s Workflow v2.1. Other structured workflow systems may produce different outcomes.
These findings inform architecture decisions. They are not a substitute for testing with your own workloads and quality criteria.